Thank for you reporting this issue. I think this is a bug regarding 
integer overflow. IMHO, it would be good to compute hashes with Long.

Would it be possible to create a JIRA entry?  Do you want to submit a pull 
request, too?

Regards,
Kazuaki Ishizaki



From:   jiayuanm <jiayuanm...@gmail.com>
To:     dev@spark.apache.org
Date:   2018/07/07 10:36
Subject:        [SPARK ML] Minhash integer overflow



Hi everyone,

I was playing around with LSH/Minhash module from spark ml module. I 
noticed
that hash computation is done with Int (see
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69
).
Since "a" and "b" are from a uniform distribution of [1,
MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue,
it's likely for the multiplication to cause Int overflow with a large 
sparse
input vector.

I wonder if this is a bug or intended. If it's a bug, one way to fix it is
to compute hashes with Long and insert a couple of mod
MinHashLSH.HASH_PRIME. Because MinHashLSH.HASH_PRIME is chosen to be 
smaller
than sqrt(2^63 - 1), this won't overflow 64-bit integer. Another option is
to use BigInteger.

Let me know what you think.

Thanks,
Jiayuan





--
Sent from: 
http://apache-spark-developers-list.1001551.n3.nabble.com/


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Reply via email to