Re: [SPARK ML] Minhash integer overflow
Of course, the hash value can just be negative. I thought that it would be computed without overflow, though. When I checked another implementation, it performs the computation with int: https://github.com/ALShum/MinHashLSH/blob/master/LSH.java#L89

Jiayuan, did you compare the hash values generated by Spark with those generated by other implementations?

Regards,
Kazuaki Ishizaki

From: Sean Owen
To: jiayuanm
Cc: dev@spark.apache.org
Date: 2018/07/07 15:46
Subject: Re: [SPARK ML] Minhash integer overflow
Re: [SPARK ML] Minhash integer overflow
I think it probably still does its job; the hash value can just be negative. It is likely to be very slightly biased, though. Because the intent doesn't seem to be to allow the overflow, it's worth changing to use longs for the calculation.

On Fri, Jul 6, 2018, 8:36 PM jiayuanm wrote:
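The point about negative hash values can be seen concretely. Below is a minimal sketch with hypothetical values (the coefficient and index are made up for illustration; HASH_PRIME is the prime Spark's MinHashLSH uses, close to Int.MaxValue) showing how the Int multiplication wraps around before the mod, while the same computation in Long stays in range:

```scala
object OverflowDemo {
  // Prime used by Spark's MinHashLSH (close to Int.MaxValue).
  val HashPrime = 2038074743

  def main(args: Array[String]): Unit = {
    val a = 2000000000 // hypothetical random coefficient in [1, HashPrime]
    val elem = 100000  // hypothetical non-zero index of a sparse input vector

    // Int arithmetic: (1 + elem) * a wraps around 2^31 before the mod,
    // so the resulting "hash" is negative.
    val intHash = ((1 + elem) * a) % HashPrime

    // Long arithmetic: the product fits comfortably in 64 bits, so the
    // result is a proper value in [0, HashPrime).
    val longHash = ((1L + elem) * a) % HashPrime

    println(s"Int: $intHash, Long: $longHash")
  }
}
```

Since the wrapped Int value has smaller magnitude than HashPrime, the JVM's `%` leaves it unchanged and negative, which is also why the bias Sean mentions is slight rather than catastrophic.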
Re: [SPARK ML] Minhash integer overflow
Sure. The JIRA ticket is here: https://issues.apache.org/jira/browse/SPARK-24754. I'll create the PR.
Re: [SPARK ML] Minhash integer overflow
Thank you for reporting this issue. I think this is a bug regarding integer overflow. IMHO, it would be good to compute the hashes with Long. Would it be possible to create a JIRA entry? Do you want to submit a pull request, too?

Regards,
Kazuaki Ishizaki

From: jiayuanm
To: dev@spark.apache.org
Date: 2018/07/07 10:36
Subject: [SPARK ML] Minhash integer overflow
[SPARK ML] Minhash integer overflow
Hi everyone,

I was playing around with the LSH/MinHash module from Spark ML. I noticed that the hash computation is done with Int (see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69). Since "a" and "b" are drawn from a uniform distribution over [1, MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue, the multiplication is likely to overflow Int for a large sparse input vector.

I wonder if this is a bug or intended. If it's a bug, one way to fix it is to compute the hashes with Long and insert a couple of mod MinHashLSH.HASH_PRIME operations. Because MinHashLSH.HASH_PRIME is chosen to be smaller than sqrt(2^63 - 1), this won't overflow a 64-bit integer. Another option is to use BigInteger.

Let me know what you think.

Thanks,
Jiayuan

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
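The proposed Long-based fix can be sketched as follows. This is a hypothetical standalone version, not the actual Spark patch: HashPrime is the prime Spark's MinHashLSH uses, and minHash mirrors the hash function in the linked source with the arithmetic lifted to Long and the extra mods inserted as described above.

```scala
object MinHashFix {
  // Prime used by Spark's MinHashLSH; it is chosen smaller than
  // sqrt(2^63 - 1), so the product of two values below it fits in a Long.
  val HashPrime = 2038074743L

  /** Hash of one set of non-zero indices for one (a, b) coefficient pair.
    * a and b are drawn uniformly from [1, HashPrime); all arithmetic is
    * done in Long, with a mod after the multiplication so no intermediate
    * value exceeds 63 bits. */
  def minHash(indices: Array[Int], a: Long, b: Long): Long =
    indices.map { i =>
      (((1L + i) * a) % HashPrime + b) % HashPrime
    }.min
}
```

With small coefficients this matches the un-overflowed Int computation, e.g. `minHash(Array(0, 5, 100), a = 7, b = 3)` returns 10; with coefficients near HashPrime and large indices, where the Int version would go negative, the result stays in [0, HashPrime).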