[ https://issues.apache.org/jira/browse/SPARK-36714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489258#comment-17489258 ]

zhengruifeng commented on SPARK-36714:
--------------------------------------

[~sheng_1992] Since you have investigated this issue, feel free to send a PR and
ping me on it.

> bugs in MinHashLSH
> ------------------
>
>                 Key: SPARK-36714
>                 URL: https://issues.apache.org/jira/browse/SPARK-36714
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: shengzhang
>            Priority: Minor
>
> This is about the MinHashLSH algorithm.
> To compute the similarity join of two DataFrames, DFA and DFB, I used the
> MinHashLSH approxSimilarityJoin function, but some pairs are missing from
> the result.
> The example in the documentation works fine:
> [https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance]
>
> But when the data is distributed across more than one node (for example, a
> Hive table), some pairs go missing. For instance, vectorA equals vectorB,
> yet the pair does not appear in the result of approxSimilarityJoin, even
> with a threshold greater than 1.
> I think the problem may be in this code:
> {code:java}
> // part 1: createRawLSHModel (from MinHashLSH.scala)
> override protected[ml] def createRawLSHModel(inputDim: Int): MinHashLSHModel = {
>   require(inputDim <= MinHashLSH.HASH_PRIME,
>     s"The input vector dimension $inputDim exceeds the threshold ${MinHashLSH.HASH_PRIME}.")
>   // The seeded Random is created once, on the driver, during fit().
>   val rand = new Random($(seed))
>   val randCoefs: Array[(Int, Int)] = Array.fill($(numHashTables)) {
>     (1 + rand.nextInt(MinHashLSH.HASH_PRIME - 1), rand.nextInt(MinHashLSH.HASH_PRIME - 1))
>   }
>   new MinHashLSHModel(uid, randCoefs)
> }
>
> // part 2: hashFunction (from MinHashLSHModel)
> @Since("2.1.0")
> override protected[ml] val hashFunction: Vector => Array[Vector] = {
>   elems: Vector => {
>     require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
>     val elemsList = elems.toSparse.indices.toList
>     val hashValues = randCoefficients.map { case (a, b) =>
>       elemsList.map { elem: Int =>
>         ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME
>       }.min.toDouble
>     }
>     // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
>     hashValues.map(Vectors.dense(_))
>   }
> }
> {code}
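>
> To make the hash computation concrete, a standalone example of one hash
> value for one (a, b) pair; the indices and coefficients are made-up toy
> values, and HASH_PRIME is 2038074743 in the Spark source:
> {code:java}
> val HASH_PRIME = 2038074743      // MinHashLSH.HASH_PRIME
> val indices = List(0, 1, 2)      // non-zero indices of a sparse vector
> val (a, b) = (97, 13)            // one hypothetical coefficient pair
>
> val hashValue = indices.map { elem =>
>   ((1 + elem) * a + b) % HASH_PRIME   // 110, 207, 304 for these values
> }.min.toDouble                        // -> 110.0, the min-hash for this table
> println(hashValue)
> {code}
>
> Now compare how scala.util.Random behaves under repeated calls and repeated
> seeding: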
> {code:java}
> val r1 = new scala.util.Random(1)
> r1.nextInt(1000)  // -> 985
>
> val r2 = new scala.util.Random(2)
> r2.nextInt(1000)  // -> 108
>
> val r3 = new scala.util.Random(1)
> r3.nextInt(1000)  // -> 985, because it is seeded just like r1
> r3.nextInt(1000)  // -> 588, a different value on the second call
> {code}
> I think this is the reason: the Random instance is initialized only once,
> so successive calls to rand.nextInt() return different values, just as r3
> does above. So moving val rand = new Random($(seed)) from createRawLSHModel
> into hashFunction would be better: every worker would then initialize its
> own Random from the same seed, and every worker would derive the same
> coefficients.
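>
> As a hedged illustration of that suggestion (the helper, seed, and table
> count below are made up; this is not the actual patch): re-creating the
> seeded Random on each worker reproduces the same coefficient sequence
> everywhere.
> {code:java}
> import scala.util.Random
>
> // If each worker rebuilds Random from the same seed, the derived
> // coefficient pairs are identical on every worker.
> def coefficients(seed: Long, numHashTables: Int, prime: Int): Array[(Int, Int)] = {
>   val rand = new Random(seed)
>   Array.fill(numHashTables) {
>     (1 + rand.nextInt(prime - 1), rand.nextInt(prime - 1))
>   }
> }
>
> val prime = 2038074743                        // MinHashLSH.HASH_PRIME
> val onWorkerA = coefficients(42L, 3, prime)   // simulated worker A
> val onWorkerB = coefficients(42L, 3, prime)   // simulated worker B, same seed
> println(onWorkerA.sameElements(onWorkerB))    // -> true
> {code}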
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
