[ https://issues.apache.org/jira/browse/SPARK-36714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17489258#comment-17489258 ]
zhengruifeng commented on SPARK-36714:
--------------------------------------

[~sheng_1992] Since you have investigated this issue, feel free to send a PR and ping me in it.

> bugs in MinHashLSH
> ------------------
>
>                 Key: SPARK-36714
>                 URL: https://issues.apache.org/jira/browse/SPARK-36714
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: shengzhang
>            Priority: Minor
>
> This is about the MinHashLSH algorithm.
> To compute the similarity between dataframes DFA and DFB I used the MinHashLSH
> approxSimilarityJoin function, but some pairs are missing from the result.
> The example in the documentation works fine:
> https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance
> But when the data lives on a distributed system (a Hive table spread over more
> than one node), some pairs go missing. For example, vectorA = vectorB, yet the
> pair is absent from the result of approxSimilarityJoin even with a threshold
> greater than 1.
> I think the problem is in this code:
> {code:java}
> // part 1
> override protected[ml] def createRawLSHModel(inputDim: Int): MinHashLSHModel = {
>   require(inputDim <= MinHashLSH.HASH_PRIME,
>     s"The input vector dimension $inputDim exceeds the threshold ${MinHashLSH.HASH_PRIME}.")
>   val rand = new Random($(seed))
>   val randCoefs: Array[(Int, Int)] = Array.fill($(numHashTables)) {
>     (1 + rand.nextInt(MinHashLSH.HASH_PRIME - 1),
>       rand.nextInt(MinHashLSH.HASH_PRIME - 1))
>   }
>   new MinHashLSHModel(uid, randCoefs)
> }
>
> // part 2
> @Since("2.1.0")
> override protected[ml] val hashFunction: Vector => Array[Vector] = {
>   elems: Vector => {
>     require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
>     val elemsList = elems.toSparse.indices.toList
>     val hashValues = randCoefficients.map { case (a, b) =>
>       elemsList.map { elem: Int =>
>         ((1 + elem) * a + b) % MinHashLSH.HASH_PRIME
>       }.min.toDouble
>     }
>     // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
>     hashValues.map(Vectors.dense(_))
>   }
> }
> {code}
> A seeded Random is deterministic only from its starting point:
> val r1 = new scala.util.Random(1)
> r1.nextInt(1000) // -> 985
> val r2 = new scala.util.Random(2)
> r2.nextInt(1000) // -> 108
> val r3 = new scala.util.Random(1)
> r3.nextInt(1000) // -> 985, because r3 is seeded just like r1
> r3.nextInt(1000) // -> 588
> I think that is the reason: the Random is initialized only once, and
> random.nextInt() returns a different value on every subsequent call, like r3
> above (985, then 588).
> So it would be better to move
> val rand = new Random($(seed))
> from createRawLSHModel into hashFunction. Then every worker would initialize
> its own Random and every worker would get the same coefficients.

-- 
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
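To make the seeding argument above concrete, here is a standalone sketch of the scheme quoted in the issue. It is not Spark's actual code: the object name `MinHashSketch`, the method names, and the prime constant are stand-ins invented for illustration (Spark uses `MinHashLSH.HASH_PRIME`). It shows that coefficients drawn from a `Random` with a fixed seed are identical wherever they are generated, so equal index sets hashed with them must collide.

```scala
import scala.util.Random

// Illustrative stand-in for Spark's MinHashLSH; names are made up for this sketch.
object MinHashSketch {
  val HashPrime = 2038074743 // any large prime works for the sketch

  // Coefficients drawn once from a seeded Random, as in createRawLSHModel.
  def coefficients(seed: Long, numTables: Int): Array[(Int, Int)] = {
    val rand = new Random(seed)
    Array.fill(numTables) {
      (1 + rand.nextInt(HashPrime - 1), rand.nextInt(HashPrime - 1))
    }
  }

  // Min-hash of a set of non-zero indices, as in hashFunction.
  // Long arithmetic avoids Int overflow in (1 + elem) * a.
  def hash(indices: Seq[Int], coefs: Array[(Int, Int)]): Array[Double] =
    coefs.map { case (a, b) =>
      indices.map(i => ((1L + i) * a + b) % HashPrime).min.toDouble
    }

  def main(args: Array[String]): Unit = {
    // Same seed => identical coefficients, regardless of where they are built...
    val c1 = coefficients(seed = 1L, numTables = 3)
    val c2 = coefficients(seed = 1L, numTables = 3)
    println(c1.sameElements(c2))
    // ...so equal inputs hashed with them necessarily land in the same buckets.
    println(hash(Seq(0, 4, 7), c1).sameElements(hash(Seq(0, 4, 7), c2)))
  }
}
```

If equal vectors really do land in different buckets in a cluster run, the coefficients in use on the workers must have diverged somewhere; that is the situation the reporter's suggested change is meant to rule out.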