[ 
https://issues.apache.org/jira/browse/SPARK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080575#comment-18080575
 ] 

Vũ Trần Phúc commented on SPARK-36458:
--------------------------------------

Hi, I'd like to take on this issue.

*Problem:* Currently, {{approxSimilarityJoin}} and {{approxNearestNeighbors}} 
in {{LSHModel}} always reference {{inputCol}} to compute the exact distance 
via {{keyDistance}}. This forces users to keep the original feature column in 
their datasets even when they have already transformed them and only need the 
{{hashes}} column ({{outputCol}}), which defeats the purpose of pre-computing 
hashes to reduce data size.

*Proposed fix:*
 # In {{approxSimilarityJoin}}: check whether both {{datasetA}} and 
{{datasetB}} contain {{inputCol}}. If they do, use the original exact 
{{keyDistance}}. If not, fall back to an approximate distance using 
{{hashDistance}} on {{outputCol}}.
 # In {{approxNearestNeighbors}}: apply the same conditional. If {{inputCol}} 
exists, use {{keyDistance}}; otherwise use {{hashDistance}}.
 # For self-join: only call {{recreateCol}} on {{inputCol}} when it actually 
exists in the dataset.
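
To make the conditional in items 1 and 2 concrete, here is a rough sketch in plain Python (the actual change would be in the Scala {{LSHModel}}). Every helper name below is hypothetical and none of them is a Spark API. The mismatch-fraction function is only one possible approximate distance for MinHash signatures, used here as an assumption; it is not claimed to match what {{hashDistance}} computes today.

```python
from typing import Sequence, Set


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Exact Jaccard distance between two sets of active feature indices.

    Stands in for the exact keyDistance of MinHash in this sketch.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def hash_mismatch_fraction(h_a: Sequence[int], h_b: Sequence[int]) -> float:
    """Approximate distance: fraction of hash tables whose values differ.

    Assumption for illustration only, not Spark's hashDistance. For MinHash,
    two signatures agree in one table with probability equal to the Jaccard
    similarity, so the mismatch fraction estimates the Jaccard distance.
    """
    if len(h_a) != len(h_b) or len(h_a) == 0:
        raise ValueError("signatures must be non-empty and of equal length")
    mismatches = sum(1 for x, y in zip(h_a, h_b) if x != y)
    return mismatches / len(h_a)


def pick_distance_col(
    columns_a: Sequence[str],
    columns_b: Sequence[str],
    input_col: str,
    output_col: str,
) -> str:
    """Return the column the distance should be computed on.

    Prefer input_col (exact distance) when both datasets still carry it;
    otherwise fall back to output_col (approximate distance on hashes).
    """
    if input_col in columns_a and input_col in columns_b:
        return input_col
    if output_col in columns_a and output_col in columns_b:
        return output_col
    raise ValueError(
        f"both datasets need either '{input_col}' or '{output_col}'"
    )


if __name__ == "__main__":
    # Both datasets were reduced to ("id", "hashes"), as in the issue report.
    chosen = pick_distance_col(
        ["id", "hashes"], ["id", "hashes"], "features", "hashes"
    )
    print(chosen)
```

With this shape, datasets that keep {{inputCol}} behave exactly as before, and only datasets that dropped it take the approximate path. If just one side still has {{inputCol}}, the sketch also falls back to the hashes, since an exact distance needs the original vectors on both sides.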

> MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-36458
>                 URL: https://issues.apache.org/jira/browse/SPARK-36458
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.1
>            Reporter: Thai Thien
>            Priority: Minor
>
> Refer to the documentation and example code for MinHashLSH:
>  
> [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance]
> The example states:
> We could avoid computing hashes by passing in the already-transformed 
> dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
> However, inputCol is still required in transformedA and transformedB even if 
> they already have outputCol.
> The following code should work, but it does not:
>  
>  
> {code:python}
> from pyspark.ml.feature import MinHashLSH
> from pyspark.ml.linalg import Vectors
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col
>
> # Reuse the active session (already defined as `spark` in the PySpark shell).
> spark = SparkSession.builder.getOrCreate()
>
> dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
>          (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
>          (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
> dfA = spark.createDataFrame(dataA, ["id", "features"])
>
> dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
>          (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
>          (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
> dfB = spark.createDataFrame(dataB, ["id", "features"])
>
> key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
>
> mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
> model = mh.fit(dfA)
>
> # Keep only the id and the pre-computed hashes; drop the inputCol.
> transformedA = model.transform(dfA).select("id", "hashes")
> transformedB = model.transform(dfB).select("id", "hashes")
>
> # Fails because inputCol ("features") is missing from both datasets.
> model.approxSimilarityJoin(transformedA, transformedB, 0.6,
>                            distCol="JaccardDistance")\
>     .select(col("datasetA.id").alias("idA"),
>             col("datasetB.id").alias("idB"),
>             col("JaccardDistance")).show()
> {code}
> As the code shows, I discard the `features` column but keep the `hashes` 
> column, which is the output data. 
> approxSimilarityJoin should work on `hashes` (the outputCol) alone when it 
> exists, and ignore the absence of `features` (the inputCol).
> Being able to transform the data beforehand and drop inputCol can make the 
> input data much smaller and prevents confusion about the tip "_We could avoid 
> computing hashes by passing in the already-transformed dataset_".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
