[
https://issues.apache.org/jira/browse/SPARK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080575#comment-18080575
]
Vũ Trần Phúc commented on SPARK-36458:
--------------------------------------
Hi, I'd like to take on this issue.
*Problem:* Currently, {{approxSimilarityJoin}} and {{approxNearestNeighbors}}
in {{LSHModel}} always reference {{inputCol}} to compute exact distance via
{{keyDistance}}. This forces users to keep the original feature column in
their datasets, even when they have already transformed and only need the
{{hashes}} (outputCol) — defeating the purpose of pre-computing hashes to
reduce data size.
*Proposed fix:*
# In {{approxSimilarityJoin}}: check if both {{datasetA}} and {{datasetB}}
contain {{inputCol}}. If yes, use the original exact {{keyDistance}}.
If not, fall back to approximate distance using {{hashDistance}} on
{{outputCol}}.
# In {{approxNearestNeighbors}}: same conditional. If {{inputCol}}
exists, use {{keyDistance}}; otherwise use {{hashDistance}}.
# For self-join: only call {{recreateCol}} on {{inputCol}} when it actually
exists in the dataset.
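As a rough illustration of the fallback described above, here is a plain-Python sketch (not Spark code). The helper names {{pick_distance_col}}, {{jaccard_key_distance}} and {{min_hash_distance}} are hypothetical and exist only for this example. The hash-distance definition is an assumption about how {{hashDistance}} behaves for MinHash (per hash table, count differing entries, then take the minimum) and should be checked against the actual {{MinHashLSHModel}} source.

```python
from typing import Sequence, Set


def jaccard_key_distance(a: Set[int], b: Set[int]) -> float:
    """Exact Jaccard distance between two sets of non-zero feature indices."""
    union = a | b
    if not union:
        raise ValueError("both sets are empty")
    return 1.0 - len(a & b) / len(union)


def min_hash_distance(x: Sequence[Sequence[float]],
                      y: Sequence[Sequence[float]]) -> float:
    """Approximate distance computed from hashes only (assumed definition):
    for each hash table count the entries that differ, then take the minimum."""
    return float(min(
        sum(1 for p, q in zip(hx, hy) if p != q)
        for hx, hy in zip(x, y)
    ))


def pick_distance_col(columns_a: Sequence[str], columns_b: Sequence[str],
                      input_col: str, output_col: str) -> str:
    """Return the column the distance would be computed on.

    Exact distance on input_col when both datasets still carry it,
    otherwise approximate distance on output_col.
    """
    if input_col in columns_a and input_col in columns_b:
        return input_col
    if output_col in columns_a and output_col in columns_b:
        return output_col
    raise ValueError(
        f"datasets must contain either '{input_col}' or '{output_col}'")


# Datasets that kept only "id" and "hashes" would take the hash-based path.
print(pick_distance_col(["id", "hashes"], ["id", "hashes"],
                        "features", "hashes"))
```

One consequence worth noting: under this assumed definition the hash-based value is not a Jaccard distance, so the meaning of {{distCol}} and of the threshold would differ between the two paths.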
> MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
> ----------------------------------------------------------------------------
>
> Key: SPARK-36458
> URL: https://issues.apache.org/jira/browse/SPARK-36458
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.1.1
> Reporter: Thai Thien
> Priority: Minor
>
> See the documentation and example code for MinHashLSH:
>
> [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance]
> The example states that:
> We could avoid computing hashes by passing in the already-transformed
> dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
> However, inputCol is still required in transformedA and transformedB even if
> they already have outputCol.
> The following code should work, but it doesn't:
>
>
> {code:python}
> from pyspark.ml.feature import MinHashLSH
> from pyspark.ml.linalg import Vectors
> from pyspark.sql.functions import col
>
> # Assumes an active SparkSession named `spark` (e.g. the pyspark shell).
> dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
>          (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
>          (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
> dfA = spark.createDataFrame(dataA, ["id", "features"])
> dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
>          (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
>          (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
> dfB = spark.createDataFrame(dataB, ["id", "features"])
> key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
> mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
> model = mh.fit(dfA)
> # Keep only the id and the pre-computed hashes; drop the "features" inputCol.
> transformedA = model.transform(dfA).select("id", "hashes")
> transformedB = model.transform(dfB).select("id", "hashes")
> # This call does not work because "features" is missing, even though "hashes" is present.
> model.approxSimilarityJoin(transformedA, transformedB, 0.6,
>                            distCol="JaccardDistance")\
>     .select(col("datasetA.id").alias("idA"),
>             col("datasetB.id").alias("idB"),
>             col("JaccardDistance")).show()
> {code}
> In the code above, I discard the `features` column but keep the `hashes`
> column, which is the output data.
> approxSimilarityJoin should work on `hashes` (the outputCol) alone when it
> exists, and ignore the absence of `features` (the inputCol).
> Being able to transform the data beforehand and remove inputCol can make the
> input data much smaller and prevent confusion about the tip "_We could avoid
> computing hashes by passing in the already-transformed dataset_".