[spark] branch master updated: [SPARK-26315][PYSPARK] auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel

srowen Sat, 15 Dec 2018 06:43:33 -0800

This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 860f449  [SPARK-26315][PYSPARK] auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel
860f449 is described below

commit 860f4497f2a59b21d455ec8bfad9ae15d2fd4d2e
Author: Jing Chen He <jin...@us.ibm.com>
AuthorDate: Sat Dec 15 08:41:16 2018 -0600

    [SPARK-26315][PYSPARK] auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel
    
    ## What changes were proposed in this pull request?
    
    If the 'threshold' argument to approxSimilarityJoin is not a float (for example, a Python int), the call throws an exception. The fix converts 'threshold' to a float before calling the Java implementation method.
    
    ## How was this patch tested?
    
    Added a new test case. Without this fix, the test throws the exception reported in the JIRA; with the fix, the test passes.
    
    Closes #23313 from jerryjch/SPARK-26315.
    
    Authored-by: Jing Chen He <jin...@us.ibm.com>
    Signed-off-by: Sean Owen <sean.o...@databricks.com>
---
 python/pyspark/ml/feature.py | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index c9507c2..08ae582 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -192,6 +192,7 @@ class LSHModel(JavaModel):
                  "datasetA" and "datasetB", and a column "distCol" is added to show the distance
                  between each pair.
         """
+        threshold = TypeConverters.toFloat(threshold)
         return self._call_java("approxSimilarityJoin", datasetA, datasetB, threshold, distCol)
 
 
@@ -239,6 +240,16 @@ class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, HasOutp
     |  3|  6| 2.23606797749979|
     +---+---+-----------------+
     ...
+    >>> model.approxSimilarityJoin(df, df2, 3, distCol="EuclideanDistance").select(
+    ...     col("datasetA.id").alias("idA"),
+    ...     col("datasetB.id").alias("idB"),
+    ...     col("EuclideanDistance")).show()
+    +---+---+-----------------+
+    |idA|idB|EuclideanDistance|
+    +---+---+-----------------+
+    |  3|  6| 2.23606797749979|
+    +---+---+-----------------+
+    ...
     >>> brpPath = temp_path + "/brp"
     >>> brp.save(brpPath)
     >>> brp2 = BucketedRandomProjectionLSH.load(brpPath)
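
The one-line fix in the diff above normalizes 'threshold' on the Python side before the value is handed to the JVM. The sketch below illustrates that pattern without requiring Spark. The names to_float and approx_similarity_join_args are hypothetical stand-ins, not pyspark APIs, and to_float only approximates what TypeConverters.toFloat is assumed to do (accept plain numeric values, reject everything else).

```python
def to_float(value):
    """Hypothetical stand-in for TypeConverters.toFloat (simplified)."""
    # bool is a subclass of int in Python, so exclude it explicitly;
    # only plain ints and floats are accepted in this sketch.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    raise TypeError("Could not convert %r to float" % (value,))


def approx_similarity_join_args(dataset_a, dataset_b, threshold, dist_col="distCol"):
    """Hypothetical helper: build the argument tuple that a wrapper
    would forward to the Java method, with threshold coerced first."""
    threshold = to_float(threshold)
    return (dataset_a, dataset_b, threshold, dist_col)


# An integer threshold is converted before it would cross into the JVM.
args = approx_similarity_join_args("dfA", "dfB", 3)
print(type(args[2]).__name__, args[2])  # prints "float 3.0"
```

Converting in the Python wrapper keeps the Java signature unchanged, and a non-numeric threshold fails early with a Python TypeError rather than inside the Java call.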


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
