Joseph K. Bradley created SPARK-23469:
-----------------------------------------

             Summary: HashingTF should use corrected MurmurHash3 implementation
                 Key: SPARK-23469
                 URL: https://issues.apache.org/jira/browse/SPARK-23469
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.4.0
            Reporter: Joseph K. Bradley


[SPARK-23381] added a corrected MurmurHash3 implementation but left the old 
implementation in place.  In Spark 2.3 and earlier, HashingTF uses the old 
implementation.  (We should not backport a fix for HashingTF, since it would be 
a major change of behavior.)  But we should correct HashingTF in Spark 2.4; 
this JIRA tracks that fix:
* Update HashingTF to use the new implementation of MurmurHash3.
* Ensure backwards compatibility for ML persistence by having HashingTF use the 
old MurmurHash3 when a model from Spark 2.3 or earlier is loaded.  We can add a 
Param to allow this.
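
To make the two items above concrete, here is a rough Python sketch (not 
Spark code) of hashing a term to a feature index with a selectable hash 
function.  Everything named here is hypothetical: the function names, the 
use_legacy flag, the version check, and the seed of 42 are assumptions for 
illustration.  The "legacy" variant reflects my understanding that the old 
implementation mixes each trailing byte as its own sign-extended block, 
whereas standard MurmurHash3 (x86, 32-bit) packs the trailing bytes into one 
block; please verify against the Spark source before relying on it.

```python
MASK32 = 0xFFFFFFFF
C1, C2 = 0xCC9E2D51, 0x1B873593


def _rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK32


def _mix_k1(k1):
    k1 = (k1 * C1) & MASK32
    return (_rotl32(k1, 15) * C2) & MASK32


def _mix_h1(h1, k1):
    h1 = _rotl32(h1 ^ k1, 13)
    return (h1 * 5 + 0xE6546B64) & MASK32


def _fmix(h1, length):
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK32
    return h1 ^ (h1 >> 16)


def _hash_aligned(data, seed):
    # Mix all complete 4-byte little-endian blocks; both variants share this.
    aligned = len(data) - len(data) % 4
    h1 = seed & MASK32
    for i in range(0, aligned, 4):
        h1 = _mix_h1(h1, _mix_k1(int.from_bytes(data[i:i + 4], "little")))
    return h1, aligned


def murmur3_standard(data, seed=0):
    # Standard MurmurHash3 x86_32: trailing bytes are packed into one block.
    h1, aligned = _hash_aligned(data, seed)
    k1 = 0
    for pos, b in enumerate(data[aligned:]):
        k1 ^= b << (8 * pos)
    h1 ^= _mix_k1(k1)
    return _fmix(h1, len(data))


def murmur3_legacy_tail(data, seed=0):
    # ASSUMPTION: old-style tail handling, where each trailing byte is
    # sign-extended (like a Java byte) and mixed as a full block.
    h1, aligned = _hash_aligned(data, seed)
    for b in data[aligned:]:
        signed = b - 256 if b >= 128 else b
        h1 = _mix_h1(h1, _mix_k1(signed & MASK32))
    return _fmix(h1, len(data))


def uses_legacy_hash(saved_spark_version):
    # Hypothetical gate: models saved before 2.4 keep the old hash.
    major, minor = (int(p) for p in saved_spark_version.split(".")[:2])
    return (major, minor) < (2, 4)


def term_index(term, num_features, use_legacy=False, seed=42):
    # Hash the UTF-8 bytes, reinterpret as a signed 32-bit int, then take a
    # non-negative modulus (Python's % is non-negative for a positive modulus).
    hash_fn = murmur3_legacy_tail if use_legacy else murmur3_standard
    h = hash_fn(term.encode("utf-8"), seed)
    signed = h - (1 << 32) if h >= (1 << 31) else h
    return signed % num_features
```

With this shape, a loader could call uses_legacy_hash on the version recorded 
with the saved model and pass the result as use_legacy, so that feature 
indices computed by an old model stay stable.  The two variants agree 
whenever the input length is a multiple of 4 bytes, because they differ only 
in how they treat the tail.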

Also, spark.ml's HashingTF still calls into the old 
spark.mllib.feature.HashingTF, so I recommend we first migrate that code to 
spark.ml.  We can then leave spark.mllib alone and fix only the MurmurHash3 
usage in spark.ml.  I will link a JIRA for this migration.



