[ 
https://issues.apache.org/jira/browse/SPARK-23469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23469:
------------------------------
    Docs Text: In Spark 3.0, the HashingTF Transformer uses a corrected 
implementation of the murmur3 hash function to map elements to vector 
positions. A HashingTF created with Spark 3.0 will map elements to different 
positions in vectors than one created with Spark 2.x. However, a HashingTF 
created with Spark 2.x and loaded with Spark 3.0 will still use the previous 
hash function and will not change behavior.
       Labels: release-notes  (was: )

> HashingTF should use corrected MurmurHash3 implementation
> ---------------------------------------------------------
>
>                 Key: SPARK-23469
>                 URL: https://issues.apache.org/jira/browse/SPARK-23469
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.4.0
>            Reporter: Joseph K. Bradley
>            Priority: Major
>              Labels: release-notes
>
> [SPARK-23381] added a corrected MurmurHash3 implementation but left the old 
> implementation alone.  In Spark 2.3 and earlier, HashingTF will use the old 
> implementation.  (We should not backport a fix for HashingTF since it would 
> be a major change of behavior.)  But we should correct HashingTF in Spark 
> 2.4; this JIRA is for tracking this fix.
> * Update HashingTF to use new implementation of MurmurHash3
> * Ensure backwards compatibility for ML persistence by having HashingTF use 
> the old MurmurHash3 when a model from Spark 2.3 or earlier is loaded.  We can 
> add a Param to allow this.
> Also, HashingTF still calls into the old spark.mllib.feature.HashingTF, so I 
> recommend we first migrate the code to spark.ml: [SPARK-21748].  We can leave 
> spark.mllib alone and just fix MurmurHash3 in spark.ml.
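
For readers who want to see why a corrected hash moves elements to different
vector positions, here is a minimal Python sketch. It is not Spark code. It
assumes the old implementation differs from standard MurmurHash3 (x86, 32-bit)
only in how it handles the trailing 1-3 bytes of the input (mixing each byte
as a full block), that terms are hashed as UTF-8 bytes with seed 42, and that
the index is a non-negative modulo of the signed 32-bit hash. The names
murmur3_legacy_tail, term_index and pick_hash are hypothetical and exist only
for this illustration.

```python
MASK32 = 0xFFFFFFFF
C1, C2 = 0xCC9E2D51, 0x1B873593


def _rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK32


def _mix_k1(k1):
    k1 = (k1 * C1) & MASK32
    k1 = _rotl32(k1, 15)
    return (k1 * C2) & MASK32


def _mix_h1(h1, k1):
    h1 ^= k1
    h1 = _rotl32(h1, 13)
    return (h1 * 5 + 0xE6546B64) & MASK32


def _fmix(h1, length):
    # Standard MurmurHash3 finalization (avalanche) step.
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK32
    return h1 ^ (h1 >> 16)


def _hash_blocks(data, seed):
    # Mix all complete 4-byte little-endian blocks; return state and offset.
    h1 = seed & MASK32
    aligned = len(data) - len(data) % 4
    for i in range(0, aligned, 4):
        h1 = _mix_h1(h1, _mix_k1(int.from_bytes(data[i:i + 4], "little")))
    return h1, aligned


def murmur3_standard(data, seed=0):
    """Standard MurmurHash3 x86_32: the tail is folded into one partial block."""
    h1, aligned = _hash_blocks(data, seed)
    tail = data[aligned:]
    if tail:
        h1 ^= _mix_k1(int.from_bytes(tail, "little"))
    return _fmix(h1, len(data))


def murmur3_legacy_tail(data, seed=0):
    """ASSUMED model of the old behavior: each tail byte is sign-extended
    and mixed as if it were a full block."""
    h1, aligned = _hash_blocks(data, seed)
    for b in data[aligned:]:
        signed = b - 256 if b >= 128 else b
        h1 = _mix_h1(h1, _mix_k1(signed & MASK32))
    return _fmix(h1, len(data))


def term_index(term, num_features, hash_fn, seed=42):
    # seed=42 and the signed non-negative modulo are assumptions of this sketch.
    h = hash_fn(term.encode("utf-8"), seed)
    signed = h - (1 << 32) if h & 0x80000000 else h
    return signed % num_features  # Python's % is non-negative for n > 0


def pick_hash(saved_with_version):
    # Hypothetical load-time rule: keep the old hash for pre-3.0 models.
    major = int(saved_with_version.split(".")[0])
    return murmur3_legacy_tail if major < 3 else murmur3_standard


if __name__ == "__main__":
    n = 1 << 18
    for term in ["test", "abc", "spark"]:
        old = term_index(term, n, pick_hash("2.4.0"))
        new = term_index(term, n, pick_hash("3.0.0"))
        print(term, old, new)
```

Under these assumptions the two variants agree whenever the encoded term
length is a multiple of 4 bytes (there is no tail to treat differently) and
otherwise generally disagree. That is consistent with the note above that
loaded Spark 2.x models must keep the previous function to preserve behavior.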



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
