Joseph K. Bradley created SPARK-23469:
-----------------------------------------

             Summary: HashingTF should use corrected MurmurHash3 implementation
                 Key: SPARK-23469
                 URL: https://issues.apache.org/jira/browse/SPARK-23469
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.4.0
            Reporter: Joseph K. Bradley


[SPARK-23381] added a corrected MurmurHash3 implementation but left the old implementation alone.  In Spark 2.3 and earlier, HashingTF will use the old implementation.  (We should not backport a fix for HashingTF since it would be a major change of behavior.)  But we should correct HashingTF in Spark 2.4; this JIRA is for tracking this fix.
* Update HashingTF to use new implementation of MurmurHash3
* Ensure backwards compatibility for ML persistence by having HashingTF use the old MurmurHash3 when a model from Spark 2.3 or earlier is loaded.  We can add a Param to allow this.

Also, HashingTF still calls into the old spark.mllib.feature.HashingTF, so I recommend we first migrate the code to spark.ml.  We can leave spark.mllib alone and just fix MurmurHash3 in spark.ml.  I will link a JIRA for this migration.
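
To make the compatibility concern concrete, here is a rough, self-contained Python sketch (not Spark code) of the two behaviors and of a version-gated choice between them. It rests on assumptions that are not stated in this issue: that the old routine differs from standard MurmurHash3 (x86, 32-bit) only in how it mixes the trailing 1 to 3 bytes, that terms are hashed as UTF-8 with seed 42, and that the saved Spark version decides which hash to use. The names murmur3_standard, murmur3_legacy, pick_hash and term_index are hypothetical and exist only for this illustration.

```python
MASK = 0xFFFFFFFF  # keep all arithmetic in 32 bits


def _rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK


def _mix_k1(k1):
    k1 = (k1 * 0xCC9E2D51) & MASK
    k1 = _rotl(k1, 15)
    return (k1 * 0x1B873593) & MASK


def _mix_h1(h1, k1):
    h1 = _rotl(h1 ^ k1, 13)
    return (h1 * 5 + 0xE6546B64) & MASK


def _fmix(h1, length):
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK
    h1 ^= h1 >> 16
    return h1


def _hash_aligned(data, seed):
    """Mix the 4-byte little-endian blocks; both variants share this step."""
    h1 = seed & MASK
    aligned = len(data) - len(data) % 4
    for i in range(0, aligned, 4):
        h1 = _mix_h1(h1, _mix_k1(int.from_bytes(data[i:i + 4], "little")))
    return h1, aligned


def murmur3_standard(data, seed):
    """Standard MurmurHash3 x86_32: tail bytes are packed into one partial block."""
    h1, aligned = _hash_aligned(data, seed)
    k1 = 0
    for shift, b in enumerate(data[aligned:]):
        k1 |= b << (8 * shift)
    h1 ^= _mix_k1(k1)  # no-op when there is no tail, since _mix_k1(0) == 0
    return _fmix(h1, len(data))


def murmur3_legacy(data, seed):
    """Assumed legacy behavior: each tail byte (sign-extended) is mixed as a full block."""
    h1, aligned = _hash_aligned(data, seed)
    for b in data[aligned:]:
        signed = b - 256 if b >= 128 else b
        h1 = _mix_h1(h1, _mix_k1(signed & MASK))
    return _fmix(h1, len(data))


def pick_hash(saved_with_version):
    """Hypothetical rule: models saved by Spark 2.3 or earlier keep the legacy hash."""
    major, minor = (int(p) for p in saved_with_version.split(".")[:2])
    return murmur3_legacy if (major, minor) < (2, 4) else murmur3_standard


def term_index(term, num_features, saved_with_version, seed=42):
    """Map a term to a feature index; seed=42 is an assumption of this sketch."""
    h = pick_hash(saved_with_version)(term.encode("utf-8"), seed)
    signed = h - (1 << 32) if h >= (1 << 31) else h  # reinterpret as a signed 32-bit int
    return signed % num_features  # Python's % is non-negative for a positive modulus
```

Under these assumptions the two variants agree whenever the encoded term length is a multiple of 4 bytes and can differ otherwise, so term_index("hello", 1 << 18, "2.3.0") and term_index("hello", 1 << 18, "2.4.0") may land in different buckets. That is why a model loaded from an older version would need to keep the old hash to reproduce its original feature indices.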