[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10574:
----------------------------------
    Assignee: Yanbo Liang

> HashingTF should use MurmurHash3
> --------------------------------
>
>                 Key: SPARK-10574
>                 URL: https://issues.apache.org/jira/browse/SPARK-10574
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Simeon Simeonov
>            Assignee: Yanbo Liang
>            Priority: Critical
>              Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to