[ 
https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201378#comment-15201378
 ] 

Nick Pentreath commented on SPARK-13968:
----------------------------------------

[~yanboliang] Actually, I think this one is uncontroversial enough to start 
on, if you'd like to. It would be interesting to compare the old approach 
with MurmurHash3 in terms of (a) time and (b) hash collision rate.
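
A rough sketch of what such a comparison could look like, written in plain 
Python rather than against Spark itself. Everything here is an assumption 
made for illustration: java_string_hash emulates java.lang.String.hashCode 
as a stand-in for the old approach, murmur3_32 is a hand-written 
MurmurHash3 x86 32-bit, the seed of 0 and the UTF-8 encoding are arbitrary 
choices, and the helper names (collision_rate, compare) are hypothetical. 
Timings from pure Python say little about JVM performance, so only the 
collision numbers are likely to be indicative.

```python
import time

MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Reinterpret the low 32 bits of x as a signed (JVM-style) int."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def java_string_hash(s):
    """Emulate java.lang.String.hashCode (stand-in for the old approach)."""
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    h = 0
    for i in range(0, len(data), 2):
        h = (31 * h + ((data[i] << 8) | data[i + 1])) & MASK32
    return to_signed32(h)


def rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def _mix_k(k):
    """Per-block mixing step of MurmurHash3 x86 32-bit."""
    k = (k * 0xCC9E2D51) & MASK32
    k = rotl32(k, 15)
    return (k * 0x1B873593) & MASK32


def murmur3_32(data, seed=0):
    """MurmurHash3 x86 32-bit over bytes; returns an unsigned 32-bit int."""
    h = seed & MASK32
    n = len(data)
    body_end = n - (n % 4)
    for i in range(0, body_end, 4):
        h ^= _mix_k(int.from_bytes(data[i:i + 4], "little"))
        h = rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[body_end:]
    if tail:
        h ^= _mix_k(int.from_bytes(tail, "little"))
    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def murmur3_string_hash(s, seed=0):
    """Hash a string via its UTF-8 bytes (an assumption of this sketch)."""
    return to_signed32(murmur3_32(s.encode("utf-8"), seed))


def collision_rate(terms, hash_fn, num_buckets):
    """Fraction of distinct terms that share a bucket with another term."""
    distinct = set(terms)
    if not distinct:
        return 0.0
    # Python's % is non-negative for a positive modulus, even for negative hashes.
    buckets = {hash_fn(t) % num_buckets for t in distinct}
    return 1.0 - len(buckets) / len(distinct)


def compare(terms, num_buckets):
    """Measure (a) hashing time and (b) collision rate for both functions."""
    results = {}
    candidates = (("hashCode", java_string_hash), ("murmur3", murmur3_string_hash))
    for name, fn in candidates:
        start = time.perf_counter()
        for t in terms:
            fn(t)
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds": elapsed,
            "collision_rate": collision_rate(terms, fn, num_buckets),
        }
    return results


if __name__ == "__main__":
    # Synthetic feature names; real vocabularies may behave differently.
    terms = ["feature_%d" % i for i in range(100000)]
    for name, stats in compare(terms, 1 << 18).items():
        print(name, stats)
```

The bucket count (2^18 here) and the synthetic term list are placeholders; 
a meaningful comparison would use real feature names and the bucket counts 
people actually configure.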

> Use MurmurHash3 for hashing String features
> -------------------------------------------
>
>                 Key: SPARK-13968
>                 URL: https://issues.apache.org/jira/browse/SPARK-13968
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Nick Pentreath
>            Assignee: Yanbo Liang
>            Priority: Minor
>
> Feature hashing is typically done on strings, i.e. feature names. In the 
> case of raw feature indexes, either the string representation of the 
> numerical index is hashed, or the index is used "as-is" and not hashed.
> It is common to use a well-distributed hash function such as MurmurHash3. 
> This is the case in e.g. 
> [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
> Currently Spark's {{HashingTF}} uses the object's hash code. Consider using 
> MurmurHash3 instead (at least for {{String}}, which is the common case).
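
For readers unfamiliar with the technique described above, here is a 
minimal sketch of feature hashing with a pluggable hash function. It is 
not Spark's {{HashingTF}} implementation: hashing_tf and its parameters 
are hypothetical names, and java_string_hash only emulates 
java.lang.String.hashCode to mimic the "object's hash code" behaviour for 
strings. Swapping in another function via hash_fn is how a 
well-distributed hash such as MurmurHash3 would slot in.

```python
from collections import Counter


def java_string_hash(s):
    """Emulate java.lang.String.hashCode as a signed 32-bit int."""
    data = s.encode("utf-16-be")  # Java hashes UTF-16 code units
    h = 0
    for i in range(0, len(data), 2):
        h = (31 * h + ((data[i] << 8) | data[i + 1])) & 0xFFFFFFFF
    return h - (1 << 32) if h & 0x80000000 else h


def hashing_tf(terms, num_features, hash_fn=java_string_hash):
    """Map terms to a sparse term-frequency vector of size num_features.

    Returns {bucket_index: count}. Terms whose hashes land in the same
    bucket are summed together, which is where collisions show up.
    """
    counts = Counter()
    for term in terms:
        # Python's % is non-negative for a positive modulus, so negative
        # hash codes still map to a valid index.
        index = hash_fn(term) % num_features
        counts[index] += 1.0
    return dict(counts)


if __name__ == "__main__":
    print(hashing_tf(["spark", "ml", "spark", "hashing"], num_features=16))
```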



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

