[ 
https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202003#comment-15202003
 ] 

Joseph K. Bradley commented on SPARK-13968:
-------------------------------------------

I'm going to close this in favor of the older ticket.  I'll make the old ticket 
a subtask.  But I agree it'd be good to switch.

> Use MurmurHash3 for hashing String features
> -------------------------------------------
>
>                 Key: SPARK-13968
>                 URL: https://issues.apache.org/jira/browse/SPARK-13968
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Nick Pentreath
>            Assignee: Yanbo Liang
>            Priority: Minor
>
> Feature hashing is typically done on strings, i.e. feature names. For raw 
> feature indexes, either the string representation of the numerical index 
> is hashed, or the index is used "as-is" without hashing.
> It is common to use a well-distributed hash function such as MurmurHash3. 
> This is the case in e.g. 
> [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
> Currently Spark's {{HashingTF}} uses the object's hash code. We should look 
> at using MurmurHash3 instead (at least for {{String}}, which is the common 
> case).
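
For illustration only, here is a minimal pure-Python sketch of the idea in the description: hash the UTF-8 bytes of a feature name with the 32-bit MurmurHash3 (x86 variant) and reduce the result modulo the number of features. This is not Spark's implementation. The names murmur3_32, feature_index and hashing_tf, the default seed of 0, and the choice of UTF-8 encoding are all assumptions made for this example.

```python
from collections import Counter


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit variant, returned as an unsigned 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF

    def rotl(x: int, r: int) -> int:
        # 32-bit left rotation
        return ((x << r) | (x >> (32 - r))) & mask

    def mix_k(k: int) -> int:
        k = (k * c1) & mask
        k = rotl(k, 15)
        return (k * c2) & mask

    h = seed & mask
    n_blocks = len(data) // 4

    # Body: consume the input as 4-byte little-endian blocks.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        h ^= mix_k(k)
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        h ^= mix_k(int.from_bytes(tail, "little"))

    # Finalization: mix in the length, then avalanche the bits.
    h ^= len(data) & mask
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def feature_index(name: str, num_features: int, seed: int = 0) -> int:
    """Map a feature name to a column index (hypothetical helper)."""
    return murmur3_32(name.encode("utf-8"), seed) % num_features


def hashing_tf(terms, num_features: int = 1 << 18) -> dict:
    """Term frequencies keyed by hashed index (colliding terms are summed)."""
    return dict(Counter(feature_index(t, num_features) for t in terms))
```

Because the hash is computed from the encoded bytes rather than from a language-specific object hash code, the same feature name maps to the same index in any implementation that agrees on the encoding, the seed and the number of features.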



