[ 
https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200254#comment-15200254
 ] 

Nick Pentreath commented on SPARK-13968:
----------------------------------------

Sure, I will assign it to you. Before work starts on these tickets, though, 
I'd like to get some thoughts from [~mengxr] and [~josephkb] on this issue 
and on the umbrella ticket for feature hashing improvements, especially 
around the API and transformer behaviour.

> Use MurmurHash3 for hashing String features
> -------------------------------------------
>
>                 Key: SPARK-13968
>                 URL: https://issues.apache.org/jira/browse/SPARK-13968
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Nick Pentreath
>            Priority: Minor
>
> Feature hashing is typically done on strings, i.e. feature names. For raw 
> feature indexes, either the string representation of the numerical index 
> is hashed, or the index is used "as-is" and not hashed.
> It is common to use a well-distributed hash function such as MurmurHash3, 
> as is done in e.g. 
> [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
> Currently Spark's {{HashingTF}} uses the object's hash code. We should look 
> at using MurmurHash3 instead (at least for {{String}}, which is the common 
> case).
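
To make the proposal concrete, here is a minimal sketch of hashing string features with MurmurHash3, written in plain Python rather than against Spark's code. The function names ({{murmur3_32}}, {{hash_features}}), the seed of 0, the UTF-8 encoding of feature names, and the default of 2^18 buckets are all assumptions made for illustration; none of them is taken from {{HashingTF}} or from this ticket.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit variant); returns an unsigned 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF

    def rotl(x: int, r: int) -> int:
        # 32-bit left rotation
        return ((x << r) | (x >> (32 - r))) & mask

    h = seed & mask
    n_blocks = len(data) // 4

    # Body: mix each full 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = rotl(k, 15)
        k = (k * c2) & mask
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


def hash_features(terms, num_features=1 << 18):
    """Map string features to a sparse {index: count} term-frequency vector.

    Hypothetical helper: seed 0, UTF-8 encoding and the bucket count are
    illustrative assumptions, not Spark's behaviour.
    """
    counts = {}
    for term in terms:
        idx = murmur3_32(term.encode("utf-8")) % num_features
        counts[idx] = counts.get(idx, 0.0) + 1.0
    return counts


if __name__ == "__main__":
    print(hash_features(["spark", "hashing", "spark"], num_features=16))
```

Because the hash is computed from the encoded bytes of the feature name, the resulting index does not depend on any particular runtime's object hash code, which is the main appeal of the change proposed above.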


