[ https://issues.apache.org/jira/browse/SPARK-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199639#comment-15199639 ]
Yanbo Liang commented on SPARK-13968:
-------------------------------------

[~mlnick] Can I work on this?

> Use MurmurHash3 for hashing String features
> -------------------------------------------
>
>                 Key: SPARK-13968
>                 URL: https://issues.apache.org/jira/browse/SPARK-13968
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Nick Pentreath
>            Priority: Minor
>
> Typically feature hashing is done on strings, i.e. feature names (or in the
> case of raw feature indexes, either the string representation of the
> numerical index can be used, or the index used "as-is" and not hashed).
> It is common to use a well-distributed hash function such as MurmurHash3.
> This is the case in e.g.
> [Scikit-learn|http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher].
> Currently Spark's {{HashingTF}} uses the object's hash code. Look at using
> MurmurHash3 (at least for {{String}} which is the common case).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
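
For illustration only, here is a minimal sketch of the idea in the issue description: hash each string feature with MurmurHash3 (the 32-bit x86 variant) and take the result modulo the number of features to get a column index. This is not Spark's code and not the patch for this issue. The names `murmur3_32` and `hash_features`, the default seed of 0, the UTF-8 encoding of strings, and the table size in the usage lines are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable

_MASK = 0xFFFFFFFF  # keep all arithmetic in unsigned 32 bits


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & _MASK


def _mix_k(k: int) -> int:
    """Scramble one 4-byte (or partial) block before it is folded into the state."""
    k = (k * 0xCC9E2D51) & _MASK
    k = _rotl32(k, 15)
    return (k * 0x1B873593) & _MASK


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit; returns an unsigned 32-bit integer."""
    h = seed & _MASK
    n = len(data)
    nblocks = n // 4

    # Body: consume the input four bytes at a time, little-endian.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        h ^= _mix_k(k)
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & _MASK

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        h ^= _mix_k(int.from_bytes(tail, "little"))

    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & _MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & _MASK
    h ^= h >> 16
    return h


def hash_features(terms: Iterable[str], num_features: int, seed: int = 0) -> Dict[int, int]:
    """Map each term to a bucket in [0, num_features) and count occurrences."""
    counts: Counter = Counter()
    for term in terms:
        index = murmur3_32(term.encode("utf-8"), seed) % num_features
        counts[index] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical usage: a tiny document hashed into 16 buckets.
    print(hash_features(["spark", "mllib", "spark", "hashing"], num_features=16))
```

Unlike an object's hash code, the result here depends only on the bytes of the string and the seed, so the same feature name lands in the same bucket across processes and languages, provided both sides agree on the encoding and seed.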