[ https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Pentreath resolved SPARK-25412.
------------------------------------
    Resolution: Not A Bug

> FeatureHasher would change the value of output feature
> ------------------------------------------------------
>
>                 Key: SPARK-25412
>                 URL: https://issues.apache.org/jira/browse/SPARK-25412
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.1
>            Reporter: Vincent
>            Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on
> the hashed value is used to determine the vector index, and it is suggested
> to use a large integer value for the 'numFeatures' parameter.
> We found several issues with the current implementation:
> # The feature name cannot be recovered from its index after the FeatureHasher
> transform, for example when getting feature importances from a decision tree
> trained on FeatureHasher output.
> # When indices collide, which is very likely when 'numFeatures' is relatively
> small, the value at that index is changed to a new value (the sum of the
> current and old values).
> # To avoid collisions, 'numFeatures' has to be set to a large number, but the
> resulting highly sparse vector increases the computational complexity of
> model training.
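
The behavior described in the report can be reproduced with a minimal sketch of the hashing trick in plain Python. This is not Spark's code: the function name `hash_features` and the sample feature names are hypothetical, and `zlib.crc32` stands in for the hash function (Spark's FeatureHasher uses MurmurHash3). It only illustrates the index-by-modulo step and why colliding features end up summed.

```python
import zlib
from collections import defaultdict


def hash_features(features, num_features):
    """Map {feature_name: value} to a sparse {index: value} dict.

    The index is the hash of the feature name modulo num_features.
    Features that land on the same index have their values summed.
    """
    vector = defaultdict(float)
    for name, value in features.items():
        # Assumption: CRC32 is used here only for illustration.
        index = zlib.crc32(name.encode("utf-8")) % num_features
        vector[index] += value  # a collision adds to the existing value
    return dict(vector)


# Hypothetical input row with five numeric features.
features = {
    "age": 30.0,
    "height": 180.0,
    "weight": 75.0,
    "income": 5000.0,
    "score": 0.5,
}

# Five features in four buckets: at least one collision is guaranteed.
small = hash_features(features, num_features=4)

# A large index space makes collisions unlikely, at the cost of sparsity.
large = hash_features(features, num_features=2 ** 18)

print("small:", small)
print("large:", large)
```

With `num_features=4` at least two of the five features share an index, so their values are added together and the original names cannot be recovered from the index alone. Raising `num_features` reduces the chance of a collision but does not remove it, which is the trade-off the report points at.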