Vincent created SPARK-25412:
-------------------------------

             Summary: FeatureHasher would change the value of output feature
                 Key: SPARK-25412
                 URL: https://issues.apache.org/jira/browse/SPARK-25412
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.1
            Reporter: Vincent

In the current implementation of FeatureHasher.transform, a simple modulo on the hashed value is used to determine the vector index, so it is suggested to use a large integer value as the numFeatures parameter.

We found several issues with the current implementation:

# The feature name cannot be recovered from its index after the FeatureHasher transform, for example when reading feature importances from a decision tree trained on FeatureHasher output.
# When indices collide, which is quite likely when numFeatures is relatively small, the value stored at that index is replaced by a new value (the sum of the current and old values).
# To avoid collisions, numFeatures must be set to a large number, but the resulting highly sparse vectors increase the computational complexity of model training.
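
The collision behaviour described in the second point can be sketched in plain Python. This is only an illustration of the general hashing trick, not Spark's code: CRC32 stands in for the hash function Spark actually uses, and the names `bucket`, `hash_features`, and `sources` are hypothetical helpers introduced here. The `sources` map is likewise an assumption, showing the kind of index-to-name bookkeeping that the first point says is missing.

```python
import zlib
from collections import defaultdict


def bucket(name: str, num_features: int) -> int:
    """Map a feature name to a vector index with hash-then-modulo.

    CRC32 is a stand-in chosen for illustration; it is not the hash
    function used by Spark's FeatureHasher.
    """
    return zlib.crc32(name.encode("utf-8")) % num_features


def hash_features(row: dict, num_features: int):
    """Hash a row of numeric features into a sparse vector (index -> value).

    Values that land on the same index are summed, which is the behaviour
    the report describes. `sources` is a hypothetical reverse map
    (index -> feature names) kept only to make collisions visible.
    """
    vector = defaultdict(float)
    sources = defaultdict(list)
    for name, value in row.items():
        idx = bucket(name, num_features)
        vector[idx] += value       # colliding features are added together
        sources[idx].append(name)  # remember which names hit this index
    return dict(vector), dict(sources)


row = {"age": 35.0, "height": 180.0, "weight": 75.0, "income": 52000.0}

# Four features into two buckets: at least one index must hold a sum.
vector, sources = hash_features(row, num_features=2)
for idx in sorted(vector):
    print(idx, sources[idx], vector[idx])
```

With only two buckets, at least two of the four features share an index, so the original values can no longer be read back from the output vector. Raising `num_features` makes collisions less likely but never rules them out, and it widens the vector, which is the trade-off raised in the third point.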