[jira] [Commented] (SPARK-25365) a better way to handle vector index and sparsity in FeatureHasher implementation ?
[ https://issues.apache.org/jira/browse/SPARK-25365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608903#comment-16608903 ]

Hyukjin Kwon commented on SPARK-25365:
--------------------------------------

Questions should go to the mailing list; please see https://spark.apache.org/community.html. I believe you could get a better answer there.

> a better way to handle vector index and sparsity in FeatureHasher implementation ?
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25365
>                 URL: https://issues.apache.org/jira/browse/SPARK-25365
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.3.1
>            Reporter: Vincent
>            Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on the hashed value is used to determine the vector index, and it is suggested to use a large integer value as the numFeatures parameter.
>
> We found several issues with the current implementation:
> # A feature name cannot be recovered from its vector index after the FeatureHasher transform, e.g. when reading feature importances from a decision tree trained on FeatureHasher output.
> # When indices collide, which is likely when numFeatures is relatively small, the colliding features' values are summed into a single vector entry, so the stored value no longer corresponds to any single input feature.
> # To avoid collisions, numFeatures must be set to a large value, but the resulting highly sparse vectors increase the computational cost of model training.
>
> We are working on fixing these problems for our own business need; since this may or may not be an issue for others as well, we'd like to hear from the community.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
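The collision and sparsity trade-off described in the issue can be seen with a minimal pure-Python sketch of the hashing trick. This is NOT Spark's actual FeatureHasher implementation (which hashes with MurmurHash3 on the JVM); md5 is used here only as a deterministic, dependency-free stand-in, and the `hash_index`/`hash_features` helpers are hypothetical names for illustration.

```python
# Sketch of the hashing trick: index = hash(feature) % numFeatures.
# Illustration only -- Spark's FeatureHasher uses MurmurHash3, not md5.
import hashlib
from collections import defaultdict

def hash_index(feature_name: str, num_features: int) -> int:
    # A simple modulo on the hashed value determines the vector index,
    # which is why the mapping is not invertible (issue #1 in the report).
    h = int(hashlib.md5(feature_name.encode("utf-8")).hexdigest(), 16)
    return h % num_features

def hash_features(features: dict, num_features: int) -> dict:
    # Colliding indices are summed, so one slot may mix the values of
    # several distinct input features (issue #2 in the report).
    vec = defaultdict(float)
    for name, value in features.items():
        vec[hash_index(name, num_features)] += value
    return dict(vec)

features = {"f%d" % i: 1.0 for i in range(100)}

# 100 features into 10 buckets: at most 10 occupied slots, so collisions
# are guaranteed and per-feature values are merged.
small = hash_features(features, 10)
# A large numFeatures makes collisions rare but the vector very sparse
# (issue #3 in the report).
large = hash_features(features, 2 ** 20)
print(len(small), len(large))
```

With the small table, the occupied-slot count is far below the input feature count, confirming that values were summed together; the large table avoids most collisions at the cost of a vector dimension of 2^20.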
[jira] [Commented] (SPARK-25365) a better way to handle vector index and sparsity in FeatureHasher implementation ?
[ https://issues.apache.org/jira/browse/SPARK-25365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606746#comment-16606746 ]

Vincent commented on SPARK-25365:
---------------------------------

[~nick.pentre...@gmail.com] Thanks.