[ https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613160#comment-16613160 ]

Nick Pentreath commented on SPARK-25412:
----------------------------------------

(1) is by design. Feature hashing does not store the exact mapping from feature 
values to vector indices, so it is a one-way transform. Hashing gives you speed 
and requires almost no memory, but you give up the reverse mapping and accept 
the possibility of hash collisions.
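A minimal sketch of why the transform is one-way. Spark's FeatureHasher uses MurmurHash3; this illustration substitutes `zlib.crc32` as a stand-in hash, and the feature names and `num_features` value are made up for the example:

```python
import zlib

def hashed_index(feature_name, num_features):
    # Stand-in for Spark's MurmurHash3: map a feature name to a vector index.
    return zlib.crc32(feature_name.encode("utf-8")) % num_features

num_features = 8  # deliberately tiny so collisions are likely
features = ["user_id=42", "country=DE", "age=31", "device=ios"]
index_of = {f: hashed_index(f, num_features) for f in features}

# Many different names can land on the same index, and nothing stores the
# reverse map, so an index alone cannot be turned back into a feature name.
```

This is why feature importances reported against hashed indices cannot be traced back to the original feature names.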

(2) is again by design, for now. There are schemes in which the sign of the 
feature value is also determined by a hash function, so that in expectation 
colliding values cancel each other out. This may be added in future work.
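A sketch of that signed-hashing idea. This is a hypothetical scheme, not what Spark's FeatureHasher currently does, and again uses `zlib.crc32` as a stand-in hash:

```python
import zlib

def signed_hashed_feature(feature_name, num_features):
    # One hash picks the index; an independent hash bit picks the sign.
    data = feature_name.encode("utf-8")
    index = zlib.crc32(data) % num_features
    sign = 1.0 if zlib.crc32(b"sign:" + data) % 2 == 0 else -1.0
    return index, sign

# When two features collide on an index, their signed contributions add as
# +v1 - v2 (or -v1 + v2) about half the time, so collisions cancel in
# expectation instead of always summing.
idx, sign = signed_hashed_feature("country=DE", 8)
```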

The impact of hash collisions can be reduced by increasing the {{numFeatures}} 
parameter. The default is reasonable for small to medium feature dimensions 
but should be increased when working with very high-cardinality features.
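The effect of {{numFeatures}} on collisions can be seen directly. The sketch below (stand-in `zlib.crc32` hash, synthetic feature names) compares a small table against one the size of Spark's default of 2^18:

```python
import zlib

def collision_rate(names, num_features):
    # Fraction of features that share an index with at least one other feature.
    buckets = {}
    for n in names:
        buckets.setdefault(zlib.crc32(n.encode("utf-8")) % num_features, []).append(n)
    collided = sum(len(v) for v in buckets.values() if len(v) > 1)
    return collided / len(names)

names = [f"cat={i}" for i in range(1000)]
small = collision_rate(names, 2 ** 8)   # 256 buckets: heavy collisions
large = collision_rate(names, 2 ** 18)  # 262144 buckets: few collisions
```

The tradeoff the reporter raises in point 3 is visible here too: the larger table reduces collisions at the cost of a much higher-dimensional, sparser vector.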

 

I don't think this can be classed as a bug, as these are all design tradeoffs 
of using feature hashing.
> FeatureHasher would change the value of output feature
> ------------------------------------------------------
>
>                 Key: SPARK-25412
>                 URL: https://issues.apache.org/jira/browse/SPARK-25412
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.1
>            Reporter: Vincent
>            Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index; it is suggested to 
> use a large integer value as the numFeatures parameter.
> We found several issues with the current implementation: 
>  # The feature name cannot be recovered from its index after FeatureHasher 
> transforms the data, for example when reading feature importances from a 
> decision tree trained after a FeatureHasher.
>  # When indices collide, which is quite likely when 'numFeatures' is 
> relatively small, the feature value is replaced with a new value (the sum of 
> the current and old values).
>  # To avoid collisions, 'numFeatures' must be set to a large number, but the 
> resulting highly sparse vectors increase the computational complexity of 
> model training.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
