[ 
https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530597#comment-14530597
 ] 

Felix Neutatz commented on FLINK-1735:
--------------------------------------

Hi,

my group at Alexander's IMPRO3 course at TU-Berlin is currently implementing 
this. At the moment we have a running prototype which is based on the 
implementation of the StandardScaler. 

Is this the right approach, or should we implement it in the feature package, 
following the example of PolynomialBase?

Moreover, is there already test data for this scenario?

Another question is whether the result should be a SparseVector or a 
DenseVector. Or should we even implement a smart way to choose between them?

You can find the current prototype in my repository: 
https://github.com/FelixNeutatz/incubator-flink/blob/3582d1f858bab5d254267d427e29cff9559a7b8a/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/FeatureHasher.scala

> Add FeatureHasher to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Alexander Alexandrov
>              Labels: ML
>
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature 
> values. The hash of the feature value is used to calculate its index for a 
> vector entry. In order to mitigate possible collisions, a second hashing 
> function is used to calculate the sign for the update value which is added to 
> the vector entry. This way, it is likely that collisions will simply cancel 
> out.
> A feature hasher would also be helpful for NLP problems where it could be 
> used to vectorize bag of words or ngrams feature vectors.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]
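
For reference, the signed hashing trick described in the issue can be sketched 
in a few lines. This is only an illustration in Python, not the prototype's 
Scala code and not a Flink ML API: the names hash_features, _hash and 
num_features are made up for this example, the two hash functions are assumed 
to be salted MD5 digests, and returning a sparse index-to-value mapping is 
just one possible choice.

```python
import hashlib


def _hash(value: str, salt: str) -> int:
    """Deterministic non-negative integer hash of `value`, varied by `salt`."""
    digest = hashlib.md5((salt + value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(features, num_features=16):
    """Map an iterable of string features to a sparse {index: value} dict.

    Hypothetical helper: the first hash picks the vector index, the second
    hash picks the sign of the update added to that entry.
    """
    vector = {}
    for feature in features:
        index = _hash(feature, "index:") % num_features
        sign = 1.0 if _hash(feature, "sign:") % 2 == 0 else -1.0
        vector[index] = vector.get(index, 0.0) + sign
    # Entries whose colliding updates cancelled to zero are dropped,
    # so the result stays sparse.
    return {i: v for i, v in vector.items() if v != 0.0}
```

A call such as hash_features("the quick brown fox".split(), num_features=8) 
returns at most four non-zero entries with indices between 0 and 7. Whether 
such a result is better stored as a SparseVector or a DenseVector depends on 
how many distinct features there are relative to num_features.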



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
