[ 
https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532386#comment-14532386
 ] 

Christoph Alt commented on FLINK-1735:
--------------------------------------

Hi,

I'm also working with Felix on this issue. I developed an initial prototype 
that takes the scikit-learn implementation as a reference. scikit-learn only 
supports strings as categorical values, either as a sequence or as a bag of words.

We can't use Transformer[Vector, Vector] because Vector doesn't support 
arbitrary basic types.
I don't know whether there are plans to integrate a tokenizer/sentence splitter, 
but I assume its output would be something like DataSet[Seq[String]] for raw 
text, or DataSet[TupleX[A]] for bag of words or images, which the feature 
hasher would then transform into a Sparse/DenseVector.

I was wondering whether there is any convention or standard you want to follow?

> Add FeatureHasher to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
>
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature 
> values. The hash of the feature value is used to calculate its index for a 
> vector entry. In order to mitigate possible collisions, a second hashing 
> function is used to calculate the sign for the update value which is added to 
> the vector entry. This way, it is likely that collisions will simply cancel 
> out.
> A feature hasher would also be helpful for NLP problems where it could be 
> used to vectorize bag of words or ngrams feature vectors.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] 
> [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]
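
For illustration, the signed hashing trick described in the issue can be 
sketched in plain Scala as follows. This is only a minimal sketch and not 
the prototype or a Flink API: the object name FeatureHasherSketch, the 
method hash, the two seed values, and the choice of MurmurHash3 as the hash 
function are all assumptions made for the example. The sparse vector is 
represented as a Map from index to value instead of a Flink SparseVector.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper, not part of Flink: hashes a sequence of string
// features into a sparse vector represented as index -> value.
object FeatureHasherSketch {
  // Two different seeds stand in for the two hash functions (assumed values).
  private val IndexSeed = 42
  private val SignSeed = 1337

  def hash(tokens: Seq[String], numFeatures: Int): Map[Int, Double] = {
    require(numFeatures > 0, "numFeatures must be positive")
    val accumulated = tokens.foldLeft(Map.empty[Int, Double]) { (acc, token) =>
      // The first hash selects the vector index; floorMod keeps it in
      // the range [0, numFeatures) even for negative hash values.
      val index = java.lang.Math.floorMod(
        MurmurHash3.stringHash(token, IndexSeed), numFeatures)
      // The second hash selects the sign of the update (+1 or -1).
      val sign =
        if ((MurmurHash3.stringHash(token, SignSeed) & 1) == 0) 1.0 else -1.0
      acc.updated(index, acc.getOrElse(index, 0.0) + sign)
    }
    // Drop entries whose updates cancelled out, so the result stays sparse.
    accumulated.filter { case (_, value) => value != 0.0 }
  }
}
```

A call such as FeatureHasherSketch.hash(Seq("to", "be", "or", "not", "to", 
"be"), 16) would return a map with at most 16 entries. Repeated tokens 
always land on the same index with the same sign, whereas two different 
tokens that collide on an index may carry opposite signs and cancel.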



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
