[ 
https://issues.apache.org/jira/browse/FLINK-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536825#comment-14536825
 ] 

ASF GitHub Bot commented on FLINK-1735:
---------------------------------------

GitHub user ChristophAl opened a pull request:

    https://github.com/apache/flink/pull/665

    [FLINK-1735] Feature Hasher

    The prototype of the feature hasher.
    
    - The implementation is based on the scikit-learn feature hasher
    - Test vectors have been generated by scikit-learn as well
    - Currently the implementation only works on Seq[String]

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ChristophAl/flink FLINK-1735_FeatureHasher

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/665.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #665
    
----
commit e5ad7e842f443dd4b15fe21f3d1d89c238c882d1
Author: Christoph Alt <christoph....@posteo.de>
Date:   2015-05-06T22:10:24Z

    Initial commit Issue #1735

commit 1e9312fdc46b741faea6bdfb26fc4ce359cd1cfa
Author: Christoph Alt <christoph....@posteo.de>
Date:   2015-05-08T13:54:53Z

    Added basic testcase for FeatureHasher

commit a0c6ee6251edc4d0e556ba98886a783a072bd27b
Author: Christoph Alt <christoph....@posteo.de>
Date:   2015-05-08T13:58:59Z

    FeatureHasher prototype
    
    - Added a prototype of Feature Hasher, currently accepts Seq[String] only

commit c55eb11fa21943dd8451256755bc707a59c3f5d3
Author: Christoph Alt <christoph....@posteo.de>
Date:   2015-05-08T14:09:48Z

    Corrected typos

commit 7002ab9e18a6cca5b55d700967accb375538faad
Author: Christoph Alt <christoph....@posteo.de>
Date:   2015-05-09T14:25:42Z

    Moved Featurehasher to feature.extraction package

commit 15b868f08806b375fff564f851f668122d363457
Author: Christoph Alt <christoph....@posteo.de>
Date:   2015-05-09T14:31:19Z

    Readded FeatureHasher.scala

commit 38e0650ebdec305c4a51e788699da0809a3b1973
Author: Christoph Alt <christoph....@posteo.de>
Date:   2015-05-09T18:36:00Z

    Reformated test vectors

----


> Add FeatureHasher to machine learning library
> ---------------------------------------------
>
>                 Key: FLINK-1735
>                 URL: https://issues.apache.org/jira/browse/FLINK-1735
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Felix Neutatz
>              Labels: ML
>
> Using the hashing trick [1,2] is a common way to vectorize arbitrary feature 
> values. The hash of a feature value determines the index of the vector entry 
> it updates. To mitigate collisions, a second hash function determines the 
> sign of the update value that is added to the vector entry. This way, 
> colliding updates tend to cancel out rather than accumulate.
> A feature hasher would also be helpful for NLP problems, where it could be 
> used to vectorize bag-of-words or n-gram features.
> Resources:
> [1] [https://en.wikipedia.org/wiki/Feature_hashing]
> [2] [http://scikit-learn.org/stable/modules/feature_extraction.html#feature-extraction]
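
For illustration, the signed hashing scheme described in the issue can be sketched roughly as follows. This is a minimal Python sketch and not the code from the pull request: the names `hash_features` and `_hash` are hypothetical, and salted MD5 digests stand in for the two hash functions purely to keep the example deterministic and dependency-free. A real implementation would likely use a faster non-cryptographic hash.

```python
import hashlib


def _hash(token: str, salt: str) -> int:
    """Hypothetical helper: deterministic 64-bit hash of `token` under `salt`."""
    digest = hashlib.md5((salt + ":" + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(tokens, n_features=16):
    """Map a sequence of string tokens to a fixed-length signed count vector."""
    vec = [0.0] * n_features
    for tok in tokens:
        # First hash: choose which vector entry this token updates.
        index = _hash(tok, "index") % n_features
        # Second hash: choose the sign of the update, so that colliding
        # tokens do not always push the entry in the same direction.
        sign = 1.0 if _hash(tok, "sign") % 2 == 0 else -1.0
        vec[index] += sign
    return vec


if __name__ == "__main__":
    print(hash_features(["the", "quick", "brown", "fox", "the"]))
```

The output length depends only on `n_features`, not on the vocabulary size, which is what makes the approach suitable for bag-of-words or n-gram input of unbounded vocabulary.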



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
