GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/18513

    [SPARK-13969][ML] Add FeatureHasher transformer

    This PR adds a `FeatureHasher` transformer, modeled on [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) and [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction).
    
    The transformer operates on multiple input columns in one pass. Current behavior is:
    * for numerical columns, the values are assumed to be real values; the feature index is `hash(column_name)` and the feature value is `feature_value`
    * for string columns, the values are assumed to be categorical; the feature index is `hash(column_name=feature_value)` and the feature value is `1.0`
    * for hash collisions, feature values are summed
    * `null` (missing) values are ignored
    
    The following dataframe illustrates the basic semantics:
    ```
    +---+------+-----+---------+------+-----------------------------------------+
    |int|double|float|stringNum|string|features                                 |
    +---+------+-----+---------+------+-----------------------------------------+
    |3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
    |6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
    +---+------+-----+---------+------+-----------------------------------------+
    ```
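
The hashing rules described in this PR can be sketched in plain Python, outside Spark. This is only an illustration under stated assumptions: the function name `feature_hash` is hypothetical, CRC32 stands in for whatever hash function the transformer actually uses, and reducing the hash modulo `num_features` is assumed. The indices it produces will therefore not match the `features` column in the example dataframe.

```python
import zlib
from collections import defaultdict


def feature_hash(row, num_features=16):
    """Hash a {column_name: value} row into a sparse {index: value} mapping.

    Hypothetical helper: CRC32 is a stand-in hash, not the one used by the PR.
    """
    features = defaultdict(float)
    for column, value in row.items():
        if value is None:
            # null (missing) values are ignored
            continue
        if isinstance(value, str):
            # string columns are categorical: hash "column=value", value 1.0
            key, weight = f"{column}={value}", 1.0
        else:
            # numerical columns: hash the column name, keep the real value
            key, weight = column, float(value)
        # assumption: the hash is reduced modulo the vector size
        index = zlib.crc32(key.encode("utf-8")) % num_features
        # colliding features have their values summed
        features[index] += weight
    return dict(sorted(features.items()))


row = {"int": 3, "double": 4.0, "float": 5.0, "stringNum": "1", "string": "foo"}
print(feature_hash(row))
```

Whatever the hash function, the values in the output always sum to the sum of the numerical inputs plus one per non-null string column, since collisions add rather than overwrite.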
    
    ## How was this patch tested?
    
    New unit tests and manual experiments.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark FeatureHasher

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18513.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18513
    
----
commit 6ab19a963f35de29af0a6b7b1598d5add78f200a
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-08-23T10:29:06Z

    initial WIP

commit ebd2cbf3467f26121c602f7c77c2018253cbdf18
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-02-01T10:43:07Z

    Further work

commit ba255bfda792d58aaded892e49c6cf48f0391159
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-22T10:52:12Z

    Clean up

commit 0be1e6572110d7d550f69fd86d3dd4e96660fde6
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-22T10:52:37Z

    Add tests

commit 2f3ea21e2e1835d7218e8c7bd096cc0787ed595c
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-22T13:08:26Z

    Copy, save/load, clean up

commit 7d678fbf5f88d377b79153212a3e0a2596039b17
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-26T12:38:02Z

    Move numFeatures to HasNumFeatures shared trait

commit 60572776de80ebcf1782c3d7def749557c8bec61
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-07-03T07:18:25Z

    Update shared params from codegen run

commit 9edb3bda8cbc4e00f05b91718249edf2750fc028
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-07-03T09:32:32Z

    Update tests. Null values ignored in feature hashing.

----

