GitHub user MLnick opened a pull request: https://github.com/apache/spark/pull/18513
[SPARK-13969][ML] Add FeatureHasher transformer

This PR adds a `FeatureHasher` transformer, modeled on [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) and [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction).

The transformer operates on multiple input columns in one pass. Current behavior is:
* For numerical columns, the values are assumed to be real values: the feature index is `hash(column_name)` and the feature value is `feature_value`
* For string columns, the values are assumed to be categorical: the feature index is `hash(column_name=feature_value)` and the feature value is `1.0`
* For hash collisions, feature values are summed
* `null` (missing) values are ignored

The following dataframe illustrates the basic semantics:
```
+---+------+-----+---------+------+-----------------------------------------+
|int|double|float|stringNum|string|features                                 |
+---+------+-----+---------+------+-----------------------------------------+
|3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
|6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
+---+------+-----+---------+------+-----------------------------------------+
```

## How was this patch tested?

New unit tests and manual experiments.
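The hashing semantics listed in the description can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions, not the code in this PR: `hash_features` and `bucket` are hypothetical names, and CRC32 stands in for whatever hash function the transformer actually uses, so the resulting indices will not match the `features` column in the dataframe above.

```python
import zlib
from collections import defaultdict


def bucket(key, num_features):
    # Stand-in hash (CRC32), chosen only for illustration; the hash used
    # by the actual transformer is not shown in this description.
    return zlib.crc32(key.encode("utf-8")) % num_features


def hash_features(row, num_features=16):
    """Map a dict of column name -> value to a sparse {index: value} dict."""
    features = defaultdict(float)
    for column, value in row.items():
        if value is None:
            # null (missing) values are ignored
            continue
        if isinstance(value, str):
            # categorical: index from "column_name=feature_value", value 1.0
            features[bucket(f"{column}={value}", num_features)] += 1.0
        elif isinstance(value, (int, float)):
            # numerical: index from the column name, value is the number itself
            features[bucket(column, num_features)] += float(value)
        else:
            raise TypeError(f"unsupported type for column {column!r}")
    # "+=" above means colliding indices have their values summed
    return dict(sorted(features.items()))


row = {"int": 3, "double": 4.0, "float": 5.0, "stringNum": "1", "string": "foo"}
print(hash_features(row))
```

Because colliding values are summed rather than overwritten, the total of the feature values is preserved for any `num_features`, which is an easy property to check in a test.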
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark FeatureHasher

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18513.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18513

----
commit 6ab19a963f35de29af0a6b7b1598d5add78f200a
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2016-08-23T10:29:06Z

    initial WIP

commit ebd2cbf3467f26121c602f7c77c2018253cbdf18
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-02-01T10:43:07Z

    Further work

commit ba255bfda792d58aaded892e49c6cf48f0391159
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-22T10:52:12Z

    Clean up

commit 0be1e6572110d7d550f69fd86d3dd4e96660fde6
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-22T10:52:37Z

    Add tests

commit 2f3ea21e2e1835d7218e8c7bd096cc0787ed595c
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-22T13:08:26Z

    Copy, save/load, clean up

commit 7d678fbf5f88d377b79153212a3e0a2596039b17
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-06-26T12:38:02Z

    Move numFeatures to HasNumFeatures shared trait

commit 60572776de80ebcf1782c3d7def749557c8bec61
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-07-03T07:18:25Z

    Update shared params from codegen run

commit 9edb3bda8cbc4e00f05b91718249edf2750fc028
Author: Nick Pentreath <ni...@za.ibm.com>
Date:   2017-07-03T09:32:32Z

    Update tests. Null values ignored in feature hashing.

----