Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19024#discussion_r134741953

    --- Diff: docs/ml-features.md ---
    @@ -211,6 +211,65 @@ for more details on the API.
     </div>
     </div>

    +## FeatureHasher
    +
    +Feature hashing projects a set of categorical or numerical features into a feature vector of
    +specified dimension (typically substantially smaller than that of the original feature
    +space). This is done using the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing)
    +to map features to indices in the feature vector.
    +
    +The `FeatureHasher` transformer operates on multiple columns. Each column may contain either
    +numeric or categorical features. Behavior and handling of column data types is as follows:
    +
    +- Numeric columns: For numeric features, the hash value of the column name is used to map the
    +feature value to its index in the feature vector. Numeric features are never treated as
    +categorical, even when they are integers. You must explicitly convert numeric columns containing
    +categorical features to strings first.
    +- String columns: For categorical features, the hash value of the string "column_name=value"
    +is used to map to the vector index, with an indicator value of `1.0`. Thus, categorical features
    +are "one-hot" encoded (similarly to using `OneHotEncoder` with `dropLast=false`).
    +- Boolean columns: Boolean values are treated in the same way as string columns. That is,
    +boolean features are represented as "column_name=true" or "column_name=false", with an indicator
    +value of `1.0`.
    +
    +Null (missing) values are ignored (implicitly zero in the resulting feature vector).
    +
    +Since a simple modulo is used to transform the hash function to a vector index,
    --- End diff --

    We should probably say something to the effect that the hashing mechanism is the same as used for `HashingTF`.
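
    For reference, here is a rough sketch of the column handling the doc text describes, in plain Python rather than Spark. The names `feature_index` and `hash_features` are hypothetical, `zlib.crc32` is only a stand-in for whatever hash function Spark uses, and summing values that collide on the same index is an assumption of this sketch, not something the diff states.

```python
import zlib


def feature_index(term: str, num_features: int) -> int:
    """Map a term to a vector index with a simple modulo.

    zlib.crc32 is a stand-in hash chosen for this sketch only.
    """
    return zlib.crc32(term.encode("utf-8")) % num_features


def hash_features(row: dict, num_features: int) -> dict:
    """Return a sparse vector (index -> value) for one row of columns."""
    vec = {}
    for name, value in row.items():
        if value is None:
            # Null (missing) values are ignored: implicitly zero.
            continue
        if isinstance(value, bool):
            # Booleans are handled like strings: "column_name=true/false".
            term, weight = f"{name}={str(value).lower()}", 1.0
        elif isinstance(value, str):
            # Categorical: hash "column_name=value", indicator value 1.0.
            term, weight = f"{name}={value}", 1.0
        else:
            # Numeric: hash the column name, keep the feature value.
            term, weight = name, float(value)
        idx = feature_index(term, num_features)
        # Assumption of this sketch: colliding features are summed.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


row = {"real": 2.2, "bool": True, "stringNum": "1", "string": "foo", "missing": None}
print(hash_features(row, 1 << 10))
```

    Note that the integer-looking `"1"` in `stringNum` is treated as categorical only because it is a string, which matches the doc's advice to convert such columns to strings first.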