GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/19527
    [SPARK-13030][ML] Create OneHotEncoderEstimator for OneHotEncoder as Estimator

    ## What changes were proposed in this pull request?

    This patch adds a new class `OneHotEncoderEstimator`, which extends `Estimator`. Its `fit` method returns a `OneHotEncoderModel`.

    Methods shared by the existing `OneHotEncoder` and the new `OneHotEncoderEstimator`, such as schema transformation, are extracted into `OneHotEncoderCommon`.

    ### Multi-column support

    `OneHotEncoderEstimator` adds simpler multi-column support. Because it is a new API, it is free from backward-compatibility constraints.

    ### handleInvalid Param support

    `OneHotEncoderEstimator` supports the `handleInvalid` Param. The supported values are `error` and `skip`.

    Note that `skip` cannot be used together with `dropLast` set to true, because the two would conflict in the encoded vector.

    ## How was this patch tested?

    Added a new test suite, `OneHotEncoderEstimatorSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-13030

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19527.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19527

----
commit 8fd4677fd0e729d99d8777010e78bb5cfea3cf86
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date:   2017-10-18T07:31:32Z

    Add OneHotEncoderEstimator and related tests.

----
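
For readers wondering why `skip` and `dropLast` might conflict, here is a minimal plain-Python sketch. It is not the Spark implementation. It assumes that `skip` encodes an unseen category index as an all-zeros vector, which the description above does not state explicitly. The names `encode` and `check_params` are hypothetical and exist only for this illustration.

```python
def encode(index, num_categories, drop_last, skip_invalid):
    """Toy one-hot encoding of a category index (not Spark's implementation)."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if 0 <= index < num_categories:
        # Valid category. With drop_last, the last category has no slot of
        # its own, so it is represented by the all-zeros vector.
        if index < size:
            vector[index] = 1.0
        return vector
    if not skip_invalid:
        raise ValueError(f"unseen category index: {index}")
    # ASSUMPTION: "skip" maps an unseen index to the all-zeros vector.
    return vector


def check_params(drop_last, handle_invalid):
    """Reject the parameter combination the description rules out."""
    if handle_invalid not in ("error", "skip"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid}")
    if handle_invalid == "skip" and drop_last:
        raise ValueError("handle_invalid='skip' cannot be combined with drop_last=True")


if __name__ == "__main__":
    # With drop_last, the last category and an unseen value are indistinguishable.
    last = encode(2, 3, drop_last=True, skip_invalid=True)
    unseen = encode(7, 3, drop_last=True, skip_invalid=True)
    print(last == unseen)  # True

    # Without drop_last, the two encodings differ.
    last = encode(2, 3, drop_last=False, skip_invalid=True)
    unseen = encode(7, 3, drop_last=False, skip_invalid=True)
    print(last == unseen)  # False
```

Under that assumption, the last category and any unseen value both collapse to the same all-zeros vector when `dropLast` is true, so a parameter check of this kind would need to reject the combination up front.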