spark git commit: [SPARK-23048][ML] Add OneHotEncoderEstimator document and examples

mlnick Fri, 19 Jan 2018 02:49:46 -0800

Repository: spark
Updated Branches:
  refs/heads/master 60203fca6 -> b74366481



[SPARK-23048][ML] Add OneHotEncoderEstimator document and examples

## What changes were proposed in this pull request?

`OneHotEncoderEstimator` is now available, and `OneHotEncoder` is deprecated 
as of 2.3.0. This change adds `OneHotEncoderEstimator` to the MLlib documentation.

It also adds the corresponding `OneHotEncoderEstimator` examples, which the 
documentation includes.
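
As a rough illustration of why a fitted model is preferable to a stateless transformer, here is a plain-Python sketch. It is not Spark code: `stateless_encode` and `FittedEncoder` are hypothetical names, and details such as Spark's `dropLast` option are ignored. It shows how inferring the category count from each dataset can yield vectors of different sizes, while a count learned once at fit time does not:

```python
# Conceptual sketch only: plain Python, not Spark's implementation.
# All names here are hypothetical and chosen for illustration.


def stateless_encode(indices):
    """Infer the number of categories from the data being encoded."""
    size = int(max(indices)) + 1
    return [[1.0 if i == int(v) else 0.0 for i in range(size)]
            for v in indices]


class FittedEncoder:
    """Remember the number of categories seen at fit time."""

    def fit(self, indices):
        self.size = int(max(indices)) + 1
        return self

    def transform(self, indices):
        return [[1.0 if i == int(v) else 0.0 for i in range(self.size)]
                for v in indices]


train = [0.0, 1.0, 2.0, 0.0]
new = [0.0, 1.0]  # category 2 happens not to appear in the new data

# Stateless: vector length depends on whichever data is passed in.
print(len(stateless_encode(train)[0]), len(stateless_encode(new)[0]))

# Fitted: vector length is fixed by the training data.
model = FittedEncoder().fit(train)
print(len(model.transform(train)[0]), len(model.transform(new)[0]))
```

The stateless version produces length-3 vectors for `train` but length-2 vectors for `new`, whereas the fitted version produces length-3 vectors for both.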

## How was this patch tested?

Existing tests.

Author: Liang-Chi Hsieh <vii...@gmail.com>

Closes #20257 from viirya/SPARK-23048.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b7436648
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b7436648
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b7436648

Branch: refs/heads/master
Commit: b74366481cc87490adf4e69d26389ec737548c15
Parents: 60203fc
Author: Liang-Chi Hsieh <vii...@gmail.com>
Authored: Fri Jan 19 12:48:42 2018 +0200
Committer: Nick Pentreath <ni...@za.ibm.com>
Committed: Fri Jan 19 12:48:42 2018 +0200

----------------------------------------------------------------------
 docs/ml-features.md                             | 28 ++++---
 .../ml/JavaOneHotEncoderEstimatorExample.java   | 74 ++++++++++++++++++
 .../examples/ml/JavaOneHotEncoderExample.java   | 79 --------------------
 .../ml/onehot_encoder_estimator_example.py      | 49 ++++++++++++
 .../main/python/ml/onehot_encoder_example.py    | 50 -------------
 .../ml/OneHotEncoderEstimatorExample.scala      | 56 ++++++++++++++
 .../examples/ml/OneHotEncoderExample.scala      | 60 ---------------
 7 files changed, 197 insertions(+), 199 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/b7436648/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 10183c3..466a8fb 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -775,35 +775,43 @@ for more details on the API.
 </div>
 </div>
 
-## OneHotEncoder
+## OneHotEncoder (Deprecated since 2.3.0)
 
-[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
+Because this existing `OneHotEncoder` is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new `OneHotEncoderEstimator` was created that produces a `OneHotEncoderModel` when fitting. For more detail, please see [SPARK-13030](https://issues.apache.org/jira/browse/SPARK-13030).
+
+`OneHotEncoder` has been deprecated in 2.3.0 and will be removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) instead.
+
+## OneHotEncoderEstimator
+
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
+
+`OneHotEncoderEstimator` can transform multiple columns, returning a one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using [VectorAssembler](ml-features.html#vectorassembler).
+
+`OneHotEncoderEstimator` supports the `handleInvalid` parameter to choose how to handle invalid input when transforming data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical index) and 'error' (throw an error).
 
 **Examples**
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [OneHotEncoder Scala docs](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder)
-for more details on the API.
+Refer to the [OneHotEncoderEstimator Scala docs](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoderEstimator) for more details on the API.
 
-{% include_example scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala %}
+{% include_example scala/org/apache/spark/examples/ml/OneHotEncoderEstimatorExample.scala %}
 </div>
 
 <div data-lang="java" markdown="1">
 
-Refer to the [OneHotEncoder Java docs](api/java/org/apache/spark/ml/feature/OneHotEncoder.html)
+Refer to the [OneHotEncoderEstimator Java docs](api/java/org/apache/spark/ml/feature/OneHotEncoderEstimator.html)
 for more details on the API.
 
-{% include_example java/org/apache/spark/examples/ml/JavaOneHotEncoderExample.java %}
+{% include_example java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java %}
 </div>
 
 <div data-lang="python" markdown="1">
 
-Refer to the [OneHotEncoder Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
-for more details on the API.
+Refer to the [OneHotEncoderEstimator Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator) for more details on the API.
 
-{% include_example python/ml/onehot_encoder_example.py %}
+{% include_example python/ml/onehot_encoder_estimator_example.py %}
 </div>
 </div>
 

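The `handleInvalid` options described in the documentation change above can be sketched in plain Python. This is an illustration only, not Spark's implementation: the `encode` helper and its parameters are hypothetical, and an "invalid" input is assumed here to mean an index outside the range learned at fit time.

```python
# Conceptual sketch only; `encode` is a hypothetical helper, not a Spark API.


def encode(index, num_categories, handle_invalid="error"):
    """One-hot encode a single index given the category count from fitting."""
    if handle_invalid not in ("error", "keep"):
        raise ValueError("unsupported handle_invalid: %r" % handle_invalid)
    valid = 0 <= index < num_categories
    if not valid and handle_invalid == "error":
        raise ValueError("unseen category index: %r" % index)
    # With 'keep', one extra slot is reserved for all invalid inputs.
    size = num_categories + 1 if handle_invalid == "keep" else num_categories
    vector = [0.0] * size
    vector[int(index) if valid else num_categories] = 1.0
    return vector


print(encode(1, 3))          # valid index, default 'error' mode
print(encode(5, 3, "keep"))  # invalid index mapped to the extra slot
```

With `"error"`, calling `encode(5, 3)` raises a `ValueError` instead of returning a vector.
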
http://git-wip-us.apache.org/repos/asf/spark/blob/b7436648/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java
----------------------------------------------------------------------
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java
new file mode 100644
index 0000000..6f93cff
--- /dev/null
+++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.sql.SparkSession;
+
+// $example on$
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.spark.ml.feature.OneHotEncoderEstimator;
+import org.apache.spark.ml.feature.OneHotEncoderModel;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+// $example off$
+
+public class JavaOneHotEncoderEstimatorExample {
+  public static void main(String[] args) {
+    SparkSession spark = SparkSession
+      .builder()
+      .appName("JavaOneHotEncoderEstimatorExample")
+      .getOrCreate();
+
+    // Note: categorical features are usually first encoded with StringIndexer
+    // $example on$
+    List<Row> data = Arrays.asList(
+      RowFactory.create(0.0, 1.0),
+      RowFactory.create(1.0, 0.0),
+      RowFactory.create(2.0, 1.0),
+      RowFactory.create(0.0, 2.0),
+      RowFactory.create(0.0, 1.0),
+      RowFactory.create(2.0, 0.0)
+    );
+
+    StructType schema = new StructType(new StructField[]{
+      new StructField("categoryIndex1", DataTypes.DoubleType, false, Metadata.empty()),
+      new StructField("categoryIndex2", DataTypes.DoubleType, false, Metadata.empty())
+    });
+
+    Dataset<Row> df = spark.createDataFrame(data, schema);
+
+    OneHotEncoderEstimator encoder = new OneHotEncoderEstimator()
+      .setInputCols(new String[] {"categoryIndex1", "categoryIndex2"})
+      .setOutputCols(new String[] {"categoryVec1", "categoryVec2"});
+
+    OneHotEncoderModel model = encoder.fit(df);
+    Dataset<Row> encoded = model.transform(df);
+    encoded.show();
+    // $example off$
+
+    spark.stop();
+  }
+}
+

http://git-wip-us.apache.org/repos/asf/spark/blob/b7436648/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderExample.java
----------------------------------------------------------------------
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderExample.java
deleted file mode 100644
index 99af376..0000000
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaOneHotEncoderExample.java
+++ /dev/null
@@ -1,79 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *    http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.examples.ml;
-
-import org.apache.spark.sql.SparkSession;
-
-// $example on$
-import java.util.Arrays;
-import java.util.List;
-
-import org.apache.spark.ml.feature.OneHotEncoder;
-import org.apache.spark.ml.feature.StringIndexer;
-import org.apache.spark.ml.feature.StringIndexerModel;
-import org.apache.spark.sql.Dataset;
-import org.apache.spark.sql.Row;
-import org.apache.spark.sql.RowFactory;
-import org.apache.spark.sql.types.DataTypes;
-import org.apache.spark.sql.types.Metadata;
-import org.apache.spark.sql.types.StructField;
-import org.apache.spark.sql.types.StructType;
-// $example off$
-
-public class JavaOneHotEncoderExample {
-  public static void main(String[] args) {
-    SparkSession spark = SparkSession
-      .builder()
-      .appName("JavaOneHotEncoderExample")
-      .getOrCreate();
-
-    // $example on$
-    List<Row> data = Arrays.asList(
-      RowFactory.create(0, "a"),
-      RowFactory.create(1, "b"),
-      RowFactory.create(2, "c"),
-      RowFactory.create(3, "a"),
-      RowFactory.create(4, "a"),
-      RowFactory.create(5, "c")
-    );
-
-    StructType schema = new StructType(new StructField[]{
-      new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
-      new StructField("category", DataTypes.StringType, false, Metadata.empty())
-    });
-
-    Dataset<Row> df = spark.createDataFrame(data, schema);
-
-    StringIndexerModel indexer = new StringIndexer()
-      .setInputCol("category")
-      .setOutputCol("categoryIndex")
-      .fit(df);
-    Dataset<Row> indexed = indexer.transform(df);
-
-    OneHotEncoder encoder = new OneHotEncoder()
-      .setInputCol("categoryIndex")
-      .setOutputCol("categoryVec");
-
-    Dataset<Row> encoded = encoder.transform(indexed);
-    encoded.show();
-    // $example off$
-
-    spark.stop();
-  }
-}
-

http://git-wip-us.apache.org/repos/asf/spark/blob/b7436648/examples/src/main/python/ml/onehot_encoder_estimator_example.py
----------------------------------------------------------------------
diff --git a/examples/src/main/python/ml/onehot_encoder_estimator_example.py b/examples/src/main/python/ml/onehot_encoder_estimator_example.py
new file mode 100644
index 0000000..2723e68
--- /dev/null
+++ b/examples/src/main/python/ml/onehot_encoder_estimator_example.py
@@ -0,0 +1,49 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import OneHotEncoderEstimator
+# $example off$
+from pyspark.sql import SparkSession
+
+if __name__ == "__main__":
+    spark = SparkSession\
+        .builder\
+        .appName("OneHotEncoderEstimatorExample")\
+        .getOrCreate()
+
+    # Note: categorical features are usually first encoded with StringIndexer
+    # $example on$
+    df = spark.createDataFrame([
+        (0.0, 1.0),
+        (1.0, 0.0),
+        (2.0, 1.0),
+        (0.0, 2.0),
+        (0.0, 1.0),
+        (2.0, 0.0)
+    ], ["categoryIndex1", "categoryIndex2"])
+
+    encoder = OneHotEncoderEstimator(inputCols=["categoryIndex1", "categoryIndex2"],
+                                     outputCols=["categoryVec1", "categoryVec2"])
+    model = encoder.fit(df)
+    encoded = model.transform(df)
+    encoded.show()
+    # $example off$
+
+    spark.stop()

http://git-wip-us.apache.org/repos/asf/spark/blob/b7436648/examples/src/main/python/ml/onehot_encoder_example.py
----------------------------------------------------------------------
diff --git a/examples/src/main/python/ml/onehot_encoder_example.py b/examples/src/main/python/ml/onehot_encoder_example.py
deleted file mode 100644
index e1996c7..0000000
--- a/examples/src/main/python/ml/onehot_encoder_example.py
+++ /dev/null
@@ -1,50 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-from __future__ import print_function
-
-# $example on$
-from pyspark.ml.feature import OneHotEncoder, StringIndexer
-# $example off$
-from pyspark.sql import SparkSession
-
-if __name__ == "__main__":
-    spark = SparkSession\
-        .builder\
-        .appName("OneHotEncoderExample")\
-        .getOrCreate()
-
-    # $example on$
-    df = spark.createDataFrame([
-        (0, "a"),
-        (1, "b"),
-        (2, "c"),
-        (3, "a"),
-        (4, "a"),
-        (5, "c")
-    ], ["id", "category"])
-
-    stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
-    model = stringIndexer.fit(df)
-    indexed = model.transform(df)
-
-    encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
-    encoded = encoder.transform(indexed)
-    encoded.show()
-    # $example off$
-
-    spark.stop()

http://git-wip-us.apache.org/repos/asf/spark/blob/b7436648/examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderEstimatorExample.scala
----------------------------------------------------------------------
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderEstimatorExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderEstimatorExample.scala
new file mode 100644
index 0000000..45d8168
--- /dev/null
+++ b/examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderEstimatorExample.scala
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.OneHotEncoderEstimator
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object OneHotEncoderEstimatorExample {
+  def main(args: Array[String]): Unit = {
+    val spark = SparkSession
+      .builder
+      .appName("OneHotEncoderEstimatorExample")
+      .getOrCreate()
+
+    // Note: categorical features are usually first encoded with StringIndexer
+    // $example on$
+    val df = spark.createDataFrame(Seq(
+      (0.0, 1.0),
+      (1.0, 0.0),
+      (2.0, 1.0),
+      (0.0, 2.0),
+      (0.0, 1.0),
+      (2.0, 0.0)
+    )).toDF("categoryIndex1", "categoryIndex2")
+
+    val encoder = new OneHotEncoderEstimator()
+      .setInputCols(Array("categoryIndex1", "categoryIndex2"))
+      .setOutputCols(Array("categoryVec1", "categoryVec2"))
+    val model = encoder.fit(df)
+
+    val encoded = model.transform(df)
+    encoded.show()
+    // $example off$
+
+    spark.stop()
+  }
+}
+// scalastyle:on println

http://git-wip-us.apache.org/repos/asf/spark/blob/b7436648/examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala
----------------------------------------------------------------------
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala
deleted file mode 100644
index 274cc12..0000000
--- a/examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala
+++ /dev/null
@@ -1,60 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *    http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-// scalastyle:off println
-package org.apache.spark.examples.ml
-
-// $example on$
-import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
-// $example off$
-import org.apache.spark.sql.SparkSession
-
-object OneHotEncoderExample {
-  def main(args: Array[String]): Unit = {
-    val spark = SparkSession
-      .builder
-      .appName("OneHotEncoderExample")
-      .getOrCreate()
-
-    // $example on$
-    val df = spark.createDataFrame(Seq(
-      (0, "a"),
-      (1, "b"),
-      (2, "c"),
-      (3, "a"),
-      (4, "a"),
-      (5, "c")
-    )).toDF("id", "category")
-
-    val indexer = new StringIndexer()
-      .setInputCol("category")
-      .setOutputCol("categoryIndex")
-      .fit(df)
-    val indexed = indexer.transform(df)
-
-    val encoder = new OneHotEncoder()
-      .setInputCol("categoryIndex")
-      .setOutputCol("categoryVec")
-
-    val encoded = encoder.transform(indexed)
-    encoded.show()
-    // $example off$
-
-    spark.stop()
-  }
-}
-// scalastyle:on println


