Repository: spark
Updated Branches:
  refs/heads/master b30a11a6a -> 4133c1b0a


[SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher

## What changes were proposed in this pull request?

This PR adds ML examples for the `FeatureHasher` transformer in Scala, Java, and Python.

## How was this patch tested?

Manually ran the examples and verified that the output is consistent across the different APIs.

Author: Bryan Cutler <cutl...@gmail.com>

Closes #19024 from BryanCutler/ml-examples-FeatureHasher-SPARK-21810.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4133c1b0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4133c1b0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4133c1b0

Branch: refs/heads/master
Commit: 4133c1b0abb22f728fbff287f4f77a06ab88bbe8
Parents: b30a11a
Author: Bryan Cutler <cutl...@gmail.com>
Authored: Wed Aug 30 16:00:29 2017 +0200
Committer: Nick Pentreath <ni...@za.ibm.com>
Committed: Wed Aug 30 16:00:29 2017 +0200

----------------------------------------------------------------------
 docs/ml-features.md                             | 91 +++++++++++++++++++-
 .../examples/ml/JavaFeatureHasherExample.java   | 69 +++++++++++++++
 .../main/python/ml/feature_hasher_example.py    | 46 ++++++++++
 .../examples/ml/FeatureHasherExample.scala      | 50 +++++++++++
 .../apache/spark/ml/feature/FeatureHasher.scala |  7 +-
 5 files changed, 256 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/4133c1b0/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index e19fba2..86a0e09 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -53,9 +53,9 @@ are calculated based on the mapped indices. This approach avoids the need to com
 term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
 collisions, where different raw features may become the same term after hashing. To reduce the
 chance of collision, we can increase the target feature dimension, i.e. the number of buckets
-of the hash table. Since a simple modulo is used to transform the hash function to a column index,
-it is advisable to use a power of two as the feature dimension, otherwise the features will
-not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
+of the hash table. Since a simple modulo on the hashed value is used to determine the vector index,
+it is advisable to use a power of two as the feature dimension, otherwise the features will not
+be mapped evenly to the vector indices. The default feature dimension is `$2^{18} = 262,144$`.
 An optional binary toggle parameter controls term frequency counts. When set to true all nonzero
 frequency counts are set to 1. This is especially useful for discrete probabilistic models that
 model binary, rather than integer, counts.
@@ -65,7 +65,7 @@ model binary, rather than integer, counts.
 
 **IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`.  The
 `IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and
-scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
+scales each feature. Intuitively, it down-weights features which appear frequently in a corpus.
 
 **Note:** `spark.ml` doesn't provide tools for text segmentation.
 We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and 
@@ -211,6 +211,89 @@ for more details on the API.
 </div>
 </div>
 
+## FeatureHasher
+
+Feature hashing projects a set of categorical or numerical features into a feature vector of
+specified dimension (typically substantially smaller than that of the original feature
+space). This is done using the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing)
+to map features to indices in the feature vector.
+
+The `FeatureHasher` transformer operates on multiple columns. Each column may contain either
+numeric or categorical features. Behavior and handling of column data types is as follows:
+
+- Numeric columns: For numeric features, the hash value of the column name is used to map the
+feature value to its index in the feature vector. Numeric features are never treated as
+categorical, even when they are integers. You must explicitly convert numeric columns containing
+categorical features to strings first.
+- String columns: For categorical features, the hash value of the string "column_name=value"
+is used to map to the vector index, with an indicator value of `1.0`. Thus, categorical features
+are "one-hot" encoded (similarly to using [OneHotEncoder](ml-features.html#onehotencoder) with
+`dropLast=false`).
+- Boolean columns: Boolean values are treated in the same way as string columns. That is,
+boolean features are represented as "column_name=true" or "column_name=false", with an indicator
+value of `1.0`.
+
+Null (missing) values are ignored (implicitly zero in the resulting feature vector).
+
+The hash function used here is also the [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash)
+used in [HashingTF](ml-features.html#tf-idf). Since a simple modulo on the hashed value is used to
+determine the vector index, it is advisable to use a power of two as the numFeatures parameter;
+otherwise the features will not be mapped evenly to the vector indices.
+
+**Examples**
+
+Assume that we have a DataFrame with 4 input columns `real`, `bool`, `stringNum`, and `string`.
+These different data types as input will illustrate the behavior of the transform to produce a
+column of feature vectors.
+
+~~~~
+real| bool|stringNum|string
+----|-----|---------|------
+ 2.2| true|        1|   foo
+ 3.3|false|        2|   bar
+ 4.4|false|        3|   baz
+ 5.5|false|        4|   foo
+~~~~
+
+Then the output of `FeatureHasher.transform` on this DataFrame is:
+
+~~~~
+real|bool |stringNum|string|features
+----|-----|---------|------|-------------------------------------------------------
+2.2 |true |1        |foo   |(262144,[51871, 63643,174475,253195],[1.0,1.0,2.2,1.0])
+3.3 |false|2        |bar   |(262144,[6031,  80619,140467,174475],[1.0,1.0,1.0,3.3])
+4.4 |false|3        |baz   |(262144,[24279,140467,174475,196810],[1.0,1.0,4.4,1.0])
+5.5 |false|4        |foo   |(262144,[63643,140467,168512,174475],[1.0,1.0,1.0,5.5])
+~~~~
+
+The resulting feature vectors could then be passed to a learning algorithm.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [FeatureHasher Scala docs](api/scala/index.html#org.apache.spark.ml.feature.FeatureHasher)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/FeatureHasherExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+Refer to the [FeatureHasher Java docs](api/java/org/apache/spark/ml/feature/FeatureHasher.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+Refer to the [FeatureHasher Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.FeatureHasher)
+for more details on the API.
+
+{% include_example python/ml/feature_hasher_example.py %}
+</div>
+</div>
+
 # Feature Transformers
 
 ## Tokenizer
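
The new section above notes that `FeatureHasher` never treats numeric columns as categorical, so numeric columns that hold category ids must be cast to strings first. The following is a hedged Scala sketch of that advice; it is not part of this commit, and the data and column names (`userId`, `clicks`) are hypothetical.

```scala
// Hedged sketch (not from this commit): cast a numeric, categorical-valued column to string
// before hashing, per the "Numeric columns" note above. Column names are hypothetical.
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("FeatureHasherCastSketch").getOrCreate()
import spark.implicits._

// "userId" holds category ids; left numeric, only the column name would be hashed and the id
// itself would become the feature value.
val raw = Seq((101, 1.0), (102, 0.0), (101, 3.0)).toDF("userId", "clicks")

// Cast to string so each id is hashed as its own "userId=101"-style term with value 1.0.
val prepared = raw.withColumn("userId", col("userId").cast("string"))

val hasher = new FeatureHasher()
  .setInputCols("userId", "clicks")
  .setOutputCol("features")

hasher.transform(prepared).show(false)
```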

http://git-wip-us.apache.org/repos/asf/spark/blob/4133c1b0/examples/src/main/java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java
----------------------------------------------------------------------
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java
new file mode 100644
index 0000000..9730d42
--- /dev/null
+++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.SparkSession;
+
+// $example on$
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.spark.ml.feature.FeatureHasher;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+// $example off$
+
+public class JavaFeatureHasherExample {
+  public static void main(String[] args) {
+    SparkSession spark = SparkSession
+      .builder()
+      .appName("JavaFeatureHasherExample")
+      .getOrCreate();
+
+    // $example on$
+    List<Row> data = Arrays.asList(
+      RowFactory.create(2.2, true, "1", "foo"),
+      RowFactory.create(3.3, false, "2", "bar"),
+      RowFactory.create(4.4, false, "3", "baz"),
+      RowFactory.create(5.5, false, "4", "foo")
+    );
+    StructType schema = new StructType(new StructField[]{
+      new StructField("real", DataTypes.DoubleType, false, Metadata.empty()),
+      new StructField("bool", DataTypes.BooleanType, false, Metadata.empty()),
+      new StructField("stringNum", DataTypes.StringType, false, 
Metadata.empty()),
+      new StructField("string", DataTypes.StringType, false, Metadata.empty())
+    });
+    Dataset<Row> dataset = spark.createDataFrame(data, schema);
+
+    FeatureHasher hasher = new FeatureHasher()
+      .setInputCols(new String[]{"real", "bool", "stringNum", "string"})
+      .setOutputCol("features");
+
+    Dataset<Row> featurized = hasher.transform(dataset);
+
+    featurized.show(false);
+    // $example off$
+
+    spark.stop();
+  }
+}

http://git-wip-us.apache.org/repos/asf/spark/blob/4133c1b0/examples/src/main/python/ml/feature_hasher_example.py
----------------------------------------------------------------------
diff --git a/examples/src/main/python/ml/feature_hasher_example.py b/examples/src/main/python/ml/feature_hasher_example.py
new file mode 100644
index 0000000..6cf9ecc
--- /dev/null
+++ b/examples/src/main/python/ml/feature_hasher_example.py
@@ -0,0 +1,46 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+from pyspark.sql import SparkSession
+# $example on$
+from pyspark.ml.feature import FeatureHasher
+# $example off$
+
+if __name__ == "__main__":
+    spark = SparkSession\
+        .builder\
+        .appName("FeatureHasherExample")\
+        .getOrCreate()
+
+    # $example on$
+    dataset = spark.createDataFrame([
+        (2.2, True, "1", "foo"),
+        (3.3, False, "2", "bar"),
+        (4.4, False, "3", "baz"),
+        (5.5, False, "4", "foo")
+    ], ["real", "bool", "stringNum", "string"])
+
+    hasher = FeatureHasher(inputCols=["real", "bool", "stringNum", "string"],
+                           outputCol="features")
+
+    featurized = hasher.transform(dataset)
+    featurized.show(truncate=False)
+    # $example off$
+
+    spark.stop()

http://git-wip-us.apache.org/repos/asf/spark/blob/4133c1b0/examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala
----------------------------------------------------------------------
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala
new file mode 100644
index 0000000..1aed10b
--- /dev/null
+++ b/examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.FeatureHasher
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object FeatureHasherExample {
+  def main(args: Array[String]): Unit = {
+    val spark = SparkSession
+      .builder
+      .appName("FeatureHasherExample")
+      .getOrCreate()
+
+    // $example on$
+    val dataset = spark.createDataFrame(Seq(
+      (2.2, true, "1", "foo"),
+      (3.3, false, "2", "bar"),
+      (4.4, false, "3", "baz"),
+      (5.5, false, "4", "foo")
+    )).toDF("real", "bool", "stringNum", "string")
+
+    val hasher = new FeatureHasher()
+      .setInputCols("real", "bool", "stringNum", "string")
+      .setOutputCol("features")
+
+    val featurized = hasher.transform(dataset)
+    featurized.show(false)
+    // $example off$
+
+    spark.stop()
+  }
+}
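
The docs added above advise using a power of two for `numFeatures`. As a hedged extension of the Scala example (not part of this commit), the feature dimension could be lowered from the default 2^18 = 262,144 like this, assuming the `dataset` defined in the example and the standard `setNumFeatures` setter:

```scala
// Sketch only: same input columns, but an explicit power-of-two feature dimension.
val smallHasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")
  .setNumFeatures(1 << 10)  // 1024 buckets; a power of two keeps the hash-to-index mapping even

smallHasher.transform(dataset).show(false)
```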

http://git-wip-us.apache.org/repos/asf/spark/blob/4133c1b0/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala
index 4b91fa9..4615dae 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala
@@ -53,9 +53,10 @@ import org.apache.spark.util.collection.OpenHashMap
  *
 * Null (missing) values are ignored (implicitly zero in the resulting feature vector).
 *
- * Since a simple modulo is used to transform the hash function to a vector index,
- * it is advisable to use a power of two as the numFeatures parameter;
- * otherwise the features will not be mapped evenly to the vector indices.
+ * The hash function used here is also the MurmurHash 3 used in [[HashingTF]]. Since a simple modulo
+ * on the hashed value is used to determine the vector index, it is advisable to use a power of two
+ * as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector
+ * indices.
  *
  * {{{
  *   val df = Seq(

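To make the updated scaladoc's "simple modulo on the hashed value" wording concrete, here is an illustrative Scala sketch of such an index computation. It is not the code from this commit: the helper name is made up, and `hashCode` stands in for the MurmurHash 3 function the transformer actually uses.

```scala
// Sketch: fold a (possibly negative) hash into a vector index in [0, numFeatures).
def toIndex(hash: Int, numFeatures: Int): Int = {
  val mod = hash % numFeatures
  if (mod < 0) mod + numFeatures else mod
}

// With a power-of-two dimension, the modulo keeps only the low-order bits of the hash,
// which is why the docs recommend dimensions such as 2^18 = 262,144.
val idx = toIndex("string=foo".hashCode, 1 << 18)
```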
