[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-30 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-136155023
  
Thanks for helping review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-28 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135798504
  
LGTM except some minor issues with Java imports. I will fix those in a 
separate PR. Merged into master and branch-1.5. Thanks!





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8487





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135681877
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135681880
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41740/





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135681561
  
  [Test build #41740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41740/console) for PR 8487 at commit [`007c369`](https://github.com/apache/spark/commit/007c3691b9bc2a3f1c2f5007a1b6f4e73c5c4b06).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135671505
  
  [Test build #41740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41740/consoleFull) for PR 8487 at commit [`007c369`](https://github.com/apache/spark/commit/007c3691b9bc2a3f1c2f5007a1b6f4e73c5c4b06).





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135669129
  
Merged build started.





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135668992
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115481
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
--- End diff --

minor, break lines at 100 chars
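
    For illustration, one way the quoted sentence could be rewrapped to stay under 100
    characters per line (a sketch only, not necessarily the exact wording that was committed):

        It takes parameter `vocabulary: Array[String]` and produces sparse representations
        for the documents over the vocabulary, which can then be passed to other algorithms
        like LDA.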





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115523
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be 
used as an Estimator to extract the vocabulary and generates a 
`CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across 
the corpus.
+An optional parameter "minDF" also affect the fitting process by 
specifying the minimum number (or fraction if < 1.0) of documents a term must 
appear in to be included in the vocabulary.
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
 and

+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.mllib.util.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define CountVectorizerModel with a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
+  .setInputCol("words")
+  .setOutputCol("features")
+
+// alternatively, fit a CountVectorizerModel from the corpus
+val cv2: CountVectorizerModel = new CountVectorizer()
+  .setInputCol("words")
+  .setOutputCol("features")
+  .setVocabSize(3)
+  .setMinDF(2) // a term must appear in more than 2 documents to be 
included in the vocabulary
+  .fit(df)
+
+cv.transform(df).select("features").collect()
+{% endhighlight %}
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) 
and

+[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html).
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.CountVectorizer;
+import org.apache.spark.ml.feature.CountVectorizerModel;
+import org.apache.spark.sql.DataFrame;
+
+// Input data: Each row is a bag of words from a sentence or document.
+JavaRDD jrdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Arrays.asList("a b c".split(" "))),
+  RowFactory.create(Arrays.asList("a b b c a".split(" ")))
+));
+StructType schema = new StructType(new StructField[]{
+  new StructField("text", new ArrayType(DataTypes.StringType, true), 
false, Metadata.empty())
+});
+DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema);
+
+// define CountVectorizerModel with a-priori vocabulary
+CountVectorizerModel cv = new CountVectorizerModel(new String[]{"a", "b", 
"c"})
--- End diff --

Ditto. Show `CountVectorizer` first.
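
    For concreteness, a minimal sketch of the reordered Java snippet, assuming the input
    DataFrame `df` built earlier in the quoted example (renamed from `documentDF`, as
    suggested in another comment) and folding in the `show()` suggestion; not the exact
    code that was merged:

        import org.apache.spark.ml.feature.CountVectorizer;
        import org.apache.spark.ml.feature.CountVectorizerModel;

        // Fit a CountVectorizerModel from the corpus first.
        CountVectorizerModel cvModel = new CountVectorizer()
          .setInputCol("text")
          .setOutputCol("feature")
          .setVocabSize(3)
          .setMinDF(2) // a term must appear in at least 2 documents to be included
          .fit(df);

        // Alternatively, define a CountVectorizerModel with an a-priori vocabulary.
        CountVectorizerModel cvm = new CountVectorizerModel(new String[] {"a", "b", "c"})
          .setInputCol("text")
          .setOutputCol("feature");

        cvModel.transform(df).show();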





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115498
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be 
used as an Estimator to extract the vocabulary and generates a 
`CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across 
the corpus.
+An optional parameter "minDF" also affect the fitting process by 
specifying the minimum number (or fraction if < 1.0) of documents a term must 
appear in to be included in the vocabulary.
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
 and

+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.mllib.util.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define CountVectorizerModel with a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
+  .setInputCol("words")
+  .setOutputCol("features")
+
+// alternatively, fit a CountVectorizerModel from the corpus
+val cv2: CountVectorizerModel = new CountVectorizer()
+  .setInputCol("words")
+  .setOutputCol("features")
+  .setVocabSize(3)
+  .setMinDF(2) // a term must appear in more than 2 documents to be 
included in the vocabulary
+  .fit(df)
+
+cv.transform(df).select("features").collect()
--- End diff --

`.collect()` -> `.show()`





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115518
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be 
used as an Estimator to extract the vocabulary and generates a 
`CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across 
the corpus.
+An optional parameter "minDF" also affect the fitting process by 
specifying the minimum number (or fraction if < 1.0) of documents a term must 
appear in to be included in the vocabulary.
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
 and

+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.mllib.util.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define CountVectorizerModel with a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
+  .setInputCol("words")
+  .setOutputCol("features")
+
+// alternatively, fit a CountVectorizerModel from the corpus
+val cv2: CountVectorizerModel = new CountVectorizer()
+  .setInputCol("words")
+  .setOutputCol("features")
+  .setVocabSize(3)
+  .setMinDF(2) // a term must appear in more than 2 documents to be 
included in the vocabulary
+  .fit(df)
+
+cv.transform(df).select("features").collect()
+{% endhighlight %}
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) 
and

+[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html).
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.CountVectorizer;
+import org.apache.spark.ml.feature.CountVectorizerModel;
+import org.apache.spark.sql.DataFrame;
+
+// Input data: Each row is a bag of words from a sentence or document.
+JavaRDD jrdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Arrays.asList("a b c".split(" "))),
+  RowFactory.create(Arrays.asList("a b b c a".split(" ")))
+));
+StructType schema = new StructType(new StructField[]{
--- End diff --

* space before and after `[]`
* `StructType` and `StructField` missing imports
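
    Applying both points, the line and the imports it needs might look like this (a sketch;
    the merged fix may use a different import style):

        import org.apache.spark.sql.types.ArrayType;
        import org.apache.spark.sql.types.DataTypes;
        import org.apache.spark.sql.types.Metadata;
        import org.apache.spark.sql.types.StructField;
        import org.apache.spark.sql.types.StructType;

        StructType schema = new StructType(new StructField[] {
          new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });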





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115508
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be 
used as an Estimator to extract the vocabulary and generates a 
`CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across 
the corpus.
+An optional parameter "minDF" also affect the fitting process by 
specifying the minimum number (or fraction if < 1.0) of documents a term must 
appear in to be included in the vocabulary.
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
 and

+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.mllib.util.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define CountVectorizerModel with a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
+  .setInputCol("words")
+  .setOutputCol("features")
+
+// alternatively, fit a CountVectorizerModel from the corpus
+val cv2: CountVectorizerModel = new CountVectorizer()
+  .setInputCol("words")
+  .setOutputCol("features")
+  .setVocabSize(3)
+  .setMinDF(2) // a term must appear in more than 2 documents to be 
included in the vocabulary
+  .fit(df)
+
+cv.transform(df).select("features").collect()
+{% endhighlight %}
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) 
and

+[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html).
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.CountVectorizer;
+import org.apache.spark.ml.feature.CountVectorizerModel;
+import org.apache.spark.sql.DataFrame;
+
+// Input data: Each row is a bag of words from a sentence or document.
+JavaRDD jrdd = jsc.parallelize(Arrays.asList(
--- End diff --

`Arrays` missing import





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115521
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be 
used as an Estimator to extract the vocabulary and generates a 
`CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across 
the corpus.
+An optional parameter "minDF" also affect the fitting process by 
specifying the minimum number (or fraction if < 1.0) of documents a term must 
appear in to be included in the vocabulary.
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
 and

+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.mllib.util.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define CountVectorizerModel with a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
+  .setInputCol("words")
+  .setOutputCol("features")
+
+// alternatively, fit a CountVectorizerModel from the corpus
+val cv2: CountVectorizerModel = new CountVectorizer()
+  .setInputCol("words")
+  .setOutputCol("features")
+  .setVocabSize(3)
+  .setMinDF(2) // a term must appear in more than 2 documents to be 
included in the vocabulary
+  .fit(df)
+
+cv.transform(df).select("features").collect()
+{% endhighlight %}
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) 
and

+[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html).
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.CountVectorizer;
+import org.apache.spark.ml.feature.CountVectorizerModel;
+import org.apache.spark.sql.DataFrame;
+
+// Input data: Each row is a bag of words from a sentence or document.
+JavaRDD jrdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Arrays.asList("a b c".split(" "))),
+  RowFactory.create(Arrays.asList("a b b c a".split(" ")))
+));
+StructType schema = new StructType(new StructField[]{
+  new StructField("text", new ArrayType(DataTypes.StringType, true), 
false, Metadata.empty())
+});
+DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema);
--- End diff --

`documentDF` -> `df` to be consistent with Scala code





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115529
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be 
used as an Estimator to extract the vocabulary and generates a 
`CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across 
the corpus.
+An optional parameter "minDF" also affect the fitting process by 
specifying the minimum number (or fraction if < 1.0) of documents a term must 
appear in to be included in the vocabulary.
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
 and

+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.mllib.util.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define CountVectorizerModel with a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
+  .setInputCol("words")
+  .setOutputCol("features")
+
+// alternatively, fit a CountVectorizerModel from the corpus
+val cv2: CountVectorizerModel = new CountVectorizer()
+  .setInputCol("words")
+  .setOutputCol("features")
+  .setVocabSize(3)
+  .setMinDF(2) // a term must appear in more than 2 documents to be 
included in the vocabulary
+  .fit(df)
+
+cv.transform(df).select("features").collect()
+{% endhighlight %}
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) 
and

+[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html).
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.CountVectorizer;
+import org.apache.spark.ml.feature.CountVectorizerModel;
+import org.apache.spark.sql.DataFrame;
+
+// Input data: Each row is a bag of words from a sentence or document.
+JavaRDD jrdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Arrays.asList("a b c".split(" "))),
+  RowFactory.create(Arrays.asList("a b b c a".split(" ")))
+));
+StructType schema = new StructType(new StructField[]{
+  new StructField("text", new ArrayType(DataTypes.StringType, true), 
false, Metadata.empty())
+});
+DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema);
+
+// define CountVectorizerModel with a-priori vocabulary
+CountVectorizerModel cv = new CountVectorizerModel(new String[]{"a", "b", 
"c"})
+  .setInputCol("text")
+  .setOutputCol("feature");
+
+// alternatively, fit a CountVectorizerModel from the corpus
+CountVectorizerModel cv2 = new CountVectorizer()
+  .setInputCol("text")
+  .setOutputCol("feature")
+  .setVocabSize(3)
+  .setMinDF(2) // a term must appear in more than 2 documents to be 
included in the vocabulary
+  .fit(documentDF);
+
+DataFrame result = cv.transform(documentDF);
--- End diff --

use `cv.transform(documentDF).show()`





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115483
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be 
used as an Estimator to extract the vocabulary and generates a 
`CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across 
the corpus.
+An optional parameter "minDF" also affect the fitting process by 
specifying the minimum number (or fraction if < 1.0) of documents a term must 
appear in to be included in the vocabulary.
+
--- End diff --

It might be useful to show the table before and after, as in the user guide 
of `StringIndexer`.
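
    For the toy data in this example, such a before/after table could look roughly like the
    following, assuming the a-priori vocabulary ("a", "b", "c"); column names and exact
    formatting here are illustrative only:

        id | words           | features
        ---|-----------------|---------------------------
        0  | [a, b, c]       | (3,[0,1,2],[1.0,1.0,1.0])
        1  | [a, b, b, c, a] | (3,[0,1,2],[2.0,2.0,1.0])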





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115492
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be 
used as an Estimator to extract the vocabulary and generates a 
`CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across 
the corpus.
+An optional parameter "minDF" also affect the fitting process by 
specifying the minimum number (or fraction if < 1.0) of documents a term must 
appear in to be included in the vocabulary.
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
 and

+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.mllib.util.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define CountVectorizerModel with a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
+  .setInputCol("words")
+  .setOutputCol("features")
+
+// alternatively, fit a CountVectorizerModel from the corpus
+val cv2: CountVectorizerModel = new CountVectorizer()
--- End diff --

`cv2` -> `cvm` or `cvModel`.





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115487
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be 
used as an Estimator to extract the vocabulary and generates a 
`CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across 
the corpus.
+An optional parameter "minDF" also affect the fitting process by 
specifying the minimum number (or fraction if < 1.0) of documents a term must 
appear in to be included in the vocabulary.
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
 and

+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.mllib.util.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define CountVectorizerModel with a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
--- End diff --

Ditto. Show `CountVectorizer` first.





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115476
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
--- End diff --

Shall we start with `CountVectorizer` but not `CountVectorizerModel`? I 
guess most users would use `CountVectorizer` to build the vocabulary.





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8487#discussion_r38115514
  
--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 
 
 
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text 
documents to vectors of token counts.
+It takes parameter `vocabulary: Array[String]` and produces sparse 
representations for the documents over the vocabulary, which can then be passed 
to other algorithms like LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be 
used as an Estimator to extract the vocabulary and generates a 
`CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across 
the corpus.
+An optional parameter "minDF" also affect the fitting process by 
specifying the minimum number (or fraction if < 1.0) of documents a term must 
appear in to be included in the vocabulary.
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
 and

+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.mllib.util.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define CountVectorizerModel with a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
+  .setInputCol("words")
+  .setOutputCol("features")
+
+// alternatively, fit a CountVectorizerModel from the corpus
+val cv2: CountVectorizerModel = new CountVectorizer()
+  .setInputCol("words")
+  .setOutputCol("features")
+  .setVocabSize(3)
+  .setMinDF(2) // a term must appear in more than 2 documents to be 
included in the vocabulary
+  .fit(df)
+
+cv.transform(df).select("features").collect()
+{% endhighlight %}
+
+
+
+More details can be found in the API docs for

+[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) 
and

+[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html).
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.CountVectorizer;
+import org.apache.spark.ml.feature.CountVectorizerModel;
+import org.apache.spark.sql.DataFrame;
+
+// Input data: Each row is a bag of words from a sentence or document.
+JavaRDD jrdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Arrays.asList("a b c".split(" "))),
--- End diff --

* `Arrays.asList("a", "b", "c")` (simple logic for example code)
* `RowFactory` missing import
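
    With both points applied, the input construction might look like this (a sketch, assuming
    the same `jsc` JavaSparkContext used by the other Java examples in this guide):

        import java.util.Arrays;

        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.RowFactory;

        JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
          RowFactory.create(Arrays.asList("a", "b", "c")),
          RowFactory.create(Arrays.asList("a", "b", "b", "c", "a"))
        ));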





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135473727
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41692/





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135473725
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135473553
  
  [Test build #41692 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41692/console) for PR 8487 at commit [`4e37227`](https://github.com/apache/spark/commit/4e372279a6e8f5646e72e23b6d9e89c786196b5c).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135469480
  
  [Test build #41692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41692/consoleFull) for PR 8487 at commit [`4e37227`](https://github.com/apache/spark/commit/4e372279a6e8f5646e72e23b6d9e89c786196b5c).





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread hhbyyh
GitHub user hhbyyh opened a pull request:

https://github.com/apache/spark/pull/8487

[SPARK-9890] [Doc] [ML] User guide for CountVectorizer

jira: https://issues.apache.org/jira/browse/SPARK-9890

Document with Scala and Java examples.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hhbyyh/spark cvDoc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/8487.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8487


commit 80c550873e44c2d5ecf3b7d1bd7332367912c1a0
Author: Yuhao Yang 
Date:   2015-08-27T13:47:08Z

draft for scala

commit 4e372279a6e8f5646e72e23b6d9e89c786196b5c
Author: Yuhao Yang 
Date:   2015-08-27T15:17:25Z

add java example







[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135467409
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...

2015-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8487#issuecomment-135467461
  
Merged build started.

