[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-136155023 Thanks for helping review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135798504 LGTM except some minor issues with Java imports. I will fix those in a separate PR. Merged into master and branch-1.5. Thanks!
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/8487
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135681877 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135681880 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41740/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135681561 [Test build #41740 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41740/console) for PR 8487 at commit [`007c369`](https://github.com/apache/spark/commit/007c3691b9bc2a3f1c2f5007a1b6f4e73c5c4b06). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135671505 [Test build #41740 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41740/consoleFull) for PR 8487 at commit [`007c369`](https://github.com/apache/spark/commit/007c3691b9bc2a3f1c2f5007a1b6f4e73c5c4b06).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135669129 Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135668992 Merged build triggered.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115481 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. --- End diff -- minor, break lines at 100 chars
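The transformer behavior described in the quoted doc text, mapping each tokenized document to token counts over a fixed vocabulary, can be sketched without Spark. This is a conceptual illustration only: the class and method names below are invented for the sketch, and it produces a plain dense array where Spark's `CountVectorizerModel` produces sparse vectors.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the transform step of CountVectorizerModel:
// count how often each vocabulary term occurs in a tokenized document.
public class CountVectorizeSketch {
    static int[] countsOverVocabulary(List<String> tokens, String[] vocabulary) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < vocabulary.length; i++) {
            index.put(vocabulary[i], i);
        }
        int[] counts = new int[vocabulary.length];
        for (String token : tokens) {
            Integer i = index.get(token);
            if (i != null) {  // out-of-vocabulary tokens are simply ignored
                counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same a-priori vocabulary and sample documents as in the guide.
        String[] vocabulary = {"a", "b", "c"};
        System.out.println(Arrays.toString(
            countsOverVocabulary(Arrays.asList("a", "b", "c"), vocabulary)));           // [1, 1, 1]
        System.out.println(Arrays.toString(
            countsOverVocabulary(Arrays.asList("a", "b", "b", "c", "a"), vocabulary))); // [2, 2, 1]
    }
}
```

The two printed vectors correspond to the rows of the `features` column in the guide's example, in dense form.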
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115523 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. + +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generates a `CountVectorizerModel`. +It will select the top `vocabSize` words ordered by term frequency across the corpus. +An optional parameter "minDF" also affect the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. + + + +More details can be found in the API docs for +[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and +[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). 
+{% highlight scala %} +import org.apache.spark.ml.feature.CountVectorizer +import org.apache.spark.mllib.util.CountVectorizerModel + +val df = sqlContext.createDataFrame(Seq( + (0, Array("a", "b", "c")), + (1, Array("a", "b", "b", "c", "a")) +)).toDF("id", "words") + +// define CountVectorizerModel with a-priori vocabulary +val cv = new CountVectorizerModel(Array("a", "b", "c")) + .setInputCol("words") + .setOutputCol("features") + +// alternatively, fit a CountVectorizerModel from the corpus +val cv2: CountVectorizerModel = new CountVectorizer() + .setInputCol("words") + .setOutputCol("features") + .setVocabSize(3) + .setMinDF(2) // a term must appear in more than 2 documents to be included in the vocabulary + .fit(df) + +cv.transform(df).select("features").collect() +{% endhighlight %} + + + +More details can be found in the API docs for +[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and +[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html). +{% highlight java %} +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.ml.feature.CountVectorizer; +import org.apache.spark.ml.feature.CountVectorizerModel; +import org.apache.spark.sql.DataFrame; + +// Input data: Each row is a bag of words from a sentence or document. +JavaRDD jrdd = jsc.parallelize(Arrays.asList( + RowFactory.create(Arrays.asList("a b c".split(" "))), + RowFactory.create(Arrays.asList("a b b c a".split(" "))) +)); +StructType schema = new StructType(new StructField[]{ + new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty()) +}); +DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema); + +// define CountVectorizerModel with a-priori vocabulary +CountVectorizerModel cv = new CountVectorizerModel(new String[]{"a", "b", "c"}) --- End diff -- Ditto. Show `CountVectorizer` first. 
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115498 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. + +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generates a `CountVectorizerModel`. +It will select the top `vocabSize` words ordered by term frequency across the corpus. +An optional parameter "minDF" also affect the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. + + + +More details can be found in the API docs for +[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and +[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). 
+{% highlight scala %} +import org.apache.spark.ml.feature.CountVectorizer +import org.apache.spark.mllib.util.CountVectorizerModel + +val df = sqlContext.createDataFrame(Seq( + (0, Array("a", "b", "c")), + (1, Array("a", "b", "b", "c", "a")) +)).toDF("id", "words") + +// define CountVectorizerModel with a-priori vocabulary +val cv = new CountVectorizerModel(Array("a", "b", "c")) + .setInputCol("words") + .setOutputCol("features") + +// alternatively, fit a CountVectorizerModel from the corpus +val cv2: CountVectorizerModel = new CountVectorizer() + .setInputCol("words") + .setOutputCol("features") + .setVocabSize(3) + .setMinDF(2) // a term must appear in more than 2 documents to be included in the vocabulary + .fit(df) + +cv.transform(df).select("features").collect() --- End diff -- `.collect()` -> `.show()`
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115518 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. + +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generates a `CountVectorizerModel`. +It will select the top `vocabSize` words ordered by term frequency across the corpus. +An optional parameter "minDF" also affect the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. + + + +More details can be found in the API docs for +[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and +[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). 
+{% highlight scala %} +import org.apache.spark.ml.feature.CountVectorizer +import org.apache.spark.mllib.util.CountVectorizerModel + +val df = sqlContext.createDataFrame(Seq( + (0, Array("a", "b", "c")), + (1, Array("a", "b", "b", "c", "a")) +)).toDF("id", "words") + +// define CountVectorizerModel with a-priori vocabulary +val cv = new CountVectorizerModel(Array("a", "b", "c")) + .setInputCol("words") + .setOutputCol("features") + +// alternatively, fit a CountVectorizerModel from the corpus +val cv2: CountVectorizerModel = new CountVectorizer() + .setInputCol("words") + .setOutputCol("features") + .setVocabSize(3) + .setMinDF(2) // a term must appear in more than 2 documents to be included in the vocabulary + .fit(df) + +cv.transform(df).select("features").collect() +{% endhighlight %} + + + +More details can be found in the API docs for +[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and +[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html). +{% highlight java %} +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.ml.feature.CountVectorizer; +import org.apache.spark.ml.feature.CountVectorizerModel; +import org.apache.spark.sql.DataFrame; + +// Input data: Each row is a bag of words from a sentence or document. +JavaRDD jrdd = jsc.parallelize(Arrays.asList( + RowFactory.create(Arrays.asList("a b c".split(" "))), + RowFactory.create(Arrays.asList("a b b c a".split(" "))) +)); +StructType schema = new StructType(new StructField[]{ --- End diff -- * space before and after `[]` * `StructType` and `StructField` missing imports
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115508 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. + +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generates a `CountVectorizerModel`. +It will select the top `vocabSize` words ordered by term frequency across the corpus. +An optional parameter "minDF" also affect the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. + + + +More details can be found in the API docs for +[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and +[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). 
+{% highlight scala %} +import org.apache.spark.ml.feature.CountVectorizer +import org.apache.spark.mllib.util.CountVectorizerModel + +val df = sqlContext.createDataFrame(Seq( + (0, Array("a", "b", "c")), + (1, Array("a", "b", "b", "c", "a")) +)).toDF("id", "words") + +// define CountVectorizerModel with a-priori vocabulary +val cv = new CountVectorizerModel(Array("a", "b", "c")) + .setInputCol("words") + .setOutputCol("features") + +// alternatively, fit a CountVectorizerModel from the corpus +val cv2: CountVectorizerModel = new CountVectorizer() + .setInputCol("words") + .setOutputCol("features") + .setVocabSize(3) + .setMinDF(2) // a term must appear in more than 2 documents to be included in the vocabulary + .fit(df) + +cv.transform(df).select("features").collect() +{% endhighlight %} + + + +More details can be found in the API docs for +[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and +[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html). +{% highlight java %} +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.ml.feature.CountVectorizer; +import org.apache.spark.ml.feature.CountVectorizerModel; +import org.apache.spark.sql.DataFrame; + +// Input data: Each row is a bag of words from a sentence or document. +JavaRDD jrdd = jsc.parallelize(Arrays.asList( --- End diff -- `Arrays` missing import
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115521 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. + +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generates a `CountVectorizerModel`. +It will select the top `vocabSize` words ordered by term frequency across the corpus. +An optional parameter "minDF" also affect the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. + + + +More details can be found in the API docs for +[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and +[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). 
+{% highlight scala %} +import org.apache.spark.ml.feature.CountVectorizer +import org.apache.spark.mllib.util.CountVectorizerModel + +val df = sqlContext.createDataFrame(Seq( + (0, Array("a", "b", "c")), + (1, Array("a", "b", "b", "c", "a")) +)).toDF("id", "words") + +// define CountVectorizerModel with a-priori vocabulary +val cv = new CountVectorizerModel(Array("a", "b", "c")) + .setInputCol("words") + .setOutputCol("features") + +// alternatively, fit a CountVectorizerModel from the corpus +val cv2: CountVectorizerModel = new CountVectorizer() + .setInputCol("words") + .setOutputCol("features") + .setVocabSize(3) + .setMinDF(2) // a term must appear in more than 2 documents to be included in the vocabulary + .fit(df) + +cv.transform(df).select("features").collect() +{% endhighlight %} + + + +More details can be found in the API docs for +[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and +[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html). +{% highlight java %} +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.ml.feature.CountVectorizer; +import org.apache.spark.ml.feature.CountVectorizerModel; +import org.apache.spark.sql.DataFrame; + +// Input data: Each row is a bag of words from a sentence or document. +JavaRDD jrdd = jsc.parallelize(Arrays.asList( + RowFactory.create(Arrays.asList("a b c".split(" "))), + RowFactory.create(Arrays.asList("a b b c a".split(" "))) +)); +StructType schema = new StructType(new StructField[]{ + new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty()) +}); +DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema); --- End diff -- `documentDF` -> `df` to be consistent with Scala code
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115529 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. + +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generates a `CountVectorizerModel`. +It will select the top `vocabSize` words ordered by term frequency across the corpus. +An optional parameter "minDF" also affect the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. + + + +More details can be found in the API docs for +[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and +[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). 
+{% highlight scala %} +import org.apache.spark.ml.feature.CountVectorizer +import org.apache.spark.mllib.util.CountVectorizerModel + +val df = sqlContext.createDataFrame(Seq( + (0, Array("a", "b", "c")), + (1, Array("a", "b", "b", "c", "a")) +)).toDF("id", "words") + +// define CountVectorizerModel with a-priori vocabulary +val cv = new CountVectorizerModel(Array("a", "b", "c")) + .setInputCol("words") + .setOutputCol("features") + +// alternatively, fit a CountVectorizerModel from the corpus +val cv2: CountVectorizerModel = new CountVectorizer() + .setInputCol("words") + .setOutputCol("features") + .setVocabSize(3) + .setMinDF(2) // a term must appear in more than 2 documents to be included in the vocabulary + .fit(df) + +cv.transform(df).select("features").collect() +{% endhighlight %} + + + +More details can be found in the API docs for +[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and +[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html). +{% highlight java %} +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.ml.feature.CountVectorizer; +import org.apache.spark.ml.feature.CountVectorizerModel; +import org.apache.spark.sql.DataFrame; + +// Input data: Each row is a bag of words from a sentence or document. 
+JavaRDD jrdd = jsc.parallelize(Arrays.asList( + RowFactory.create(Arrays.asList("a b c".split(" "))), + RowFactory.create(Arrays.asList("a b b c a".split(" "))) +)); +StructType schema = new StructType(new StructField[]{ + new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty()) +}); +DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema); + +// define CountVectorizerModel with a-priori vocabulary +CountVectorizerModel cv = new CountVectorizerModel(new String[]{"a", "b", "c"}) + .setInputCol("text") + .setOutputCol("feature"); + +// alternatively, fit a CountVectorizerModel from the corpus +CountVectorizerModel cv2 = new CountVectorizer() + .setInputCol("text") + .setOutputCol("feature") + .setVocabSize(3) + .setMinDF(2) // a term must appear in more than 2 documents to be included in the vocabulary + .fit(documentDF); + +DataFrame result = cv.transform(documentDF); --- End diff -- use `cv.transform(documentDF).show()`
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115483 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. + +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generates a `CountVectorizerModel`. +It will select the top `vocabSize` words ordered by term frequency across the corpus. +An optional parameter "minDF" also affect the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. + --- End diff -- It might be useful to show the table before and after, as in the user guide of `StringIndexer`.
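The fitting process the quoted doc text describes, keeping the top `vocabSize` terms by corpus term frequency subject to the `minDF` document-frequency threshold, can also be sketched in plain Java. This is a hypothetical illustration of the described semantics, not Spark's implementation; the class and method names are invented, and the reading of `minDF` ("minimum number, or fraction of the corpus if < 1.0") follows the prose above rather than Spark's source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the fitting step: tally term frequency and
// document frequency over the corpus, drop terms below the minDF
// threshold, and keep the top vocabSize terms by total frequency.
public class FitVocabularySketch {
    static List<String> fitVocabulary(List<List<String>> corpus, int vocabSize, double minDF) {
        Map<String, Integer> termFreq = new HashMap<>(); // total occurrences across the corpus
        Map<String, Integer> docFreq = new HashMap<>();  // number of documents containing the term
        for (List<String> doc : corpus) {
            for (String token : doc) {
                termFreq.merge(token, 1, Integer::sum);
            }
            for (String token : new HashSet<>(doc)) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }
        // A fraction below 1.0 is interpreted relative to the corpus size.
        double minDocs = minDF < 1.0 ? minDF * corpus.size() : minDF;
        List<String> vocab = new ArrayList<>();
        termFreq.entrySet().stream()
            .filter(e -> docFreq.get(e.getKey()) >= minDocs)
            .sorted((x, y) -> y.getValue() - x.getValue()) // by term frequency, descending
            .limit(vocabSize)
            .forEach(e -> vocab.add(e.getKey()));
        return vocab;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
            List.of("a", "b", "c"),
            List.of("a", "b", "b", "c", "a"));
        // With minDF = 2, every term appears in both documents, so all three survive.
        System.out.println(fitVocabulary(corpus, 3, 2.0));
    }
}
```

With `vocabSize = 2` on the same corpus, "c" (corpus frequency 2) would be dropped in favor of "a" and "b" (frequency 3 each), which is the kind of before/after behavior the review suggests tabulating.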
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115492 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. + +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generates a `CountVectorizerModel`. +It will select the top `vocabSize` words ordered by term frequency across the corpus. +An optional parameter "minDF" also affect the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. + + + +More details can be found in the API docs for +[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and +[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). +{% highlight scala %} +import org.apache.spark.ml.feature.CountVectorizer +import org.apache.spark.mllib.util.CountVectorizerModel + +val df = sqlContext.createDataFrame(Seq( + (0, Array("a", "b", "c")), + (1, Array("a", "b", "b", "c", "a")) +)).toDF("id", "words") + +// define CountVectorizerModel with a-priori vocabulary +val cv = new CountVectorizerModel(Array("a", "b", "c")) + .setInputCol("words") + .setOutputCol("features") + +// alternatively, fit a CountVectorizerModel from the corpus +val cv2: CountVectorizerModel = new CountVectorizer() --- End diff -- `cv2` -> `cvm` or `cvModel`.
[GitHub] spark pull request: [SPARK-9890] [Doc] [ML] User guide for CountVe...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115487 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. + +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generate a `CountVectorizerModel`. +It will select the top `vocabSize` words ordered by term frequency across the corpus. +An optional parameter "minDF" also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. + + + +More details can be found in the API docs for +[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and +[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). +{% highlight scala %} +import org.apache.spark.ml.feature.CountVectorizer +import org.apache.spark.ml.feature.CountVectorizerModel + +val df = sqlContext.createDataFrame(Seq( + (0, Array("a", "b", "c")), + (1, Array("a", "b", "b", "c", "a")) +)).toDF("id", "words") + +// define CountVectorizerModel with a-priori vocabulary +val cv = new CountVectorizerModel(Array("a", "b", "c")) --- End diff -- Ditto. Show `CountVectorizer` first.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115476 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. --- End diff -- Shall we start with `CountVectorizer` but not `CountVectorizerModel`? I guess most users would use `CountVectorizer` to build the vocabulary.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8487#discussion_r38115514 --- Diff: docs/ml-features.md --- @@ -211,6 +211,87 @@ for feature in result.select("result").take(3): +## CountVectorizer + +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts. +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA. + +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generate a `CountVectorizerModel`. +It will select the top `vocabSize` words ordered by term frequency across the corpus. +An optional parameter "minDF" also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. + + + +More details can be found in the API docs for +[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and +[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel). 
+{% highlight scala %} +import org.apache.spark.ml.feature.CountVectorizer +import org.apache.spark.ml.feature.CountVectorizerModel + +val df = sqlContext.createDataFrame(Seq( + (0, Array("a", "b", "c")), + (1, Array("a", "b", "b", "c", "a")) +)).toDF("id", "words") + +// define CountVectorizerModel with a-priori vocabulary +val cv = new CountVectorizerModel(Array("a", "b", "c")) + .setInputCol("words") + .setOutputCol("features") + +// alternatively, fit a CountVectorizerModel from the corpus +val cv2: CountVectorizerModel = new CountVectorizer() + .setInputCol("words") + .setOutputCol("features") + .setVocabSize(3) + .setMinDF(2) // a term must appear in at least 2 documents to be included in the vocabulary + .fit(df) + +cv.transform(df).select("features").collect() +{% endhighlight %} + + + +More details can be found in the API docs for +[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and +[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html). +{% highlight java %} +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.ml.feature.CountVectorizer; +import org.apache.spark.ml.feature.CountVectorizerModel; +import org.apache.spark.sql.DataFrame; + +// Input data: Each row is a bag of words from a sentence or document. +JavaRDD jrdd = jsc.parallelize(Arrays.asList( + RowFactory.create(Arrays.asList("a b c".split(" "))), --- End diff -- * `Arrays.asList("a", "b", "c")` (simple logic for example code) * `RowFactory` missing import
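The fitting process the quoted doc text describes — selecting the top `vocabSize` terms by corpus-wide term frequency, after filtering by `minDF` as either a document count or (if < 1.0) a fraction of the corpus — can be sketched in plain Python. This is a simplified illustration, not Spark's implementation; in particular, the tie-breaking among equally frequent terms (alphabetical here) is an arbitrary choice for the example.

```python
from collections import Counter

def fit_vocabulary(documents, vocab_size, min_df):
    """Select up to `vocab_size` terms, ordered by total term frequency.

    A term qualifies only if it appears in at least `min_df` documents;
    when min_df < 1.0 it is interpreted as a fraction of the corpus size.
    """
    term_freq = Counter()  # total occurrences across the corpus
    doc_freq = Counter()   # number of documents containing the term
    for doc in documents:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    threshold = min_df * len(documents) if min_df < 1.0 else min_df
    candidates = [t for t in term_freq if doc_freq[t] >= threshold]
    # Most frequent first; break ties alphabetically (illustrative only)
    candidates.sort(key=lambda t: (-term_freq[t], t))
    return candidates[:vocab_size]

# Same toy corpus as the quoted Scala example
docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
print(fit_vocabulary(docs, vocab_size=3, min_df=2))
# ['a', 'b', 'c']
```

With `min_df=2` every term here survives the filter, since each appears in both documents; raising `min_df` or shrinking `vocab_size` prunes the vocabulary accordingly.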
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135473727 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41692/ Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135473725 Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135473553 [Test build #41692 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41692/console) for PR 8487 at commit [`4e37227`](https://github.com/apache/spark/commit/4e372279a6e8f5646e72e23b6d9e89c786196b5c). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135469480 [Test build #41692 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41692/consoleFull) for PR 8487 at commit [`4e37227`](https://github.com/apache/spark/commit/4e372279a6e8f5646e72e23b6d9e89c786196b5c).
GitHub user hhbyyh opened a pull request: https://github.com/apache/spark/pull/8487 [SPARK-9890] [Doc] [ML] User guide for CountVectorizer jira: https://issues.apache.org/jira/browse/SPARK-9890 document with Scala and java examples You can merge this pull request into a Git repository by running: $ git pull https://github.com/hhbyyh/spark cvDoc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/8487.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #8487 commit 80c550873e44c2d5ecf3b7d1bd7332367912c1a0 Author: Yuhao Yang Date: 2015-08-27T13:47:08Z draft for scala commit 4e372279a6e8f5646e72e23b6d9e89c786196b5c Author: Yuhao Yang Date: 2015-08-27T15:17:25Z add java example
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135467409 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8487#issuecomment-135467461 Merged build started.