Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8487#discussion_r38115514

--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 </div>
 </div>
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts.
+It takes a parameter `vocabulary: Array[String]` and produces sparse representations of the documents over the vocabulary, which can then be passed to other algorithms such as LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generate a `CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across the corpus.
+An optional parameter `minDF` also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+More details can be found in the API docs for
+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and
+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.ml.feature.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define a CountVectorizerModel with an a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
+  .setInputCol("words")
+  .setOutputCol("features")
+
+// alternatively, fit a CountVectorizerModel from the corpus
+val cv2: CountVectorizerModel = new CountVectorizer()
+  .setInputCol("words")
+  .setOutputCol("features")
+  .setVocabSize(3)
+  .setMinDF(2) // a term must appear in at least 2 documents to be included in the vocabulary
+  .fit(df)
+
+cv.transform(df).select("features").collect()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+More details can be found in the API docs for
+[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and
+[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html).
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.CountVectorizer;
+import org.apache.spark.ml.feature.CountVectorizerModel;
+import org.apache.spark.sql.DataFrame;
+
+// Input data: Each row is a bag of words from a sentence or document.
+JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Arrays.asList("a b c".split(" "))),
--- End diff --

* `Arrays.asList("a", "b", "c")` (simpler logic for example code)
* `RowFactory` missing import
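On the first bullet: `Arrays.asList("a", "b", "c")` builds exactly the same list as splitting the string, just without the indirection. A standalone check (class name is illustrative, not from the patch):

```java
import java.util.Arrays;
import java.util.List;

// Compare the two ways of building the example's bag of words.
public class ListConstruction {
    public static void main(String[] args) {
        List<String> viaSplit = Arrays.asList("a b c".split(" "));
        List<String> direct = Arrays.asList("a", "b", "c");
        // Identical contents; the varargs form states the intent directly.
        System.out.println(viaSplit.equals(direct)); // prints "true"
    }
}
```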
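For intuition about what the transformer in the quoted doc computes, here is a Spark-free sketch of counting tokens against a fixed vocabulary (class and method names are hypothetical, and a real `CountVectorizerModel` emits a sparse vector rather than a dense array). For the doc's second document `("a", "b", "b", "c", "a")` and vocabulary `["a", "b", "c"]` it yields counts `[2, 2, 1]`:

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of the per-document counting a CountVectorizerModel performs:
// count how often each vocabulary term occurs in the document's token list.
public class VocabCounts {
    static long[] countTokens(List<String> vocabulary, List<String> document) {
        long[] counts = new long[vocabulary.size()];
        for (String token : document) {
            int index = vocabulary.indexOf(token); // -1 for out-of-vocabulary tokens
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("a", "b", "c");
        List<String> document = Arrays.asList("a", "b", "b", "c", "a");
        System.out.println(Arrays.toString(countTokens(vocabulary, document))); // prints "[2, 2, 1]"
    }
}
```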