Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8487#discussion_r38115514

--- Diff: docs/ml-features.md ---
@@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
 </div>
 </div>
+## CountVectorizer
+
+As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts.
+It takes a parameter `vocabulary: Array[String]` and produces sparse representations of the documents over the vocabulary, which can then be passed to other algorithms such as LDA.
+
+When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generate a `CountVectorizerModel`.
+It will select the top `vocabSize` words ordered by term frequency across the corpus.
+An optional parameter `minDF` also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+More details can be found in the API docs for
+[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and
+[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+{% highlight scala %}
+import org.apache.spark.ml.feature.CountVectorizer
+import org.apache.spark.ml.feature.CountVectorizerModel
+
+val df = sqlContext.createDataFrame(Seq(
+  (0, Array("a", "b", "c")),
+  (1, Array("a", "b", "b", "c", "a"))
+)).toDF("id", "words")
+
+// define a CountVectorizerModel with an a-priori vocabulary
+val cv = new CountVectorizerModel(Array("a", "b", "c"))
+  .setInputCol("words")
+  .setOutputCol("features")
+
+// alternatively, fit a CountVectorizerModel from the corpus
+val cv2: CountVectorizerModel = new CountVectorizer()
+  .setInputCol("words")
+  .setOutputCol("features")
+  .setVocabSize(3)
+  .setMinDF(2) // a term must appear in at least 2 documents to be included in the vocabulary
+  .fit(df)
+
+cv.transform(df).select("features").collect()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+More details can be found in the API docs for
+[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and
+[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html).
+{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.CountVectorizer;
+import org.apache.spark.ml.feature.CountVectorizerModel;
+import org.apache.spark.sql.DataFrame;
+
+// Input data: Each row is a bag of words from a sentence or document.
+JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Arrays.asList("a b c".split(" "))),
--- End diff --

* `Arrays.asList("a", "b", "c")` (simpler logic for example code)
* `RowFactory` missing import
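On the first bullet: `Arrays.asList("a", "b", "c")` builds exactly the same list as splitting the string, just without the indirection. A standalone check (class name is illustrative, not from the patch):

```java
import java.util.Arrays;
import java.util.List;

// Compare the two ways of building the example's bag of words.
public class ListConstruction {
    public static void main(String[] args) {
        List<String> viaSplit = Arrays.asList("a b c".split(" "));
        List<String> direct = Arrays.asList("a", "b", "c");
        // Identical contents; the varargs form states the intent directly.
        System.out.println(viaSplit.equals(direct)); // prints "true"
    }
}
```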
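For intuition about what the transformer in the quoted doc computes, here is a Spark-free sketch of counting tokens against a fixed vocabulary (class and method names are hypothetical, and a real `CountVectorizerModel` emits a sparse vector rather than a dense array). For the doc's second document `("a", "b", "b", "c", "a")` and vocabulary `["a", "b", "c"]` it yields counts `[2, 2, 1]`:

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of the per-document counting a CountVectorizerModel performs:
// count how often each vocabulary term occurs in the document's token list.
public class VocabCounts {
    static long[] countTokens(List<String> vocabulary, List<String> document) {
        long[] counts = new long[vocabulary.size()];
        for (String token : document) {
            int index = vocabulary.indexOf(token); // -1 for out-of-vocabulary tokens
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("a", "b", "c");
        List<String> document = Arrays.asList("a", "b", "b", "c", "a");
        System.out.println(Arrays.toString(countTokens(vocabulary, document))); // prints "[2, 2, 1]"
    }
}
```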