GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8487#discussion_r38115483
  
    --- Diff: docs/ml-features.md ---
    @@ -211,6 +211,87 @@ for feature in result.select("result").take(3):
     </div>
     </div>
     
    +## CountVectorizer
    +
    +As a transformer, `CountVectorizerModel` converts a collection of text documents to vectors of token counts.
    +It takes parameter `vocabulary: Array[String]` and produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.
    +
    +When an a-priori dictionary is not available, `CountVectorizer` can be used as an Estimator to extract the vocabulary and generates a `CountVectorizerModel`.
    +It will select the top `vocabSize` words ordered by term frequency across the corpus.
    +An optional parameter "minDF" also affect the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.
    +
    --- End diff --
    
    It might be useful to show the DataFrame as a table before and after the transformation, as the user guide for `StringIndexer` does.
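
As a rough illustration of the behavior the diff describes, here is a plain-Python sketch. It is not Spark code and not the PR's implementation. The helper names `fit_vocabulary` and `transform`, the alphabetical tie-break, and the sample documents are all assumptions made for this example.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df=1.0):
    """Pick up to `vocab_size` terms, ordered by corpus-wide term frequency."""
    # min_df >= 1.0 is read as a document count; below 1.0 as a fraction of the corpus.
    min_docs = min_df if min_df >= 1.0 else min_df * len(docs)
    term_freq = Counter(tok for doc in docs for tok in doc)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    eligible = [t for t in term_freq if doc_freq[t] >= min_docs]
    # Descending term frequency; ties broken alphabetically (an assumption of this sketch).
    eligible.sort(key=lambda t: (-term_freq[t], t))
    return eligible[:vocab_size]


def transform(doc, vocabulary):
    """Return a sparse {vocabulary index: count} mapping for one document."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[tok] for tok in doc if tok in index)
    return dict(sorted(counts.items()))


# Hypothetical two-document corpus of pre-tokenized text.
docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocab = fit_vocabulary(docs, vocab_size=3, min_df=2)
vectors = [transform(doc, vocab) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```

Printing the tokens next to the resulting sparse counts gives the kind of before/after view suggested above, with each vector indexed by position in the fitted vocabulary.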


