[GitHub] spark pull request #20777: [SPARK-23615][ML][PYSPARK]Add maxDF Parameter to ...

BryanCutler Wed, 14 Mar 2018 15:09:38 -0700

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20777#discussion_r174625203
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
       def getMinDF: Double = $(minDF)
     
       /**
    -   * Specifies the maximum number of different documents a term must appear in to be included
    -   * in the vocabulary.
    -   * If this is an integer greater than or equal to 1, this specifies the number of documents
    -   * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
    -   * documents.
    +   * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
    +   * of different documents a term could appear in to be included in the vocabulary.
    +   * If this is an integer greater than or equal to 1, this specifies the maximum number of
    +   * documents the term could appear in; if this is a double in [0,1), then this specifies the
    +   * maximum fraction of documents the term could appear in. A term appears more frequently
    +   * than maxDF will be removed.
        *
    -   * Default: (2^64^) - 1
    +   * Default: (2^63) - 1
    --- End diff --
    
    good catch!
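
For reference, the thresholding rule described in the new doc comment can be sketched in plain Python. This is only an illustration of the documented semantics, not the code in the pull request: `resolve_threshold`, `filter_by_max_df`, and `LONG_MAX` are hypothetical names, and the sketch assumes a term is kept when its document frequency is less than or equal to the resolved limit. The default of (2^63) - 1 is the largest signed 64-bit value, so with it no term is filtered out.

```python
from collections import Counter

# Largest signed 64-bit value; mirrors the default stated in the diff.
LONG_MAX = 2**63 - 1


def resolve_threshold(value, num_docs):
    """Turn a maxDF-style setting into an absolute document count (hypothetical helper)."""
    # Values >= 1 are read as a document count; values in [0, 1) as a fraction.
    return value if value >= 1 else value * num_docs


def filter_by_max_df(docs, max_df=LONG_MAX):
    """Keep terms whose document frequency does not exceed max_df (assumed rule)."""
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    limit = resolve_threshold(max_df, len(docs))
    return sorted(term for term, df in doc_freq.items() if df <= limit)


docs = [["a", "b"], ["a", "c"], ["a", "b", "d"], ["a"]]
print(filter_by_max_df(docs))              # default: nothing is removed
print(filter_by_max_df(docs, max_df=2))    # drops "a", which appears in 4 documents
print(filter_by_max_df(docs, max_df=0.5))  # 0.5 * 4 = 2 documents, same result
```

Here an integer setting of 2 and a fractional setting of 0.5 resolve to the same limit because the toy corpus has four documents.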



