Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20367#discussion_r163911373
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
    @@ -160,6 +187,11 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0") 
override val uid: String)
         } else {
           $(minDF) * input.cache().count()
         }
    +    val maxDf = if ($(maxDF) >= 1.0) {
    +      $(maxDF)
    +    } else {
    +      $(maxDF) * input.cache().count()
    +    }
    --- End diff --
    
    I wonder if it's best to check that the max is >= the min? this may be the 
only place that can be done as one might be a fraction and the other not.
    
    Also this ends up counting the input twice when both params are set, which 
might be worth avoiding. It could be computed and saved only when needed.
    
    (I suppose this also fails to unpersist `input`, which could happen right 
after `wordCounts.count()`. That's a slightly different issue, so not too 
concerned about whether to fix it here or not.)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to