Thanks Sean for the quick response. Logged a Jira: https://issues.apache.org/jira/browse/SPARK-32662
Will send a pull request shortly.

Regards,
Jatin

On Wed, Aug 19, 2020 at 6:58 PM Sean Owen <sro...@gmail.com> wrote:

> I think that's true. You're welcome to open a pull request / JIRA to
> remove that requirement.
>
> On Wed, Aug 19, 2020 at 3:21 AM Jatin Puri <purija...@gmail.com> wrote:
> >
> > Hello,
> >
> > This is wrt
> > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244
> >
> > require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
> >
> > Currently, training `CountVectorizer` on an empty dataset results in
> > the following exception. But it is a perfectly valid use case to send it
> > empty data (or for minDF to filter everything out).
> > HashingTF works fine in such scenarios. CountVectorizer doesn't.
> >
> > Can we remove this constraint? Happy to send a pull request.
> >
> > java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
> >     at scala.Predef$.require(Predef.scala:224)
> >     at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
> >     at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
> >     at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
> >     at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
> >     at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> >     at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>

--
Jatin Puri
http://jatinpuri.com <http://www.jatinpuri.com>
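
For anyone following the thread, here is a rough, Spark-free sketch of the behaviour under discussion. It is only a simplified model, not the code in CountVectorizer.scala and not the eventual patch: `EmptyVocabSketch`, `buildVocab`, `fitStrict` and `fitLenient` are hypothetical names made up for illustration, and documents are assumed to be plain `Seq[String]` token lists. Only the `require` line and its message are taken from the quoted source.

```scala
// Simplified model of the vocabulary check; all helper names are hypothetical.
object EmptyVocabSketch {

  // Keep every term that appears in at least minDF documents.
  def buildVocab(docs: Seq[Seq[String]], minDF: Int): Array[String] =
    docs
      .flatMap(_.distinct)            // count each term once per document
      .groupBy(identity)
      .collect { case (term, hits) if hits.size >= minDF => term }
      .toArray
      .sorted

  // Behaviour as quoted in the thread: an empty vocabulary is rejected.
  def fitStrict(docs: Seq[Seq[String]], minDF: Int): Array[String] = {
    val vocab = buildVocab(docs, minDF)
    require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")
    vocab
  }

  // One possible relaxed behaviour (an assumption): accept an empty vocabulary.
  def fitLenient(docs: Seq[Seq[String]], minDF: Int): Array[String] =
    buildVocab(docs, minDF)

  def main(args: Array[String]): Unit = {
    val noDocs = Seq.empty[Seq[String]]
    // Strict variant fails with IllegalArgumentException on empty input.
    println(scala.util.Try(fitStrict(noDocs, minDF = 1)).isFailure)
    // Lenient variant simply returns an empty vocabulary.
    println(fitLenient(noDocs, minDF = 1).length)
  }
}
```

The same failure shows up in the strict variant when the input is non-empty but minDF is high enough to filter out every term, which is the second case mentioned in the original mail.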