Hello,

This is regarding:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244

require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")

Currently, training `CountVectorizer` on an empty dataset results in the
exception shown below. But sending it empty data is a perfectly valid use
case, and the same failure occurs when minDF filters out every term.
HashingTF works fine in such scenarios; CountVectorizer doesn't.
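
For reference, here is a minimal sketch of how this can be reproduced. It
assumes a local SparkSession; the object name `EmptyVocabRepro` and the
column names "words" and "features" are only illustrative, and I have not
pinned it to a particular Spark version.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF}
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction; names are illustrative, not from existing code.
object EmptyVocabRepro {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("EmptyVocabRepro")
      .getOrCreate()
    import spark.implicits._

    // An empty dataset with a single array<string> column.
    val empty = Seq.empty[Seq[String]].toDF("words")

    // HashingTF has no fit step, so transforming empty data succeeds.
    val hashed = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .transform(empty)
    println(s"HashingTF rows: ${hashed.count()}")

    // CountVectorizer.fit ends up with an empty vocabulary and is
    // expected to fail on the require(...) quoted above.
    try {
      new CountVectorizer()
        .setInputCol("words")
        .setOutputCol("features")
        .fit(empty)
    } catch {
      case e: IllegalArgumentException =>
        println(s"CountVectorizer failed: ${e.getMessage}")
    }

    spark.stop()
  }
}
```

The same failure should be reachable with non-empty data by setting minDF
higher than the document frequency of every term.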

Can we remove this constraint? Happy to send a pull request.

java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
    at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
    at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
