Ability to have CountVectorizerModel vocab as empty

Jatin Puri Wed, 19 Aug 2020 01:21:53 -0700

Hello,

This is with regard to
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244


require(vocab.length > 0, "The vocabulary size should be > 0. Lower minDF as necessary.")

Currently, fitting `CountVectorizer` on an empty dataset fails with the
exception below. But sending it empty data is a perfectly valid use case,
and the same failure occurs when minDF filters out every term.
HashingTF works fine in such scenarios; CountVectorizer doesn't.

Can we remove this constraint? Happy to send a pull request.

java.lang.IllegalArgumentException: requirement failed: The vocabulary size should be > 0. Lower minDF as necessary.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:236)
  at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:149)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
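
To make the request concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not Spark's implementation: `buildVocab`, `strictFit`, and `transform` are hypothetical stand-ins that only model document-frequency filtering and the quoted `require` check. It assumes that, without the check, an empty vocabulary would simply produce zero-length count vectors.

```scala
import scala.util.Try

object EmptyVocabSketch {
  // Hypothetical stand-in for vocabulary selection: keep terms that
  // occur in at least minDF documents, sorted for determinism.
  def buildVocab(docs: Seq[Seq[String]], minDF: Int): Array[String] =
    docs.flatMap(_.distinct)
      .groupBy(identity)
      .collect { case (term, hits) if hits.size >= minDF => term }
      .toArray
      .sorted

  // Mirrors the quoted check: an empty vocabulary is rejected.
  def strictFit(docs: Seq[Seq[String]], minDF: Int): Array[String] = {
    val vocab = buildVocab(docs, minDF)
    require(vocab.length > 0,
      "The vocabulary size should be > 0. Lower minDF as necessary.")
    vocab
  }

  // One count per vocabulary entry; with an empty vocabulary the
  // result is a zero-length vector (an assumption of this sketch).
  def transform(vocab: Array[String], doc: Seq[String]): Array[Double] =
    vocab.map(term => doc.count(_ == term).toDouble)

  def main(args: Array[String]): Unit = {
    // No term appears in two documents, so minDF = 2 filters everything.
    val docs = Seq(Seq("a", "b"), Seq("c"))

    val strict = Try(strictFit(docs, minDF = 2))
    println(s"strict fit failed: ${strict.isFailure}")

    val vocab = buildVocab(docs, minDF = 2)
    println(s"vocab size without the check: ${vocab.length}")
    println(s"vector length: ${transform(vocab, Seq("a", "c")).length}")
  }
}
```

In this model the only thing the check adds is the failure itself. Everything downstream of an empty vocabulary still behaves consistently, which is the behaviour being asked for here.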
