Marie Beaulieu created SPARK-25289:
--------------------------------------

             Summary: ChiSqSelector max on empty collection
                 Key: SPARK-25289
                 URL: https://issues.apache.org/jira/browse/SPARK-25289
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.3.1
            Reporter: Marie Beaulieu
In org.apache.spark.mllib.feature.ChiSqSelector.fit, a max is taken on a possibly empty collection. I am using Spark 2.3.1.

Here is an example to reproduce:
{code:java}
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
implicit val spark = sqlContext.sparkSession

val labeledPoints = (0 to 1).map { n =>
  val v = Vectors.dense((1 to 3).map(_ => n * 1.0).toArray)
  LabeledPoint(n.toDouble, v)
}
val rdd = sc.parallelize(labeledPoints)
val selector = new ChiSqSelector().setSelectorType("fdr").setFdr(0.05)
selector.fit(rdd)
{code}
Here is the stack trace:
{code:java}
java.lang.UnsupportedOperationException: empty.max
  at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
  at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234)
  at org.apache.spark.mllib.feature.ChiSqSelector.fit(ChiSqSelector.scala:280)
{code}
Looking at line 280 in ChiSqSelector.scala, it is pretty clear how the collection can be empty: in the "fdr" branch, the indices of features whose p-values pass the Benjamini-Hochberg threshold are collected, and max is then taken over those indices. When no feature passes the threshold, as in the example above, max is called on an empty array. A simple non-empty validation should do the trick.
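For illustration, here is a minimal, self-contained sketch of the kind of non-empty check suggested above. The p-values and variable names are hypothetical stand-ins for what fit computes from the chi-squared tests; this mirrors the shape of the Benjamini-Hochberg selection, not an actual patch.
{code:java}
// Illustrative sketch only, not the committed fix.
object FdrGuardSketch {
  def main(args: Array[String]): Unit = {
    val fdr = 0.05
    // P-values sorted ascending (hypothetical data where nothing is significant).
    val sortedPValues = Array(0.9, 0.95, 0.99)
    // Benjamini-Hochberg: keep index i when p(i) <= fdr * (i + 1) / n.
    val passed = sortedPValues.zipWithIndex
      .filter { case (p, i) => p <= fdr * (i + 1) / sortedPValues.length }
      .map(_._2)
    // Guard the max: when no feature passes, select zero features instead of
    // throwing UnsupportedOperationException: empty.max.
    val numSelected = if (passed.isEmpty) 0 else passed.max + 1
    println(s"selected $numSelected feature(s)") // prints: selected 0 feature(s)
  }
}
{code}
With these inputs nothing passes the 5% threshold, so the guarded version selects zero features where the 2.3.1 code throws.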