Matthew Tovbin created SPARK-23152:
--------------------------------------

             Summary: Invalid guard condition in org.apache.spark.ml.classification.Classifier
                 Key: SPARK-23152
                 URL: https://issues.apache.org/jira/browse/SPARK-23152
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib
    Affects Versions: 2.1.2, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0, 2.1.3, 2.3.0, 2.3.1
            Reporter: Matthew Tovbin
When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (e.g. NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) on an empty dataset, a NullPointerException is thrown.

Steps to reproduce:
{code:java}
val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)])
new DecisionTreeClassifier()
  .setLabelCol("_1")
  .setFeaturesCol("_2")
  .fit(data)
{code}

The error:
{code:java}
java.lang.NullPointerException: Value at index 0 is null
  at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
  at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
  at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
  at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
  at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
{code}

The problem is an incorrect guard condition in org.apache.spark.ml.classification.Classifier.getNumClasses:
{code:java}
val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
if (maxLabelRow.isEmpty) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}

When the input dataset is empty, "maxLabelRow" is not: the max aggregate still produces a single Row, whose only value is null (which is why the subsequent getDouble call at index 0 throws). The guard therefore also has to check the value inside that Row for null, e.g.:
{code:java}
if (maxLabelRow.isEmpty || maxLabelRow.head.isNullAt(0)) {
  throw new SparkException("ML algorithm was given empty dataset.")
}
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
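The reason the emptiness check never fires can be sketched without Spark at all. The following minimal plain-Scala sketch (all names here are hypothetical stand-ins, not Spark APIs) mimics how an aggregate such as max() over an empty input still yields exactly one row, with a null/absent value inside it, so the row array itself is non-empty:

{code:java}
object EmptyAggregateDemo {
  // Hypothetical stand-in for org.apache.spark.sql.Row with one nullable column.
  final case class Row(value: Option[Double]) {
    def isNullAt(i: Int): Boolean = value.isEmpty
  }

  // Mimics dataset.select(max(...)).take(1): even for empty input,
  // the aggregate produces exactly one Row (holding no value).
  def maxRow(labels: Seq[Double]): Array[Row] =
    Array(Row(labels.reduceOption(math.max)))

  def main(args: Array[String]): Unit = {
    val rows = maxRow(Seq.empty)
    assert(rows.nonEmpty)         // so a guard on emptiness of the array never fires
    assert(rows.head.isNullAt(0)) // the null value inside is what must be checked
    println("row array is non-empty; value inside is null")
  }
}
{code}

The same shape applies to the real guard: checking the array's emptiness alone is insufficient, because take(1) on an aggregate result returns one row regardless of input size.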