[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331194#comment-16331194 ]
Apache Spark commented on SPARK-23152:
--------------------------------------

User 'tovbinm' has created a pull request for this issue:
https://github.com/apache/spark/pull/20321

> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> ------------------------------------------------------------------------
>
>                 Key: SPARK-23152
>                 URL: https://issues.apache.org/jira/browse/SPARK-23152
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 2.3.1
>            Reporter: Matthew Tovbin
>            Priority: Minor
>              Labels: easyfix
>
> When fitting a classifier that extends
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes,
> DecisionTreeClassifier, RandomForestClassifier) on an empty dataset, a
> misleading NullPointerException is thrown.
> Steps to reproduce:
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
> The error:
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
>   at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
>   at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
>   at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
>   at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
>   at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
>   at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
> {code}
> The problem is caused by an incorrect guard condition in the function
> getNumClasses at org.apache.spark.ml.classification.Classifier:106:
> {code:java}
> val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
> if (maxLabelRow.isEmpty) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
> When the input dataset is empty, the resulting "maxLabelRow" array is not
> empty: it contains a single Row(null) element. The isEmpty check therefore
> never fires, and the subsequent getDouble call dereferences the null value.
> Proposed solution: extend the condition to also treat a null maximum label
> as an empty dataset.
> {code:java}
> if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
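To make the failure mode concrete, here is a minimal, self-contained sketch of the behavior described above. The local SparkSession setup and the object name EmptyMaxDemo are assumptions added for illustration only; the select(max(...)).take(1) pattern and the proposed guard are taken verbatim from the issue.

{code:java}
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

// Hypothetical demo object, not part of Spark; added only for illustration.
object EmptyMaxDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-23152-demo")
      .getOrCreate()
    import spark.implicits._

    // Empty dataset of (label, features) pairs, as in the bug report.
    val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)])

    // On an empty dataset, select(max(...)).take(1) does NOT return an
    // empty array: it returns one Row holding a single null value.
    val maxLabelRow = data.select(max($"_1")).take(1)
    println(maxLabelRow.isEmpty)        // false -> the existing guard never fires
    println(maxLabelRow(0).isNullAt(0)) // true  -> getDouble(0) would throw an NPE

    // The proposed guard from the issue catches both cases.
    if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
      throw new SparkException("ML algorithm was given empty dataset.")
    }

    spark.stop()
  }
}
{code}

Run in spark-shell or via spark-submit: the two println lines print false and true, confirming that the existing isEmpty check alone cannot detect the empty input.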