sam created SPARK-26770:
---------------------------

             Summary: Misleading/unhelpful error message when wrapping a null in an Option
                 Key: SPARK-26770
                 URL: https://issues.apache.org/jira/browse/SPARK-26770
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.2
            Reporter: sam
This

{code}
// Using options to indicate nullable fields
case class Product(productID: Option[Int], productName: Option[String])

val productExtract: Dataset[Product] = spark.createDataset(Seq(
  Product(
    productID = Some(6050286),
    // user mistake here, should be `None` not `Some(null)`
    productName = Some(null)
  )))

productExtract.count()
{code}

will give an error like the one below. The error is thrown from quite deep down, but there should be some handling logic further up that checks for nulls and gives a more informative error message. E.g. it could tell the user which field is null, or it could detect the `Some(null)` mistake and suggest using `None` instead. Whatever the exception, it shouldn't be an NPE: this is clearly a user error, so it should surface as some kind of user-error exception.

{code}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in stage 1.0 (TID 276, 10.139.64.8, executor 1): java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:194)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:620)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:112)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:384)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

I've seen quite a few other people hit this error, but I don't think it's for the same reason:

https://docs.databricks.com/spark/latest/data-sources/tips/redshift-npe.html
https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/Dt6ilC9Dn54
https://issues.apache.org/jira/browse/SPARK-17195
https://issues.apache.org/jira/browse/SPARK-18859
https://github.com/datastax/spark-cassandra-connector/issues/1062
https://stackoverflow.com/questions/39875711/spark-sql-2-0-nullpointerexception-with-a-valid-postgresql-query
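Until there's a proper check inside Spark, a caller-side workaround is to normalise wrapped nulls before the Dataset is built. A minimal sketch, assuming the `Product` case class above; the `denull` helper is a name I made up, not part of any Spark API. It relies on `Option(x)` returning `None` when `x` is null:

{code}
// Hypothetical workaround, not Spark API: collapse Some(null) into None so the
// generated row writer never dereferences a null it doesn't expect.
def denull[A](opt: Option[A]): Option[A] = opt.flatMap(Option(_))

val rawName: String = null // a value that unexpectedly came back null

val safeExtract: Dataset[Product] = spark.createDataset(Seq(
  Product(
    productID = Some(6050286),
    productName = denull(Some(rawName)) // Some(null) collapses to None
  )))

safeExtract.count() // no NPE; the field is just a null column value
{code}

The same `flatMap(Option(_))` trick works for any reference-typed `Option` field; for `Option[Int]` it's a no-op, since primitives can't be null.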