sam created SPARK-26770:
---------------------------

             Summary: Misleading/unhelpful error message when wrapping a null in an Option
                 Key: SPARK-26770
                 URL: https://issues.apache.org/jira/browse/SPARK-26770
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.2
            Reporter: sam


This

{code}

import org.apache.spark.sql.Dataset
import spark.implicits._  // provides the encoder for Product

// Using Options to indicate nullable fields
case class Product(productID: Option[Int],
                   productName: Option[String])

val productExtract: Dataset[Product] =
  spark.createDataset(Seq(
    Product(
      productID = Some(6050286),
      // user mistake here: should be `None`, not `Some(null)`
      productName = Some(null)
    )))

productExtract.count()

{code}

will give an error like the one below. The error is thrown from deep inside the generated code, but there should be handling logic further up that checks for nulls and gives a more informative message: e.g. it could tell the user which field is null, or detect the `Some(null)` mistake and suggest using `None` instead.

Whatever the exception, it shouldn't be an NPE. This is clearly a user error, so it should surface as some kind of user-error exception.
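For context, the trap is that `Some(null)` compiles fine, while `Option(...)` is the null-safe constructor the user presumably wanted. A minimal sketch of the distinction (plain Scala, no Spark required):

```scala
object SomeNullDemo extends App {
  // A value that may have come back null from Java interop, JDBC, etc.
  val name: String = null

  // Option.apply is null-safe: a null argument becomes None.
  val safe: Option[String] = Option(name)

  // Some.apply is not: it happily wraps the null, producing the
  // Some(null) value that later trips the row writer at runtime.
  val unsafe: Option[String] = Some(name)

  assert(safe.isEmpty)                          // None
  assert(unsafe.isDefined && unsafe.get == null) // Some(null)
}
```

So one cheap improvement on the Spark side would be to check for a defined Option holding null before serialization and fail with a message pointing the user at `None` or `Option(...)`.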

{code}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in stage 1.0 (TID 276, 10.139.64.8, executor 1): java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:194)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:620)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:112)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:384)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

{code}

I've seen quite a few other people hit this error, though I don't think it was for the same reason:

https://docs.databricks.com/spark/latest/data-sources/tips/redshift-npe.html
https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/Dt6ilC9Dn54
https://issues.apache.org/jira/browse/SPARK-17195
https://issues.apache.org/jira/browse/SPARK-18859
https://github.com/datastax/spark-cassandra-connector/issues/1062
https://stackoverflow.com/questions/39875711/spark-sql-2-0-nullpointerexception-with-a-valid-postgresql-query



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
