[ https://issues.apache.org/jira/browse/SPARK-40199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634085#comment-17634085 ]
Apache Spark commented on SPARK-40199:
--------------------------------------

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/38660

> Spark throws NPE without useful message when NULL value appears in non-null schema
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-40199
>                 URL: https://issues.apache.org/jira/browse/SPARK-40199
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: Erik Krogen
>            Priority: Major
>
> Currently in some cases, if Spark encounters a NULL value where the schema
> indicates that the column/field should be non-null, it will throw a
> {{NullPointerException}} with no message and thus no way to debug further.
> This can happen, for example, if you have a UDF which is erroneously marked
> as {{asNonNullable()}}, or if you read input data where the actual values
> don't match the schema (which could happen e.g. with Avro if the reader
> provides a schema declaring non-null although the data was written with null
> values).
> As an example of how to reproduce:
> {code:scala}
> // The UDF always returns null but is declared non-nullable.
> val badUDF = spark.udf.register[String, Int]("bad_udf", in => null).asNonNullable()
> Seq(1, 2).toDF("c1").select(badUDF($"c1")).collect()
> {code}
> This throws an exception like:
> {code}
> Driver stacktrace:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1) (xxxxxxxxxx executor driver): java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>     at org.apache.spark.scheduler.Task.run(Task.scala:139)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> For a user, this is very confusing: it looks like there is a bug in Spark. We
> have had many users report such problems, and though we can guide them to a
> schema-data mismatch, there is no indication of which field might contain the
> bad values, so a laborious data exploration process is required to find and
> remedy it.
> We should provide a better error message in such cases.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)