[ 
https://issues.apache.org/jira/browse/SPARK-40541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617439#comment-17617439
 ] 

Ivan Sadikov commented on SPARK-40541:
--------------------------------------

What is the question here? Does marking column as nullable make it work? 

Code generation inlines the field access if you don't have nullability, so we 
know the field is non-nullable. Making it nullable should add the check and 
mitigate the issue.

> NullPointerException with UTF8String.getBaseObject() when UDF
> -------------------------------------------------------------
>
>                 Key: SPARK-40541
>                 URL: https://issues.apache.org/jira/browse/SPARK-40541
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Garret Wilson
>            Priority: Major
>
> I'm using Spark 3.3.0 on Windows with Java 17. I have a UDF that returns 
> several columns using:
> {code}
> StructType schema = createStructType(List.of(… createStructField("bar", 
> StringType, false)));
> UserDefinedFunction foobarUdf = udf((String foo) -> {
>   …
> }, schema).asNondeterministic();
> {code}
> Note that I specify {{false}} for {{bar}}'s nullability. It turns out that 
> {{foobarUdf}} actually returns {{null}} for {{bar}} sometimes. In the 
> relational database world, I would expect that if my integrity constraint 
> wasn't met, the database would say, "you put {{null}} in {{bar}}, but {{bar}} 
> is not nullable".
> What I did _not_ expect is what Spark does: it hits a 
> {{NullPointerException}} and has a nervous breakdown:
> {noformat}
> [ERROR] Exception in task 0.0 in stage 8.0 (TID 5)
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.spark.unsafe.types.UTF8String.getBaseObject()" because "input" is 
> null
>         at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>         at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_1$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>         at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:136)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.base/java.lang.Thread.run(Thread.java:833)
> [WARN] Lost task 0.0 in stage 8.0 (TID 5) (xps-13-9310 executor driver): 
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.spark.unsafe.types.UTF8String.getBaseObject()" because "input" is 
> null
>         at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>         at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_1$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>         at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:136)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.base/java.lang.Thread.run(Thread.java:833)
> {noformat}
> It finally fails with:
> {noformat}
> [ERROR] Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, 
> most recent failure: Lost task 0.0 in stage 8.0 (TID 5) (xps-13-9310 executor 
> driver): java.lang.NullPointerException: Cannot invoke 
> "org.apache.spark.unsafe.types.UTF8String.getBaseObject()" because "input" is 
> null
>         at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>         at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_1$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.hashAgg_doAggregateWithoutKey_0$(Unknown
>  Source)
>         at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>         at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>         at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>         at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>         at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:136)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.base/java.lang.Thread.run(Thread.java:833)
> {noformat}
> See also [Spark dies with NullPointerException UTF8String.getBaseObject() 
> "input" is null|https://stackoverflow.com/q/73815800].
> The irony is that when I actually intend to mark a column as non-nullable 
> when reading in data, Spark ignores me. See [Spark 3.3.0 not honoring my 
> schema nullability in Java|https://stackoverflow.com/q/73476202].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to