[ https://issues.apache.org/jira/browse/SPARK-49812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885297#comment-17885297 ]
Martin Andersson commented on SPARK-49812:
------------------------------------------

Changing the schema to a single column results in a single-row data frame with a null value for zstd compressed files.

{code}
+--------+
|some_col|
+--------+
|    NULL|
+--------+
{code}
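For reference, an untested sketch of the read that produces the output above; it mirrors the repro in the quoted description below, with only the schema narrowed to one column:

{code:java}
// Untested sketch: identical to the reporter's repro below except for the
// single-column schema. Against the zstd compressed file this reportedly
// returns one row containing NULL rather than an empty DataFrame.
spark.read()
    .option("header", "false")
    .option("lineSep", "|")
    .option("multiLine", "true")
    .option("quote", "")
    .schema("some_col string")
    .csv("empty.csv.zst")
    .show();
{code}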
> NPE when reading empty zstd compressed csv file
> -----------------------------------------------
>
>                 Key: SPARK-49812
>                 URL: https://issues.apache.org/jira/browse/SPARK-49812
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.3
>         Environment: Ubuntu 22.04
>                      Java 17
>                      spark 3.5.3
>            Reporter: Martin Andersson
>            Priority: Minor
>
> Reading an empty zstd compressed csv file results in an NPE. The same file
> works fine when not compressed.
> {code:sh}
> $ touch empty.csv
> $ zstd < empty.csv > empty.csv.zst
> {code}
> This works as expected, resulting in an empty DataFrame.
> {code:java}
> spark.read()
>     .option("header", "false")
>     .option("lineSep", "|")
>     .option("multiLine", "true")
>     .option("quote", "")
>     .schema("some_col string, other_col string")
>     .csv("empty.csv")
>     .show();
> {code}
> Changing the path to "empty.csv.zst" triggers an exception. The exception is
> only triggered for zstd files when both properties "multiLine" and "quote"
> are set.
> {code:java}
> INFO DAGScheduler: ResultStage 0 (show at Main.java:24) failed in 0.408 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.32.18 executor driver): org.apache.spark.SparkException: Encountered error while reading file file:///tmp/empty.csv.zst. Details:
> 	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:864)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
> 	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
> 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
> 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
> 	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:141)
> 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
> 	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
> 	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> 	at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.spark.unsafe.types.UTF8String.toString()" because "currentInput" is null
> 	at org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:333)
> 	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseStream$1(UnivocityParser.scala:400)
> 	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60)
> 	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseStream$3(UnivocityParser.scala:409)
> 	at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.next(UnivocityParser.scala:432)
> 	at scala.collection.Iterator$$anon$10.nextCur(Iterator.scala:587)
> 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
> 	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
> 	... 22 more
> {code}
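Since the description notes the exception is only triggered when both "multiLine" and "quote" are set, a possible workaround is to read the compressed file without those options. An untested sketch, assuming the options can be dropped for this file:

{code:java}
// Untested workaround sketch: per the report the exception only occurs
// when both "multiLine" and "quote" are set, so omitting them should let
// the empty .zst file load as an empty DataFrame.
spark.read()
    .option("header", "false")
    .option("lineSep", "|")
    .schema("some_col string, other_col string")
    .csv("empty.csv.zst")
    .show();
{code}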