[ https://issues.apache.org/jira/browse/SPARK-27542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-27542:
-------------------------------
    Summary: SparkHadoopWriter doesn't set call setWorkOutputPath, causing NPEs when using certain legacy OutputFormats  (was: SparkHadoopWriter doesn't set call setWorkOutputPath, causing NPEs for some legacy OutputFormats)

> SparkHadoopWriter doesn't set call setWorkOutputPath, causing NPEs when using certain legacy OutputFormats
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27542
>                 URL: https://issues.apache.org/jira/browse/SPARK-27542
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.0
>            Reporter: Josh Rosen
>            Priority: Major
>
> In Hadoop MapReduce, tasks call {{FileOutputFormat.setWorkOutputPath()}} after configuring the output committer:
> [https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L611]
>
> Spark doesn't do this:
> [https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L115]
>
> As a result, certain legacy output formats can fail to work out-of-the-box on Spark. In particular, {{org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat}} can fail with NullPointerExceptions, e.g.
> {code:java}
> java.lang.NullPointerException
>       at org.apache.hadoop.fs.Path.<init>(Path.java:105)
>       at org.apache.hadoop.fs.Path.<init>(Path.java:94)
>       at org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69)
>       [...]
>       at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:96)
> {code}
> It looks like someone on GitHub has hit the same problem: https://gist.github.com/themodernlife/e3b07c23ba978f6cc98b73e3f3609abe
>
> Tez had a very similar bug: https://issues.apache.org/jira/browse/TEZ-3348
>
> We might be able to fix this by having Spark mimic Hadoop's logic. I'm unsure of whether that change would pose compatibility risks for other existing workloads, though.
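
To make the failure mode concrete, here is a minimal, dependency-free sketch of the mechanism described above. It does not use Hadoop or Spark classes: the configuration is a plain map, and every name in it (the class, both keys, {{setUpTask}}) is a hypothetical stand-in, not a real API. The rule used to pick the work path (task-attempt directory if the committer provides one, otherwise the job output directory) is an assumption of the sketch rather than a statement of what a Spark fix should do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class WorkOutputPathSketch {
    // Hypothetical configuration keys; stand-ins for the real job settings.
    static final String OUTPUT_DIR_KEY = "sketch.output.dir";
    static final String WORK_OUTPUT_DIR_KEY = "sketch.task.output.dir";

    // Stand-in for a getter in the style of getWorkOutputPath():
    // returns null when the framework never recorded a work path.
    static String getWorkOutputPath(Map<String, String> conf) {
        return conf.get(WORK_OUTPUT_DIR_KEY);
    }

    // Stand-in for a legacy output format building its default work file.
    // It trusts the work path to be set, so a missing value surfaces as an NPE.
    static String getDefaultWorkFile(Map<String, String> conf, String fileName) {
        String workDir = Objects.requireNonNull(getWorkOutputPath(conf));
        return workDir + "/" + fileName;
    }

    // Stand-in for the framework step that runs after the committer is configured.
    // Assumption: use the task-attempt directory when one is available,
    // otherwise fall back to the job output directory.
    static void setUpTask(Map<String, String> conf, String taskAttemptDir) {
        String outputDir = conf.get(OUTPUT_DIR_KEY);
        if (outputDir != null) {
            conf.put(WORK_OUTPUT_DIR_KEY, taskAttemptDir != null ? taskAttemptDir : outputDir);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(OUTPUT_DIR_KEY, "/out");

        try {
            // Without the setup step, the legacy-style format hits a null work path.
            getDefaultWorkFile(conf, "part-00000");
        } catch (NullPointerException e) {
            System.out.println("NPE: work output path was never set");
        }

        // With the setup step, the same call succeeds.
        setUpTask(conf, "/out/_temporary/attempt_0");
        System.out.println(getDefaultWorkFile(conf, "part-00000"));
    }
}
```

In this model the output format is unchanged between the failing and the working call; only the framework-side setup differs. That is also why a fix on the Spark side could affect other workloads: anything that reads the work path would start seeing a value where it previously saw none.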