[jira] [Updated] (SPARK-27542) SparkHadoopWriter doesn't call setWorkOutputPath, causing NPEs when using certain legacy OutputFormats

Josh Rosen (JIRA) Mon, 22 Apr 2019 16:54:25 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-27542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Josh Rosen updated SPARK-27542:
-------------------------------
    Summary: SparkHadoopWriter doesn't call setWorkOutputPath, causing NPEs 
when using certain legacy OutputFormats  (was: SparkHadoopWriter doesn't 
call setWorkOutputPath, causing NPEs for some legacy OutputFormats)

> SparkHadoopWriter doesn't call setWorkOutputPath, causing NPEs when using 
> certain legacy OutputFormats
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-27542
>                 URL: https://issues.apache.org/jira/browse/SPARK-27542
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.4.0
>            Reporter: Josh Rosen
>            Priority: Major
>
> In Hadoop MapReduce, tasks call {{FileOutputFormat.setWorkOutputPath()}} 
> after configuring the output committer: 
> [https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L611]
>  
> Spark doesn't do this: 
> [https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L115]
> As a result, certain legacy output formats can fail to work out-of-the-box on 
> Spark. In particular, 
> {{org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat}} can fail 
> with NullPointerExceptions, e.g.
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.Path.<init>(Path.java:105)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:94)
>   at org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69)
> [...]
>   at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:96)
> {code}
> It looks like someone on GitHub has hit the same problem: 
> https://gist.github.com/themodernlife/e3b07c23ba978f6cc98b73e3f3609abe
> Tez had a very similar bug: https://issues.apache.org/jira/browse/TEZ-3348
> We might be able to fix this by having Spark mimic Hadoop's logic. I'm not 
> sure whether that change would pose compatibility risks for other existing 
> workloads, though.
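
For illustration only, here is a rough, self-contained model of the "mimic Hadoop's logic" idea. It is not Spark or Hadoop code: the class and method names are hypothetical, and a plain Map stands in for the JobConf. It assumes that the work output path is stored under the mapreduce.task.output.dir property, that the task prefers the committer's task attempt path when a FileOutputCommitter is in use, and that it otherwise falls back to the job output path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model; a Map<String, String> stands in for Hadoop's JobConf.
public class WorkOutputPathSketch {
    // Property names as used by Hadoop (assumed here, not read from Hadoop itself).
    public static final String OUTPUT_DIR = "mapreduce.output.fileoutputformat.outputdir";
    public static final String TASK_OUTPUT_DIR = "mapreduce.task.output.dir";

    /**
     * Models the step a task would perform after setting up the committer.
     * taskAttemptPath is non-null only when the committer is file-based
     * (a stand-in for asking a FileOutputCommitter for its task attempt path).
     */
    public static void configureWorkOutputPath(Map<String, String> conf, String taskAttemptPath) {
        String outputPath = conf.get(OUTPUT_DIR);
        if (outputPath == null) {
            return; // no file-based output configured, so there is nothing to set
        }
        // Prefer the committer's attempt directory; otherwise use the job output path.
        conf.put(TASK_OUTPUT_DIR, taskAttemptPath != null ? taskAttemptPath : outputPath);
    }

    /**
     * Models a legacy OutputFormat that builds its file name from the work
     * output path and fails with an NPE when that path was never set.
     */
    public static String defaultWorkFile(Map<String, String> conf, String fileName) {
        String workDir = conf.get(TASK_OUTPUT_DIR);
        if (workDir == null) {
            throw new NullPointerException("work output path is not set");
        }
        return workDir + "/" + fileName;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(OUTPUT_DIR, "/data/out");
        // Without this call, defaultWorkFile below would throw.
        configureWorkOutputPath(conf, "/data/out/_temporary/0/_temporary/attempt_0");
        System.out.println(defaultWorkFile(conf, "part-00000.parquet"));
    }
}
```

In this model, skipping configureWorkOutputPath reproduces the failure mode described above, while calling it before the writer is opened avoids it. Whether the equivalent change is safe for existing Spark workloads is the open question raised in the issue.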



