[ https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171420#comment-16171420 ]
Steve Loughran commented on SPARK-21549:
----------------------------------------

# You can't rely on committers having output and temp dirs. Subclasses of {{FileOutputCommitter}} *must*, though there's no official mechanism for querying that because {{getOutputPath()}} is private.
# Hadoop 3.0 has added (MAPREDUCE-6956) a new superclass of {{FileOutputCommitter}}, [PathOutputCommitter|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitter.java], which pulls up the {{getWorkPath}} method to be more general (so that you can have output committers which tell Spark and Hive where their intermediate data should go, without them being subclasses of {{FileOutputCommitter}}).
# I'm happy to pull up {{getOutputPath}} to that class, and if people can give me a patch for it *this week* I'll add it for 3.0 beta 1.

Regarding the committer here, you might want to think of moving off, or subclassing, {{HadoopMapReduceCommitProtocol}}. This is what I've done in [PathOutputCommitProtocol|https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/hortonworks/spark/cloud/PathOutputCommitProtocol.scala], though I can see it's still relying on the superclass to get that property. Again, if we can patch the new {{PathOutputCommitter}} class for a {{getOutputPath}}, I'll use that here. And yes, while that new mapreduce class will take a long time to surface in spark core, you can use it independently from later this year.

If you are playing with different committers outside Spark's own codebase, pick up the ORC Hive tests from [https://github.com/hortonworks-spark/cloud-integration/tree/master/cloud-examples/src/test/scala/org/apache/spark/sql/sources]. These are just some of the Spark SQL tests reworked slightly so that they'll work with any FileSystem impl, rather than just local file:// paths.
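The idea behind MAPREDUCE-6956 can be sketched in plain Scala (a hedged illustration only: {{PathReportingCommitter}}, {{ObjectStoreCommitter}} and {{stagingDirFor}} are made-up names standing in for the real Hadoop types, and paths are modelled as plain strings): a common superclass exposes the work/output paths, so callers no longer need to cast to {{FileOutputCommitter}}.

```scala
// Hedged sketch of the MAPREDUCE-6956 idea; these names are illustrative
// stand-ins, not Hadoop's actual API.
abstract class PathReportingCommitter {
  // Roughly what PathOutputCommitter pulls up in Hadoop 3.0.
  def getWorkPath: Option[String]
  // What the comment above proposes also pulling up from FileOutputCommitter.
  def getOutputPath: Option[String]
}

// A hypothetical object-store committer that is NOT a FileOutputCommitter
// subclass, yet can still tell Spark/Hive where its data goes.
final class ObjectStoreCommitter(dest: String) extends PathReportingCommitter {
  override def getWorkPath: Option[String]   = Some(s"$dest/__temporary")
  override def getOutputPath: Option[String] = Some(dest)
}

// A commit protocol can now query any committer generically, without casting.
def stagingDirFor(c: PathReportingCommitter): Option[String] = c.getWorkPath
```

With that shape, a protocol like {{PathOutputCommitProtocol}} can ask any committer for its destination instead of reaching for a private field on one concrete subclass.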
Ping me directly if you are playing with new committers, & look at MAPREDUCE-6823 to see if that'd be useful to you (& how it could be improved, given it's still not in the codebase).

> Spark fails to complete job correctly in case of OutputFormat which do not
> write into hdfs
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21549
>                 URL: https://issues.apache.org/jira/browse/SPARK-21549
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: spark 2.2.0
>                      scala 2.11
>            Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete jobs correctly in case of custom OutputFormat implementations.
> There are OutputFormat implementations which do not need to use the standard Hadoop property *mapreduce.output.fileoutputformat.outputdir*.
> [But Spark reads this property from the configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79] while setting up an OutputCommitter:
> {code:scala}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ...
> and then uses this property later on while [committing the job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132], [aborting the job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141], and [creating the task's temp path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95].
> In those cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
> at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
> at org.apache.hadoop.fs.Path.<init>(Path.java:135)
> at org.apache.hadoop.fs.Path.<init>(Path.java:89)
> at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
> at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
> at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
> ...
> {code}
> So it seems that all the jobs which use OutputFormats that don't write data into HDFS-compatible file systems are broken.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
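The failure mode in the report can be reproduced in miniature without any Hadoop dependency (a hedged sketch: {{lookupOutputDir}}, {{checkPathArg}} and {{safeOutputDir}} are illustrative helpers, with {{checkPathArg}} merely mimicking the null check in {{org.apache.hadoop.fs.Path}}). A custom OutputFormat that writes somewhere other than a filesystem never sets the output-dir property, so the raw lookup yields null; treating the value as optional is the shape of fix the report is asking for.

```scala
// Sketch only: a custom OutputFormat that writes to, say, a database never
// sets this property, so a raw configuration lookup yields null.
val OutputDirKey = "mapreduce.output.fileoutputformat.outputdir"

def lookupOutputDir(conf: Map[String, String]): String =
  conf.getOrElse(OutputDirKey, null)

// Mimics org.apache.hadoop.fs.Path's argument check, which throws
// IllegalArgumentException("Can not create a Path from a null string").
def checkPathArg(p: String): String = {
  require(p != null, "Can not create a Path from a null string")
  p
}

// The direction of the fix: treat the output directory as optional and only
// build paths (and staging dirs) when the property is actually present.
def safeOutputDir(conf: Map[String, String]): Option[String] =
  Option(lookupOutputDir(conf))
```

With the guard, a committer that has no output directory simply gets {{None}} instead of a constructor blowing up during {{abortJob}}.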