steveloughran opened a new pull request #30141:
URL: https://github.com/apache/spark/pull/30141


   ### What changes were proposed in this pull request?
   
   This reinstates the old option `spark.sql.sources.write.jobUUID` to set a 
unique job UUID in the jobconf, so that Hadoop MR committers have a unique ID 
which is (a) consistent across tasks and workers and (b) not brittle the way 
generated-timestamp job IDs are. Timestamp-based IDs satisfy the JobID format, 
but because they are generated per-thread, they may not always be unique within 
a cluster.
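   A committer consuming this option might look roughly like the following sketch. This is not Spark's actual code: the property name is the one reinstated by this PR, but the jobconf is stood in for by a plain `Map`, and the fallback logic is illustrative only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class CommitterSketch {
    // Property name reinstated by this PR
    static final String JOB_UUID = "spark.sql.sources.write.jobUUID";

    public static void main(String[] args) {
        // Driver side: set the UUID once, before job setup.
        // (Stand-in Map; the real mechanism is the Hadoop jobconf.)
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put(JOB_UUID, UUID.randomUUID().toString());

        // Committer side: the job and every task read the same value.
        String uuid = jobConf.get(JOB_UUID);
        if (uuid == null) {
            // Fall back to a locally generated ID only when the option is absent
            uuid = UUID.randomUUID().toString();
        }
        // Use the UUID, not the timestamp-based job attempt ID, to name staging dirs
        System.out.println("staging dir suffix: " + uuid);
    }
}
```

   The point of the fallback ordering is that when the driver sets the property, all tasks agree on the ID; only an absent property degrades to a per-process ID.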
   
   ### Why are the changes needed?
   
   If a committer (e.g. the S3A staging committer) uses the job attempt ID as a 
unique ID, then any two jobs started within the same second have the same ID 
and can clash.
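   The clash described above can be illustrated with a hypothetical sketch (not Spark's actual ID-generation code): two IDs derived from the same second-resolution timestamp collide, while random UUIDs stay distinct.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

public class JobIdDemo {
    // Timestamp-based ID at second resolution, in the spirit of the
    // generated job IDs described above (format is illustrative)
    static String timestampId(LocalDateTime t) {
        return "job_" + t.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss")) + "_0000";
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.now();
        // Two "jobs" started within the same second get identical IDs:
        System.out.println(timestampId(now).equals(timestampId(now)));

        // UUID-based IDs remain distinct:
        System.out.println(UUID.randomUUID().equals(UUID.randomUUID()));
    }
}
```

   The first comparison prints `true` (the clash); the second prints `false`.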
   
   ### Does this PR introduce _any_ user-facing change?
   
   Good question. It is "developer-facing" in the context of anyone writing a 
committer. But it reinstates a property which was in Spark 1.x and "went away".
   
   ### How was this patch tested?
   
   Testing: no test here. You'd have to create a new committer which extracted 
the value in both job and task(s) and verified consistency. That is possible 
(with a task output whose records contained the UUID), but it would be 
pretty convoluted and have a high maintenance cost.
          
   Because it's trying to address a race condition, it's hard to reproduce the 
problem downstream and so verify a fix in a test run... I'll just look at the 
logs to see what temporary dir is being used in the cluster FS and verify it's 
a UUID.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


