There's been discussion in various PRs about what committers do, what they are 
expected to do, and how they get coordinated; the general conclusion in each 
case has been "this should be covered on the developer list".

Here, then, are the three PRs where this has surfaced.


[SPARK-22026][SQL] data source v2 write path 
https://github.com/apache/spark/pull/19269

[SPARK-22078][SQL] clarify exception behaviors for all data source v2 
interfaces  https://github.com/apache/spark/pull/19623

[SPARK-22162] Executors and the driver should use consistent JobIDs in the RDD 
commit protocol : https://github.com/apache/spark/pull/19848

Right now, the Hadoop side of things is non-normatively written up in
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committer_architecture.md

with some errata in a work-in-progress patch

https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-15107-correctness/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committer_architecture.md

Those docs are incomplete, and I don't know of anything equivalent covering the 
Spark driver's commit algorithm, so understanding it has mostly been a matter 
of tracing back through the IDE and running with a modified committer set up to 
do things like fail in task or job commit.
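To make "fail in task or job commit" concrete, here is a minimal toy model of a two-phase, v1-style commit protocol (tasks write to attempt directories, task commit promotes them, job commit merges into the destination) with an injectable failure point. All names here (`ToyCommitter`, `fail_in`, etc.) are hypothetical illustrations, not the real Hadoop or Spark APIs:

```python
import os
import shutil
import tempfile

class ToyCommitter:
    """Toy two-phase commit: not the real Hadoop OutputCommitter API."""

    def __init__(self, dest, fail_in=None):
        self.dest = dest
        self.pending = os.path.join(dest, "_temporary")
        self.fail_in = fail_in  # None, "task", or "job": injected failure point
        os.makedirs(self.pending, exist_ok=True)

    def write_task(self, task_id, filename, data):
        # Each task writes into its own attempt directory.
        attempt = os.path.join(self.pending, f"attempt_{task_id}")
        os.makedirs(attempt, exist_ok=True)
        with open(os.path.join(attempt, filename), "w") as f:
            f.write(data)

    def commit_task(self, task_id):
        if self.fail_in == "task":
            raise IOError(f"injected failure committing task {task_id}")
        # Promote the attempt dir with a rename: atomic on a real
        # filesystem, which is exactly what an object store doesn't give you.
        attempt = os.path.join(self.pending, f"attempt_{task_id}")
        committed = os.path.join(self.pending, f"task_{task_id}")
        os.rename(attempt, committed)

    def commit_job(self):
        if self.fail_in == "job":
            raise IOError("injected failure committing job")
        # Merge every committed task's files into the final destination.
        for d in os.listdir(self.pending):
            if d.startswith("task_"):
                task_dir = os.path.join(self.pending, d)
                for f in os.listdir(task_dir):
                    shutil.move(os.path.join(task_dir, f),
                                os.path.join(self.dest, f))
        shutil.rmtree(self.pending)

# Happy path: one task writes, commits, job commits.
dest = tempfile.mkdtemp()
committer = ToyCommitter(dest)
committer.write_task(0, "part-00000", "hello")
committer.commit_task(0)
committer.commit_job()
```

Pointing a harness at this with `fail_in="task"` or `fail_in="job"` is the shape of the probing described above: you observe what the framework does when each phase throws, and whether the destination is left clean.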

Having spent time integrating Hadoop's forthcoming S3A committers with all of 
this, I suspect there may be some mismatch between what is expected of 
committers and what they deliver, but I'll need to add a bit more fault 
injection there to be sure. I'll have a draft of a paper up in a week or so 
for anyone interested in this area.

-Steve

