Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/19848

> I was hoping you would know the hadoop committer semantics better than me

I might, but that's only because I spent time with a debugger and asking people the history of things, which is essentially an oral folklore of "how things failed". Suffice to say: the Google paper left some important details out. The best public documentation is [Task Committers](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committer_architecture.md). I'm writing a proper paper on it, but it's not ready yet.

* Every job attempt must have the same JobID. That matters for both cleanup and for `FileOutputCommitter` recovery. It is less relevant for Spark, as Spark doesn't do driver restart.
* The Job ID must be the same on the driver and the executors, as that's how the driver knows where to look for output.
* All attempts to work on the same data part must have the same JobID and TaskID, but different TaskAttemptIDs. That's critical for task-commit concurrency resolution, specifically the failure mode "task attempt 1 is given permission to commit, does the commit, doesn't return; task attempt 2 is kicked off". Having the same task ID, plus a guarantee in the commit protocol that any existing attempt's output is overwritten, ensures that only one attempt's output is committed by the job, even if two task attempts actually commit their work.
* V1 job recovery expects task and job attempt IDs to be sequential for a task; they are used to work out what the working dir of previous attempts would be, and it assumes that if attemptID == 0, there's no previous attempt to recover.
* There is no expectation of TaskID uniqueness across jobs. I don't know about broader uniqueness, e.g. across a full filesystem.
  It could matter if people were playing with temporary paths, but the convention of putting everything under `$dest/_temporary/$jobAttemptId` means you only need uniqueness amongst all jobs writing to the same destination path.
* The S3A committers want unique paths to put things (staging committer: in HDFS and the local FS; magic committer:
* Stocator expects the job ID to be unique throughout the job; again, it doesn't care about global uniqueness.
* I don't know about committers to destinations other than HDFS or S3.

One use case to consider, and @rdblue will have opinions there, is more than one job doing an append write to the same destination path. Every job ID must be unique enough to guarantee that the two independent jobs (even in different Spark clusters) can have their own intermediate datasets without conflict, even when created under the same parent dir.

W.r.t. Spark, StageID => Job ID everywhere is needed for both sides of the committer to be consistent, and having a unique stage ID will potentially line you up for interesting things later. Put differently: if it's easy enough to be unique, why wouldn't you?

* Side issue: the hadoop code to parse task, job and attempt IDs is pretty brittle. Tread carefully, and never call `toString()` on them in a log statement.
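To make the ID hierarchy above concrete, here is a rough sketch (plain Python, not Hadoop's real `JobID`/`TaskID`/`TaskAttemptID` classes) of the MR naming convention: two attempts at the same data part share the JobID and TaskID and differ only in the trailing attempt number, which is exactly what lets the commit protocol treat them as rivals for the same output.

```python
# Sketch of the Hadoop MR ID naming convention:
#   job_<clusterTimestamp>_<jobNum>
#   task_<clusterTimestamp>_<jobNum>_<m|r>_<taskNum>
#   attempt_<clusterTimestamp>_<jobNum>_<m|r>_<taskNum>_<attemptNum>

def job_id(cluster_ts: str, job_num: int) -> str:
    """e.g. job_201712061150_0001"""
    return f"job_{cluster_ts}_{job_num:04d}"

def task_id(jid: str, task_type: str, task_num: int) -> str:
    """e.g. task_201712061150_0001_m_000003"""
    return jid.replace("job_", "task_", 1) + f"_{task_type}_{task_num:06d}"

def task_attempt_id(tid: str, attempt: int) -> str:
    """e.g. attempt_201712061150_0001_m_000003_0"""
    return tid.replace("task_", "attempt_", 1) + f"_{attempt}"

jid = job_id("201712061150", 1)
tid = task_id(jid, "m", 3)

# Two attempts at the same data part: same JobID and TaskID,
# different TaskAttemptID -- so both target the same committed
# output location, and only one attempt's output can win.
ta1 = task_attempt_id(tid, 0)
ta2 = task_attempt_id(tid, 1)
assert ta1 != ta2
assert ta1.startswith("attempt_201712061150_0001_m_000003")
```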
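The `$dest/_temporary/$jobAttemptId` convention is what makes the append use case work: two jobs with unique IDs get disjoint intermediate directories under a shared destination, while two jobs that somehow generated the same ID would collide. A minimal sketch (the helper name is mine, and the exact layout below `_temporary` is an assumption based on the convention quoted above):

```python
import posixpath

def job_temp_dir(dest: str, job_attempt_id: str) -> str:
    # Hypothetical helper mirroring the $dest/_temporary/$jobAttemptId
    # convention; real committers nest task-attempt dirs below this.
    return posixpath.join(dest, "_temporary", job_attempt_id)

# Two independent jobs appending to the same destination path:
d1 = job_temp_dir("/data/out", "job_201712061150_0001")
d2 = job_temp_dir("/data/out", "job_201712061151_0001")
assert d1 != d2  # unique job IDs => disjoint intermediate datasets

# If both jobs (e.g. in different clusters) generated the same job ID,
# their intermediate dirs would be the same path -- and conflict.
assert job_temp_dir("/data/out", "job_0001") == job_temp_dir("/data/out", "job_0001")
```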