Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/19269 One other thing that would be good now and invaluable in future is for the `DataWriter.commit()` call to return a `Map[String,Long]` of statistics alongside the message sent to the committer. The spec should say these statistics "MUST be specific to the writer/its thread", so that aggregating the stats across all tasks produces valid output. What does this to? Lets the writers provide real statistics about the cost of operations. If you look at the changes to {{FileFormatWriter}} to collect stats, it's just listing the written file after close() and returning file size as its sole metric. We are moving to instrumenting more of the Hadoop output streams with statistics collection, and, once there's an API for getting at the values, would allow the driver to aggregate stats from the writer for the writes and the commits. Examples: bytes written, files written, records written, # of 503/throttle events sent back from S3, # of network failures and retried operations, ...etc. Once the writers start collecting this data, there's motivation for the layers underneath to collect more and publish what they get. As an example, here's [the data collected by `S3AOutputStream`](https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInstrumentat ion.java#L765), exposed via `OutputStream.toString()` as well as fed to hadoop metrics. As well as bytes written, it tracks blocks PUT, retries on completion operations, and how many times a block PUT failed and had to be retried. That means that a job can have results like "it took X seconds and wrote Y bytes but it had to repost Z bytes of data, which made things slow" There's a side issue: what is the proposed mandated re-entrancy policy in the writers? Is expectation that `DataWriter.write()` will always be called by a single thread, and therefore no need to implement thread safe writes to the output stream, or is there a risk that >1 thread may write sequentially (preventing thread local storage for collecting stats) or even simultaneously. (if so, the example is in trouble as the java.io APIs say no need to support re-entrancy, even if HDFS does. Again, this is the kind of thing where some specification can highlight the policy, otherwise people will code against the implementation, which is precisely why HDFS DFSOutputStreams are stuck doing thread-safety writes (HBase, see).
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org