Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/19269
  
    One other thing that would be good now and invaluable in future is for the 
`DataWriter.commit()` call to return a `Map[String,Long]` of statistics 
alongside the message sent to the committer. The spec should say these 
statistics "MUST be specific to the writer/its thread", so that aggregating the 
stats across all tasks produces valid output.
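
    A minimal sketch of that suggestion, written as hypothetical Scala shapes rather 
than the PR's actual interfaces (the `CommitResult` wrapper is made up for 
illustration):

```scala
// Sketch only: a hypothetical shape in which commit() hands back per-writer
// statistics alongside the message sent to the committer.
trait WriterCommitMessage extends Serializable

// Statistics are a plain name -> counter map so the driver can sum them
// across tasks; each map MUST be specific to its writer/thread.
case class CommitResult(
    message: WriterCommitMessage,
    statistics: Map[String, Long])

trait DataWriter[T] {
  def write(record: T): Unit
  def commit(): CommitResult   // today: only the message to the committer
  def abort(): Unit
}
```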
    
    What does this do? It lets the writers provide real statistics about the cost 
of operations. If you look at the changes to `FileFormatWriter` to collect stats, 
it's just listing the written file after `close()` and returning file size as its 
sole metric. We are moving towards instrumenting more of the Hadoop output streams 
with statistics collection, and, once there's an API for getting at the values, 
that would let the driver aggregate stats from the writers for the writes and the 
commits. Examples: bytes written, files written, records written, # of 
503/throttle events sent back from S3, # of network failures and retried 
operations, etc. Once the writers start collecting this data, there's motivation 
for the layers underneath to collect more and publish what they get. As an 
example, here's [the data collected by 
`S3AOutputStream`](https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInstrumentation.java#L765), 
exposed via `OutputStream.toString()` as well as fed to Hadoop metrics. As well as 
bytes written, it tracks blocks PUT, retries on completion operations, and how 
many times a block PUT failed and had to be retried. That means a job can have 
results like "it took X seconds and wrote Y bytes, but it had to repost Z bytes of 
data, which made things slow".
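
    Under that model, aggregation on the driver is just an entry-wise sum over the 
maps returned by every task; a rough sketch, reusing the hypothetical 
`CommitResult` above:

```scala
// Sketch: driver-side aggregation of per-writer statistics. Because each map
// is specific to one writer/thread, entry-wise summation across all tasks
// gives valid totals (bytes written, files written, 503/throttle events, ...).
def aggregateStats(results: Seq[CommitResult]): Map[String, Long] =
  results.foldLeft(Map.empty[String, Long]) { (acc, r) =>
    r.statistics.foldLeft(acc) { case (m, (k, v)) =>
      m.updated(k, m.getOrElse(k, 0L) + v)
    }
  }
```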
    
    There's a side issue: what re-entrancy policy is being proposed and mandated 
for the writers?
    
    Is the expectation that `DataWriter.write()` will always be called by a single 
thread, so there is no need to implement thread-safe writes to the output stream, 
or is there a risk that more than one thread may write sequentially (ruling out 
thread-local storage for collecting stats) or even simultaneously? If so, the 
example is in trouble, as the java.io APIs say there is no need to support 
re-entrancy, even if HDFS does. Again, this is the kind of thing where some 
specification can spell out the policy; otherwise people will code against the 
implementation, which is precisely why HDFS `DFSOutputStream`s are stuck doing 
thread-safe writes (see HBase).
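
    To make the cost of that choice concrete, here's a hedged sketch of the 
difference for a stats-collecting writer (class names are illustrative): with a 
single-thread-per-writer guarantee, plain fields suffice; without it, the writer 
and the stream underneath have to pay for thread-safe counters.

```scala
import java.util.concurrent.atomic.LongAdder

// If the spec guarantees one thread per DataWriter instance, plain fields
// are enough and can be snapshotted at commit() time.
class SingleThreadedStats {
  private var bytesWritten: Long = 0L
  def add(n: Long): Unit = bytesWritten += n
  def snapshot: Map[String, Long] = Map("bytesWritten" -> bytesWritten)
}

// If more than one thread may call write(), every counter needs to be
// thread-safe, e.g. a LongAdder, and thread-local collection is out.
class MultiThreadedStats {
  private val bytesWritten = new LongAdder
  def add(n: Long): Unit = bytesWritten.add(n)
  def snapshot: Map[String, Long] = Map("bytesWritten" -> bytesWritten.sum())
}
```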

