[ https://issues.apache.org/jira/browse/HDFS-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640739#comment-13640739 ]
Colin Patrick McCabe commented on HDFS-4698:
--------------------------------------------

Thanks for the review.

Looking at this again, I think it would not be too difficult to update the stats after every read, rather than just after close. That will make it a lot more useful, I think.

bq. Recommend adding comments to DFSInputStream.ReadStatistics explaining the meaning of the various fields, i.e. that SCR bytes will count for both SCR and "local bytes", that total >= local >= SCR, that remote bytes read can be determined by total - local, etc.

OK.

bq. For that matter, you might want to add a getRemoteBytesRead method to DFSInputStream.ReadStatistics to do the subtraction for the user.

Will add.

bq. Any thoughts about how this new feature should interact with the existing FileSystem#Statistics class? Valid answers include "not at all" and/or "this will be helpful as-is, we can think about that later."

The per-file metrics introduced by this change are helpful when you just want to know how many bytes you've read out of a currently open file, or whether you are getting short-circuit local reads most of the time when reading that file. {{FileSystem#Statistics}} is more about aggregate statistics for the client as a whole.

I'm a little more hesitant to add this kind of information there, for two reasons. One is that {{FileSystem#Statistics}} is supposed to be generic to all filesystems, but SCR is something of an implementation detail of HDFS. The other is that we currently update those stats from multiple threads after every read or write operation, so they are correspondingly more expensive in terms of CPU time.

I think the tl;dr is that we should think about that later.
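To make the field semantics discussed above concrete, here is a minimal sketch of per-stream counters that keep total >= local >= SCR and derive remote bytes as total - local. This is not the code from the attached patch: the class name {{ReadStatisticsSketch}} and the {{add*}} methods are hypothetical, chosen only for illustration, and the getters merely follow the naming suggested in the review.

```java
// Hypothetical sketch, not the HDFS-4698 patch: per-stream read counters
// where short-circuit bytes also count as local, and local also count as total.
final class ReadStatisticsSketch {
    private long totalBytesRead;
    private long totalLocalBytesRead;
    private long totalShortCircuitBytesRead;

    // A read served by a remote datanode only increases the total.
    public synchronized void addRemoteBytes(long n) {
        totalBytesRead += n;
    }

    // A read served by a local datanode increases total and local.
    public synchronized void addLocalBytes(long n) {
        totalBytesRead += n;
        totalLocalBytesRead += n;
    }

    // A short-circuit read is also a local read, so all three counters grow.
    public synchronized void addShortCircuitBytes(long n) {
        totalBytesRead += n;
        totalLocalBytesRead += n;
        totalShortCircuitBytesRead += n;
    }

    public synchronized long getTotalBytesRead() {
        return totalBytesRead;
    }

    public synchronized long getTotalLocalBytesRead() {
        return totalLocalBytesRead;
    }

    public synchronized long getTotalShortCircuitBytesRead() {
        return totalShortCircuitBytesRead;
    }

    // Convenience accessor: does the total - local subtraction for the caller.
    public synchronized long getRemoteBytesRead() {
        return totalBytesRead - totalLocalBytesRead;
    }

    public static void main(String[] args) {
        ReadStatisticsSketch stats = new ReadStatisticsSketch();
        stats.addRemoteBytes(100);
        stats.addLocalBytes(50);
        stats.addShortCircuitBytes(25);
        System.out.println("total=" + stats.getTotalBytesRead()
            + " local=" + stats.getTotalLocalBytesRead()
            + " scr=" + stats.getTotalShortCircuitBytesRead()
            + " remote=" + stats.getRemoteBytesRead());
    }
}
```

Because every update path adds to the broader counters as well, the invariant total >= local >= SCR holds by construction, and a caller never needs to know which counters overlap.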

> provide client-side metrics for remote reads, local reads, and short-circuit reads
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-4698
>                 URL: https://issues.apache.org/jira/browse/HDFS-4698
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 2.0.4-alpha
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>         Attachments: HDFS-4698.001.patch
>
>
> We should provide metrics to let clients know how many bytes of data they
> have read remotely, versus locally or via short-circuit local reads. This
> will allow clients to know how well they're doing at bringing the computation
> to the data, which will be useful in evaluating placement policies and
> cluster configurations.