[ 
https://issues.apache.org/jira/browse/HDFS-4698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640739#comment-13640739
 ] 

Colin Patrick McCabe commented on HDFS-4698:
--------------------------------------------

Thanks for the review.  Looking at this again, I don't think it would be too 
difficult to update the stats after every read, rather than just after close.  
That would make them a lot more useful.

bq. Recommend adding comments to DFSInputStream.ReadStatistics explaining the 
meaning of the various fields, i.e. that SCR bytes will count for both SCR and 
"local bytes", that total >= local >= SCR, that remote bytes read can be 
determined by total - local, etc.

OK.

bq. For that matter, you might want to add a getRemoteBytesRead method to 
DFSInputStream.ReadStatistics to do the subtraction for the user.

Will add.
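
For illustration, the counter relationships discussed above (total >= local >= 
SCR, remote = total - local) could be sketched as follows.  This is a 
self-contained sketch, not the code in the patch: the class name 
{{ReadStatsSketch}} and the {{add*}} methods are hypothetical, and it assumes 
the counters are bumped after every read rather than only at close.

```java
/**
 * Hypothetical stand-in for DFSInputStream.ReadStatistics.
 * Invariant: totalBytesRead >= totalLocalBytesRead >= totalShortCircuitBytesRead.
 */
final class ReadStatsSketch {
    private long totalBytesRead;
    private long totalLocalBytesRead;
    private long totalShortCircuitBytesRead;

    /** Record bytes read from a remote DataNode: counts toward total only. */
    synchronized void addRemoteBytes(long n) {
        totalBytesRead += n;
    }

    /** Record bytes read from a local DataNode: counts toward total and local. */
    synchronized void addLocalBytes(long n) {
        totalBytesRead += n;
        totalLocalBytesRead += n;
    }

    /** Record short-circuit bytes: these count toward total, local, and SCR. */
    synchronized void addShortCircuitBytes(long n) {
        totalBytesRead += n;
        totalLocalBytesRead += n;
        totalShortCircuitBytesRead += n;
    }

    synchronized long getTotalBytesRead() {
        return totalBytesRead;
    }

    synchronized long getTotalLocalBytesRead() {
        return totalLocalBytesRead;
    }

    synchronized long getTotalShortCircuitBytesRead() {
        return totalShortCircuitBytesRead;
    }

    /** Does the subtraction for the caller: remote = total - local. */
    synchronized long getRemoteBytesRead() {
        return totalBytesRead - totalLocalBytesRead;
    }

    public static void main(String[] args) {
        ReadStatsSketch stats = new ReadStatsSketch();
        stats.addRemoteBytes(50);
        stats.addLocalBytes(30);
        stats.addShortCircuitBytes(20);
        System.out.println("remote bytes read: " + stats.getRemoteBytesRead());
    }
}
```

Because short-circuit bytes are added to all three counters, the ordering 
total >= local >= SCR holds by construction, and the remote count never needs 
its own field.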

bq. Any thoughts about how this new feature should interact with the existing 
FileSystem#Statistics class? Valid answers include "not at all" and/or "this 
will be helpful as-is, we can think about that later."

The per-file metrics introduced by this change are helpful when you just want 
to know how many bytes you've read out of a currently open file, or whether 
you are getting short-circuit local reads most of the time when reading that 
file.  {{FileSystem#Statistics}} is more about aggregate statistics for the 
client as a whole.  I'm a little more hesitant to add this kind of 
information there, for two reasons.  One is that {{FileSystem#Statistics}} is 
supposed to be generic to all filesystems, but SCR is somewhat of an 
implementation detail of HDFS.  The other is that we currently update those 
stats from multiple threads after every read or write operation, so they are 
correspondingly more expensive in terms of CPU time.  I think the tl;dr is 
that we should think about that later.
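
As a rough illustration of the per-file use case, a caller could turn the 
byte counters into a "mostly short-circuit?" check along these lines.  The 
class {{ReadLocalitySketch}}, its method names, and the 50% threshold are 
assumptions made for this sketch and are not part of the patch.

```java
/** Hypothetical helper that interprets per-file read counters. */
final class ReadLocalitySketch {
    private ReadLocalitySketch() {
    }

    /**
     * Fraction of all bytes read so far that came from short-circuit reads.
     * Returns 0.0 when nothing has been read yet.
     */
    static double shortCircuitFraction(long totalBytes, long shortCircuitBytes) {
        if (totalBytes < 0 || shortCircuitBytes < 0
                || shortCircuitBytes > totalBytes) {
            throw new IllegalArgumentException(
                "expected 0 <= shortCircuitBytes <= totalBytes");
        }
        return totalBytes == 0 ? 0.0 : (double) shortCircuitBytes / totalBytes;
    }

    /** True when more than half of the bytes were short-circuit reads. */
    static boolean mostlyShortCircuit(long totalBytes, long shortCircuitBytes) {
        return shortCircuitFraction(totalBytes, shortCircuitBytes) > 0.5;
    }
}
```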
                
> provide client-side metrics for remote reads, local reads, and short-circuit 
> reads
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-4698
>                 URL: https://issues.apache.org/jira/browse/HDFS-4698
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 2.0.4-alpha
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>         Attachments: HDFS-4698.001.patch
>
>
> We should provide metrics to let clients know how many bytes of data they 
> have read remotely, versus locally or via short-circuit local reads.  This 
> will allow clients to know how well they're doing at bringing the computation 
> to the data, which will be useful in evaluating placement policies and 
> cluster configurations.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
