[jira] [Comment Edited] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests

2016-05-11 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280782#comment-15280782
 ] 

Colin Patrick McCabe edited comment on HADOOP-13028 at 5/11/16 8:39 PM:


In the past I've written code for Spark that used reflection to make use of 
APIs that may or may not be present in Hadoop.  HBase often does this as well, 
so that it can use multiple versions of Hadoop.  It seems like this wouldn't be 
a lot of code.  Is that feasible in this case?
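
A minimal sketch of what such a reflection probe could look like, assuming a hypothetical no-argument accessor named getStreamStatistics. The name and the two stand-in stream classes are illustrative only and are not taken from the patch. The caller looks the method up at runtime and falls back to an empty result when it is absent, so the same code runs against both older and newer versions.

```java
import java.lang.reflect.Method;
import java.util.Optional;

public class OptionalApiProbe {

    /** Stand-in for a newer stream class that exposes statistics (hypothetical). */
    public static class NewStream {
        public Object getStreamStatistics() {
            return "seeks=3";
        }
    }

    /** Stand-in for an older stream class that lacks the accessor (hypothetical). */
    public static class OldStream {
    }

    /**
     * Invokes a public no-argument method by name if the target's class has it.
     * Returns an empty Optional when the method is missing, inaccessible,
     * throws, or returns null.
     */
    public static Optional<Object> invokeIfPresent(Object target, String methodName) {
        try {
            Method method = target.getClass().getMethod(methodName);
            return Optional.ofNullable(method.invoke(target));
        } catch (ReflectiveOperationException e) {
            // Covers NoSuchMethodException, IllegalAccessException and
            // InvocationTargetException: treat all as "API not usable here".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(invokeIfPresent(new NewStream(), "getStreamStatistics"));
        System.out.println(invokeIfPresent(new OldStream(), "getStreamStatistics"));
    }
}
```

In a real caller the Method lookup would probably be cached per class rather than repeated on every call.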

I just find the argument that we should overload an existing unrelated API to 
output statistics very off-putting.  It's like saying we should override 
hashCode to output the number of times the user called {{seek()}} on the stream.

I guess you could argue that the statistics are part of the stream state, and 
{{toString}} is intended to reflect stream state.  But it will result in very long 
output from {{toString}}, which probably isn't what most existing callers want.  And 
it's not consistent with the way any other Hadoop streams work, including other 
S3 ones like s3n.


was (Author: cmccabe):
In the past I've written code for Spark that used reflection to make use of 
APIs that may or may not be present in Hadoop.  HBase often does this as well, 
so that it can use multiple versions of Hadoop.  It seems like this wouldn't be 
a lot of code.  Is that feasible in this case?

I just find the argument that we should overload an existing unrelated API to 
output statistics very off-putting.  It's like saying we should override 
hashCode to output the number of times the user called {{seek()}} on the 
stream.  I also find it concerning that this would be something unique to s3a 
and not present in the toString methods of any other filesystem (including the 
other s3 ones).  It feels like a gross hack.

> add low level counter metrics for S3A; use in read performance tests
> 
>
> Key: HADOOP-13028
> URL: https://issues.apache.org/jira/browse/HADOOP-13028
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3, metrics
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, 
> HADOOP-13028-004.patch, HADOOP-13028-005.patch, HADOOP-13028-006.patch, 
> HADOOP-13028-007.patch, HADOOP-13028-008.patch, HADOOP-13028-009.patch, 
> HADOOP-13028-branch-2-008.patch, HADOOP-13028-branch-2-009.patch, 
> HADOOP-13028-branch-2-010.patch, HADOOP-13028-branch-2-011.patch, 
> org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, 
> org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt
>
>
> Against S3 (and other object stores), opening connections can be expensive, 
> and closing connections may be expensive (a sign of a regression). 
> S3A FS and individual input streams should have counters of the number of 
> open/close/failure+reconnect operations, and timers of how long things take. This 
> can be used downstream to measure efficiency of the code (how often 
> connections are being made), connection reliability, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests

2016-05-11 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280782#comment-15280782
 ] 

Colin Patrick McCabe edited comment on HADOOP-13028 at 5/11/16 8:43 PM:


In the past I've written code for Spark that used reflection to make use of 
APIs that may or may not be present in Hadoop.  HBase often does this as well, 
so that it can use multiple versions of Hadoop.  It seems like this wouldn't be 
a lot of code.  Is that feasible in this case?

I just find the argument that we should overload an existing unrelated API to 
output statistics very off-putting.  I guess you could argue that the 
statistics are part of the stream state, and {{toString}} is intended to reflect 
stream state.  But it will result in very long output from {{toString}}, which 
probably isn't what most existing callers want.  And it's not consistent with 
the way any other Hadoop streams work, including other S3 ones like s3n.

[~andrew.wang], [~cnauroth], [~liuml07], what do you think about this?  Is it 
acceptable to overload {{toString}} in this way, to output statistics?  The 
argument seems to be that this is easier than using reflection to get the actual 
stream statistics object.


was (Author: cmccabe):
In the past I've written code for Spark that used reflection to make use of 
APIs that may or may not be present in Hadoop.  HBase often does this as well, 
so that it can use multiple versions of Hadoop.  It seems like this wouldn't be 
a lot of code.  Is that feasible in this case?

I just find the argument that we should overload an existing unrelated API to 
output statistics very off-putting.  It's like saying we should override 
hashCode to output the number of times the user called {{seek()}} on the stream.

I guess you could argue that the statistics are part of the stream state, and 
{{toString}} is intended to reflect stream state.  But it will result in very long 
output from {{toString}}, which probably isn't what most existing callers want.  And 
it's not consistent with the way any other Hadoop streams work, including other 
S3 ones like s3n.



