[jira] [Commented] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348567#comment-15348567 ] Colin Patrick McCabe commented on HADOOP-12975: --- Thanks for the heads up, [~vinayrpet]. I fixed the accidental revert of HADOOP-13072. > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.8.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Fix For: 2.8.0 > > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch, HADOOP-12975v3.patch, HADOOP-12975v4.patch, > HADOOP-12975v5.patch, HADOOP-12975v6.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
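The fix above randomizes when each per-disk refresh thread wakes up. A minimal sketch of the idea, with illustrative names and intervals rather than the actual CachingGetSpaceUsed code:

{code}
import java.util.Random;

/** Sketch of jittered scheduling; names and values are illustrative,
 *  not the actual CachingGetSpaceUsed implementation. */
public class JitteredRefresh {
  private final Random random = new Random();
  private final long refreshIntervalMs; // base interval between "du" runs
  private final long jitterMs;          // maximum random offset, +/-

  public JitteredRefresh(long refreshIntervalMs, long jitterMs) {
    this.refreshIntervalMs = refreshIntervalMs;
    this.jitterMs = jitterMs;
  }

  /** Base interval plus a uniform offset in [-jitterMs, +jitterMs], so
   *  per-disk threads drift apart instead of hitting all disks at once. */
  public long nextDelayMs() {
    long offset = (long) ((random.nextDouble() * 2.0 - 1.0) * jitterMs);
    return Math.max(0L, refreshIntervalMs + offset);
  }

  public static void main(String[] args) {
    JitteredRefresh r = new JitteredRefresh(600_000L, 60_000L);
    System.out.println("next du in " + r.nextDelayMs() + " ms");
  }
}
{code}

Each volume computes its own delay, so the du invocations spread across the jitter window rather than spiking together.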
[jira] [Commented] (HADOOP-13305) Define common statistics names across schemes
[ https://issues.apache.org/jira/browse/HADOOP-13305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345784#comment-15345784 ] Colin Patrick McCabe commented on HADOOP-13305: --- Great idea, [~liuml07]. Should some of these variables be {{final}}? > Define common statistics names across schemes > - > > Key: HADOOP-13305 > URL: https://issues.apache.org/jira/browse/HADOOP-13305 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs >Affects Versions: 2.8.0 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13305.000.patch > > > The {{StorageStatistics}} provides a pretty general interface, i.e. > {{getLong(name)}} and {{getLongStatistics()}}. There is no shared or standard > names for the storage statistics and thus the getLong(name) is up to the > implementation of storage statistics. The problems: > # For the common statistics, downstream applications expect the same > statistics name across different storage statistics and/or file system > schemes. Chances are they have to use > {{DFSOpsCountStorageStatistics#getLong(“getStatus”)}} and > {{S3A.Statistics#getLong(“get_status”)}} for retrieving the getStatus > operation stat. > # Moreover, probing per-operation stats is hard if there is no > standard/shared common names. > It makes a lot of sense for different schemes to issue the per-operation > stats of the same name. Meanwhile, every FS will have its own internal things > to count, which can't be centrally defined or managed. But there are some > common which would be easier managed if they all had the same name. > Another motivation is that having a common set of names here will encourage > uniform instrumentation of all filesystems; it will also make it easier to > analyze the output of runs, were the stats to be published to a "performance > log" similar to the audit log. See Steve's work for S3 (e.g. [HADOOP-13171]) > This jira is track the effort of defining common StorageStatistics entry > names. Thanks to [~cmccabe], [~ste...@apache.org], [~hitesh] and [~jnp] for > offline discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
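On the {{final}} question above: the natural shape is a constants-only holder in which every name is {{public static final}}. A sketch with illustrative names; the committed constant set may differ:

{code}
/** Sketch: one shared set of statistic names, so callers can use
 *  stats.getLong(CommonStatisticNames.OP_GET_FILE_STATUS) against HDFS
 *  and S3A alike. Names here are illustrative, not the committed set. */
public final class CommonStatisticNames {
  public static final String OP_GET_FILE_STATUS = "op_get_file_status";
  public static final String OP_RENAME = "op_rename";
  public static final String OP_DELETE = "op_delete";
  public static final String OP_MKDIRS = "op_mkdirs";

  private CommonStatisticNames() {
    // constants only; never instantiated
  }
}
{code}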
[jira] [Commented] (HADOOP-12949) Add HTrace to the s3a connector
[ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340829#comment-15340829 ] Colin Patrick McCabe commented on HADOOP-12949: --- Yeah, we certainly could use the UA header for this. That assumes that Amazon's s3 implementation will start looking for this (which maybe they will?). In the short term, the big win will be just connecting up the job being run with the operations being done at the s3a level. > Add HTrace to the s3a connector > --- > > Key: HADOOP-12949 > URL: https://issues.apache.org/jira/browse/HADOOP-12949 > Project: Hadoop Common > Issue Type: Improvement > Components: fs/s3 >Reporter: Madhawa Gunasekara >Assignee: Madhawa Gunasekara > > Hi All, > s3, GCS, WASB, and other cloud blob stores are becoming increasingly > important in Hadoop. But we don't have distributed tracing for these yet. It > would be interesting to add distributed tracing here. It would enable > collecting really interesting data like probability distributions of PUT and > GET requests to s3 and their impact on MR jobs, etc. > I would like to implement this feature, Please shed some light on this > Thanks, > Madhawa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
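Connecting the job to the s3a operations amounts to opening a span around each S3 call so it attaches to the caller's trace context. A hedged sketch against the htrace-core4 API; the scope name and the way the Tracer is wired are assumptions, not the eventual patch:

{code}
import org.apache.htrace.core.TraceScope;
import org.apache.htrace.core.Tracer;

/** Hypothetical sketch of wrapping one s3a call site in a trace span. */
public class S3ATracingSketch {
  private final Tracer tracer;

  public S3ATracingSketch(Tracer tracer) {
    // Real code would build the Tracer from the Hadoop Configuration.
    this.tracer = tracer;
  }

  public void getFileStatusTraced(String key) {
    // If a trace is active on this thread, the new span becomes its child,
    // tying the S3 request back to the MR/Spark job that issued it.
    try (TraceScope scope = tracer.newScope("S3AFileSystem#getFileStatus")) {
      // ... issue the S3 HEAD/GET for 'key' here ...
    }
  }
}
{code}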
[jira] [Updated] (HADOOP-13288) Guard null stats key in FileSystemStorageStatistics
[ https://issues.apache.org/jira/browse/HADOOP-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13288: -- Resolution: Fixed Fix Version/s: 2.8.0 Target Version/s: 2.8.0 Status: Resolved (was: Patch Available) > Guard null stats key in FileSystemStorageStatistics > --- > > Key: HADOOP-13288 > URL: https://issues.apache.org/jira/browse/HADOOP-13288 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13288.000.patch, HADOOP-13288.001.patch > > > Currently in {{FileSystemStorageStatistics}} we simply returns data from > {{FileSystem#Statistics}}. However there is no null key check, which leads to > NPE problems to downstream applications. For example, we got a NPE when > passing a null key to {{FileSystemStorageStatistics#getLong()}}, exception > stack as following: > {quote} > NullPointerException > at > org.apache.hadoop.fs.FileSystemStorageStatistics.fetch(FileSystemStorageStatistics.java:80) > at > org.apache.hadoop.fs.FileSystemStorageStatistics.getLong(FileSystemStorageStatistics.java:108) > at > org.apache.tez.runtime.metrics.FileSystemStatisticsUpdater2.updateCounters(FileSystemStatisticsUpdater2.java:60) > at > org.apache.tez.runtime.metrics.TaskCounterUpdater.updateCounters(TaskCounterUpdater.java:118) > at > org.apache.tez.runtime.RuntimeTask.setFrameworkCounters(RuntimeTask.java:172) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:100) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > This jira is to add null stat key check to {{FileSystemStorageStatistics}}. > Thanks [~hitesh] for trying in Tez and reporting this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
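The guard itself is small. A simplified, self-contained sketch of the idea; the real FileSystemStorageStatistics#getLong consults FileSystem.Statistics rather than a map:

{code}
import java.util.HashMap;
import java.util.Map;

/** Simplified sketch of the null-key guard. */
class NullSafeStats {
  private final Map<String, Long> stats = new HashMap<>();

  /** Returns null for a null or unknown key instead of letting the
   *  lookup path throw NullPointerException at the caller. */
  Long getLong(String key) {
    if (key == null) {
      return null; // guard: a null key can match no statistic
    }
    return stats.get(key);
  }
}
{code}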
[jira] [Updated] (HADOOP-13280) FileSystemStorageStatistics#getLong("readOps") should return readOps + largeReadOps
[ https://issues.apache.org/jira/browse/HADOOP-13280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13280: -- Resolution: Fixed Status: Resolved (was: Patch Available) > FileSystemStorageStatistics#getLong(“readOps“) should return readOps + > largeReadOps > --- > > Key: HADOOP-13280 > URL: https://issues.apache.org/jira/browse/HADOOP-13280 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs >Affects Versions: 2.8.0 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13280-branch-2.8.000.patch, > HADOOP-13280.000.patch, HADOOP-13280.001.patch > > > Currently {{FileSystemStorageStatistics}} instance simply returns data from > {{FileSystem$Statistics}}. As to {{readOps}}, the > {{FileSystem$Statistics#getReadOps()}} returns {{readOps + largeReadOps}}. We > should make the {{FileSystemStorageStatistics#getLong(“readOps“)}} return the > sum as well. > Moreover, there is no unit tests for {{FileSystemStorageStatistics}} and this > JIRA will also address this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
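A self-contained restatement of the fix: the underlying counters are ints, and the "readOps" statistic should report their sum. Field names are illustrative:

{code}
/** Simplified sketch of the HADOOP-13280 fix; not the committed code. */
class ReadOpsSketch {
  private int readOps;      // ordinary read operations
  private int largeReadOps; // large (e.g. multi-block) read operations

  /** "readOps" reports the combined count, matching
   *  FileSystem.Statistics#getReadOps(). */
  long getLongReadOps() {
    // int + int is evaluated in int arithmetic, then widened to long at
    // the return; no explicit (long) cast is needed.
    return readOps + largeReadOps;
  }
}
{code}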
[jira] [Commented] (HADOOP-13288) Guard null stats key in FileSystemStorageStatistics
[ https://issues.apache.org/jira/browse/HADOOP-13288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340428#comment-15340428 ] Colin Patrick McCabe commented on HADOOP-13288: --- +1. Thanks, [~liuml07]. > Guard null stats key in FileSystemStorageStatistics > --- > > Key: HADOOP-13288 > URL: https://issues.apache.org/jira/browse/HADOOP-13288 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Attachments: HADOOP-13288.000.patch, HADOOP-13288.001.patch > > > Currently in {{FileSystemStorageStatistics}} we simply returns data from > {{FileSystem#Statistics}}. However there is no null key check, which leads to > NPE problems to downstream applications. For example, we got a NPE when > passing a null key to {{FileSystemStorageStatistics#getLong()}}, exception > stack as following: > {quote} > NullPointerException > at > org.apache.hadoop.fs.FileSystemStorageStatistics.fetch(FileSystemStorageStatistics.java:80) > at > org.apache.hadoop.fs.FileSystemStorageStatistics.getLong(FileSystemStorageStatistics.java:108) > at > org.apache.tez.runtime.metrics.FileSystemStatisticsUpdater2.updateCounters(FileSystemStatisticsUpdater2.java:60) > at > org.apache.tez.runtime.metrics.TaskCounterUpdater.updateCounters(TaskCounterUpdater.java:118) > at > org.apache.tez.runtime.RuntimeTask.setFrameworkCounters(RuntimeTask.java:172) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:100) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > This jira is to add null stat key check to {{FileSystemStorageStatistics}}. > Thanks [~hitesh] for trying in Tez and reporting this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13280) FileSystemStorageStatistics#getLong("readOps") should return readOps + largeReadOps
[ https://issues.apache.org/jira/browse/HADOOP-13280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340017#comment-15340017 ] Colin Patrick McCabe commented on HADOOP-13280: --- java should be able to widen from int to long without a typecast. However, let's get this important fix in, and then worry about making it prettier. Thanks, [~liuml07]. +1. > FileSystemStorageStatistics#getLong(“readOps“) should return readOps + > largeReadOps > --- > > Key: HADOOP-13280 > URL: https://issues.apache.org/jira/browse/HADOOP-13280 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs >Affects Versions: 2.8.0 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13280-branch-2.8.000.patch, > HADOOP-13280.000.patch, HADOOP-13280.001.patch > > > Currently {{FileSystemStorageStatistics}} instance simply returns data from > {{FileSystem$Statistics}}. As to {{readOps}}, the > {{FileSystem$Statistics#getReadOps()}} returns {{readOps + largeReadOps}}. We > should make the {{FileSystemStorageStatistics#getLong(“readOps“)}} return the > sum as well. > Moreover, there is no unit tests for {{FileSystemStorageStatistics}} and this > JIRA will also address this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
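The widening rule is easy to verify, with one subtlety worth noting: the addition itself is still performed in int arithmetic before the result widens, so operands near Integer.MAX_VALUE would wrap unless one of them is cast first (not a practical concern for these op counters):

{code}
public class WideningDemo {
  public static void main(String[] args) {
    int a = 2_000_000_000;
    int b = 2_000_000_000;
    long widened = a;            // implicit int to long; no cast required
    long wrapped = a + b;        // sum computed in int first: -294967296
    long correct = (long) a + b; // widen an operand first: 4000000000
    System.out.println(widened + " " + wrapped + " " + correct);
  }
}
{code}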
[jira] [Commented] (HADOOP-13280) FileSystemStorageStatistics#getLong("readOps") should return readOps + largeReadOps
[ https://issues.apache.org/jira/browse/HADOOP-13280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336647#comment-15336647 ] Colin Patrick McCabe commented on HADOOP-13280: --- Thanks, [~liuml07]. Does {{Long.valueOf(...)}} work? It would be nice to avoid the typecast if possible. > FileSystemStorageStatistics#getLong(“readOps“) should return readOps + > largeReadOps > --- > > Key: HADOOP-13280 > URL: https://issues.apache.org/jira/browse/HADOOP-13280 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs >Affects Versions: 2.8.0 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13280-branch-2.8.000.patch, > HADOOP-13280.000.patch, HADOOP-13280.001.patch > > > Currently {{FileSystemStorageStatistics}} instance simply returns data from > {{FileSystem$Statistics}}. As to {{readOps}}, the > {{FileSystem$Statistics#getReadOps()}} returns {{readOps + largeReadOps}}. We > should make the {{FileSystemStorageStatistics#getLong(“readOps“)}} return the > sum as well. > Moreover, there is no unit tests for {{FileSystemStorageStatistics}} and this > JIRA will also address this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-12975: -- Affects Version/s: (was: 2.9.0) 2.8.0 > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.8.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Fix For: 2.8.0 > > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch, HADOOP-12975v3.patch, HADOOP-12975v4.patch, > HADOOP-12975v5.patch, HADOOP-12975v6.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-12975: -- Resolution: Fixed Fix Version/s: 2.8.0 Target Version/s: 2.8.0 Status: Resolved (was: Patch Available) > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.8.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Fix For: 2.8.0 > > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch, HADOOP-12975v3.patch, HADOOP-12975v4.patch, > HADOOP-12975v5.patch, HADOOP-12975v6.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335017#comment-15335017 ] Colin Patrick McCabe commented on HADOOP-12975: --- I was just adding jitter to the commit date. +1. Thanks, [~eclark]. > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.9.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch, HADOOP-12975v3.patch, HADOOP-12975v4.patch, > HADOOP-12975v5.patch, HADOOP-12975v6.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13284) FileSystemStorageStatistics must not attempt to read non-existent rack-aware read stats in branch-2.8
[ https://issues.apache.org/jira/browse/HADOOP-13284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13284: -- Resolution: Fixed Fix Version/s: 2.8.0 Status: Resolved (was: Patch Available) > FileSystemStorageStatistics must not attempt to read non-existent rack-aware > read stats in branch-2.8 > - > > Key: HADOOP-13284 > URL: https://issues.apache.org/jira/browse/HADOOP-13284 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs >Affects Versions: 2.8.0 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13284-branch-2.8.000.patch > > > As [HDFS-9579] was not committed to {{branch-2.8}}, > {{FileSystemStorageStatistics#KEYS}} should not include those rack aware read > stats brought by [HDFS-9579], including {{bytesReadLocalHost, > bytesReadDistanceOfOneOrTwo, bytesReadDistanceOfThreeOrFour, > bytesReadDistanceOfFiveOrLarger}}. Or else, the long iterator will throw NPE > when traversing. See detailed exception stack as following (it happens when > Tez uses the new FileSystemStorageStatistics). > {code} > 2016-06-15 15:56:59,242 [DEBUG] [TezChild] |impl.TezProcessorContextImpl|: > Cleared TezProcessorContextImpl related information > 2016-06-15 15:56:59,243 [WARN] [main] |task.TezTaskRunner2|: Exception from > RunnerCallable > java.lang.NullPointerException > at > org.apache.hadoop.fs.FileSystemStorageStatistics$LongStatisticIterator.next(FileSystemStorageStatistics.java:74) > at > org.apache.hadoop.fs.FileSystemStorageStatistics$LongStatisticIterator.next(FileSystemStorageStatistics.java:51) > at > org.apache.tez.runtime.metrics.FileSystemStatisticsUpdater2.updateCounters(FileSystemStatisticsUpdater2.java:51) > at > org.apache.tez.runtime.metrics.TaskCounterUpdater.updateCounters(TaskCounterUpdater.java:118) > at > org.apache.tez.runtime.RuntimeTask.setFrameworkCounters(RuntimeTask.java:172) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:100) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 2016-06-15 15:56:59,245 [DEBUG] [main] |task.TaskReporter|: Sending heartbeat > to AM, request={ containerId=container_1466028486194_0005_01_02, > requestId=10, startIndex=0, preRoutedStartIndex=1, maxEventsToGet=500, > taskAttemptId=attempt_1466028486194_0005_1_00_00_0, eventCount=4 } > {code} > Thanks [~hitesh] for reporting this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
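The committed branch-2.8 fix simply removes the four rack-aware names from {{FileSystemStorageStatistics#KEYS}}. A more defensive alternative, sketched here purely as an illustration, is an iterator that skips keys the running version does not track instead of surfacing null entries:

{code}
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Illustrative only: skip untracked keys rather than NPE on them. */
class SafeStatKeyIterator implements Iterator<String> {
  private final String[] keys;
  private int pos = 0;

  SafeStatKeyIterator(String[] keys) {
    this.keys = keys;
  }

  /** Stand-in for "does the underlying Statistics track this stat?";
   *  the predicate here is hypothetical. */
  private boolean tracked(String key) {
    return !key.startsWith("bytesReadDistance");
  }

  @Override
  public boolean hasNext() {
    while (pos < keys.length && !tracked(keys[pos])) {
      pos++; // silently skip stats absent from this branch
    }
    return pos < keys.length;
  }

  @Override
  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return keys[pos++];
  }

  @Override
  public void remove() {
    throw new UnsupportedOperationException();
  }
}
{code}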
[jira] [Updated] (HADOOP-13284) FileSystemStorageStatistics must not attempt to read non-existent rack-aware read stats in branch-2.8
[ https://issues.apache.org/jira/browse/HADOOP-13284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13284: -- Summary: FileSystemStorageStatistics must not attempt to read non-existent rack-aware read stats in branch-2.8 (was: Remove the rack-aware read stats in FileSystemStorageStatistics from branch-2.8) > FileSystemStorageStatistics must not attempt to read non-existent rack-aware > read stats in branch-2.8 > - > > Key: HADOOP-13284 > URL: https://issues.apache.org/jira/browse/HADOOP-13284 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs >Affects Versions: 2.8.0 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13284-branch-2.8.000.patch > > > As [HDFS-9579] was not committed to {{branch-2.8}}, > {{FileSystemStorageStatistics#KEYS}} should not include those rack aware read > stats brought by [HDFS-9579], including {{bytesReadLocalHost, > bytesReadDistanceOfOneOrTwo, bytesReadDistanceOfThreeOrFour, > bytesReadDistanceOfFiveOrLarger}}. Or else, the long iterator will throw NPE > when traversing. See detailed exception stack as following (it happens when > Tez uses the new FileSystemStorageStatistics). > {code} > 2016-06-15 15:56:59,242 [DEBUG] [TezChild] |impl.TezProcessorContextImpl|: > Cleared TezProcessorContextImpl related information > 2016-06-15 15:56:59,243 [WARN] [main] |task.TezTaskRunner2|: Exception from > RunnerCallable > java.lang.NullPointerException > at > org.apache.hadoop.fs.FileSystemStorageStatistics$LongStatisticIterator.next(FileSystemStorageStatistics.java:74) > at > org.apache.hadoop.fs.FileSystemStorageStatistics$LongStatisticIterator.next(FileSystemStorageStatistics.java:51) > at > org.apache.tez.runtime.metrics.FileSystemStatisticsUpdater2.updateCounters(FileSystemStatisticsUpdater2.java:51) > at > org.apache.tez.runtime.metrics.TaskCounterUpdater.updateCounters(TaskCounterUpdater.java:118) > at > org.apache.tez.runtime.RuntimeTask.setFrameworkCounters(RuntimeTask.java:172) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:100) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 2016-06-15 15:56:59,245 [DEBUG] [main] |task.TaskReporter|: Sending heartbeat > to AM, request={ containerId=container_1466028486194_0005_01_02, > requestId=10, startIndex=0, preRoutedStartIndex=1, maxEventsToGet=500, > taskAttemptId=attempt_1466028486194_0005_1_00_00_0, eventCount=4 } > {code} > Thanks [~hitesh] for reporting this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13284) Remove the rack-aware read stats in FileSystemStorageStatistics from branch-2.8
[ https://issues.apache.org/jira/browse/HADOOP-13284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334989#comment-15334989 ] Colin Patrick McCabe commented on HADOOP-13284: --- Thanks for spotting this, [~liuml07]. Good find. +1, will commit to 2.8 shortly. > Remove the rack-aware read stats in FileSystemStorageStatistics from > branch-2.8 > --- > > Key: HADOOP-13284 > URL: https://issues.apache.org/jira/browse/HADOOP-13284 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs >Affects Versions: 2.8.0 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Attachments: HADOOP-13284-branch-2.8.000.patch > > > As [HDFS-9579] was not committed to {{branch-2.8}}, > {{FileSystemStorageStatistics#KEYS}} should not include those rack aware read > stats brought by [HDFS-9579], including {{bytesReadLocalHost, > bytesReadDistanceOfOneOrTwo, bytesReadDistanceOfThreeOrFour, > bytesReadDistanceOfFiveOrLarger}}. Or else, the long iterator will throw NPE > when traversing. See detailed exception stack as following (it happens when > Tez uses the new FileSystemStorageStatistics). > {code} > 2016-06-15 15:56:59,242 [DEBUG] [TezChild] |impl.TezProcessorContextImpl|: > Cleared TezProcessorContextImpl related information > 2016-06-15 15:56:59,243 [WARN] [main] |task.TezTaskRunner2|: Exception from > RunnerCallable > java.lang.NullPointerException > at > org.apache.hadoop.fs.FileSystemStorageStatistics$LongStatisticIterator.next(FileSystemStorageStatistics.java:74) > at > org.apache.hadoop.fs.FileSystemStorageStatistics$LongStatisticIterator.next(FileSystemStorageStatistics.java:51) > at > org.apache.tez.runtime.metrics.FileSystemStatisticsUpdater2.updateCounters(FileSystemStatisticsUpdater2.java:51) > at > org.apache.tez.runtime.metrics.TaskCounterUpdater.updateCounters(TaskCounterUpdater.java:118) > at > org.apache.tez.runtime.RuntimeTask.setFrameworkCounters(RuntimeTask.java:172) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:100) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 2016-06-15 15:56:59,245 [DEBUG] [main] |task.TaskReporter|: Sending heartbeat > to AM, request={ containerId=container_1466028486194_0005_01_02, > requestId=10, startIndex=0, preRoutedStartIndex=1, maxEventsToGet=500, > taskAttemptId=attempt_1466028486194_0005_1_00_00_0, eventCount=4 } > {code} > Thanks [~hitesh] for reporting this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13280) FileSystemStorageStatistics#getLong("readOps") should return readOps + largeReadOps
[ https://issues.apache.org/jira/browse/HADOOP-13280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334982#comment-15334982 ] Colin Patrick McCabe commented on HADOOP-13280: --- Thanks for the patch, [~liuml07]. You are right that it should be readOps + largeReadOps. It's great to have a test as well. {code} return (long) (data.getReadOps() + data.getLargeReadOps()); {code} Do we need the typecast here? Seems like it shouldn't be required since the int should be promoted to a long automatically. +1 once that's addressed. > FileSystemStorageStatistics#getLong(“readOps“) should return readOps + > largeReadOps > --- > > Key: HADOOP-13280 > URL: https://issues.apache.org/jira/browse/HADOOP-13280 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs >Affects Versions: 2.8.0 >Reporter: Mingliang Liu >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13280.000.patch, HADOOP-13280.001.patch > > > Currently {{FileSystemStorageStatistics}} instance simply returns data from > {{FileSystem$Statistics}}. As to {{readOps}}, the > {{FileSystem$Statistics#getReadOps()}} returns {{readOps + largeReadOps}}. We > should make the {{FileSystemStorageStatistics#getLong(“readOps“)}} return the > sum as well. > Moreover, there is no unit tests for {{FileSystemStorageStatistics}} and this > JIRA will also address this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13223) winutils.exe is a bug nexus and should be killed with an axe.
[ https://issues.apache.org/jira/browse/HADOOP-13223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321328#comment-15321328 ] Colin Patrick McCabe commented on HADOOP-13223: --- Thanks for the explanation, [~cnauroth]. Migrating functionality to the DLL seems like a good idea long-term for a lot of reasons. > winutils.exe is a bug nexus and should be killed with an axe. > - > > Key: HADOOP-13223 > URL: https://issues.apache.org/jira/browse/HADOOP-13223 > Project: Hadoop Common > Issue Type: Improvement > Components: bin >Affects Versions: 2.6.0 > Environment: Microsoft Windows, all versions >Reporter: john lilley > > winutils.exe was apparently created as a stopgap measure to allow Hadoop to > "work" on Windows platforms, because the NativeIO libraries aren't > implemented there (edit: even NativeIO probably doesn't cover the operations > that winutils.exe is used for). Rather than building a DLL that makes native > OS calls, the creators of winutils.exe must have decided that it would be > more expedient to create an EXE to carry out file system operations in a > linux-like fashion. Unfortunately, like many stopgap measures in software, > this one has persisted well beyond its expected lifetime and usefulness. My > team creates software that runs on Windows and Linux, and winutils.exe is > probably responsible for 20% of all issues we encounter, both during > development and in the field. > Problem #1 with winutils.exe is that it is simply missing from many popular > distros and/or the client-side software installation for said distros, when > supplied, fails to install winutils.exe. Thus, as software developers, we > are forced to pick one version and distribute and install it with our > software. > Which leads to problem #2: winutils.exe are not always compatible. In > particular, MapR MUST have its winutils.exe in the system path, but doing so > breaks the Hadoop distro for every other Hadoop vendor. This makes creating > and maintaining test environments that work with all of the Hadoop distros we > want to test unnecessarily tedious and error-prone. > Problem #3 is that the mechanism by which you inform the Hadoop client > software where to find winutils.exe is poorly documented and fragile. First, > it can be in the PATH. If it is in the PATH, that is where it is found. > However, the documentation, such as it is, makes no mention of this, and > instead says that you should set the HADOOP_HOME environment variable, which > does NOT override the winutils.exe found in your system PATH. > Which leads to problem #4: There is no logging that says where winutils.exe > was actually found and loaded. Because of this, fixing problems of finding > the wrong winutils.exe are extremely difficult. > Problem #5 is that most of the time, such as when accessing straight up HDFS > and YARN, one does not *need* winutils.exe. But if it is missing, the log > messages complain about its absence. When we are trying to diagnose an > obscure issue in Hadoop (of which there are many), the presence of this red > herring leads to all sorts of time wasted until someone on the team points > out that winutils.exe is not the problem, at least not this time. > Problem #6 is that errors and stack traces from issues involving winutils.exe > are not helpful. The Java stack trace ends at the ProcessBuilder call. Only > through bitter experience is one able to connect the dots from > "ProcessBuilder is the last thing on the stack" to "something is wrong with > winutils.exe". 
> Note that none of these involve running Hadoop on Windows. They are only > encountered when using Hadoop client libraries to access a cluster from > Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13223) winutils.exe is a bug nexus and should be killed with an axe.
[ https://issues.apache.org/jira/browse/HADOOP-13223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319318#comment-15319318 ] Colin Patrick McCabe commented on HADOOP-13223: --- Hmm. It's not clear to me why a DLL would be less prone to path problems than an EXE. It seems like we should just be putting a version number on the EXE, so that we avoid these conflicts. We have the same problem with libhadoop-- see HADOOP-11127. > winutils.exe is a bug nexus and should be killed with an axe. > - > > Key: HADOOP-13223 > URL: https://issues.apache.org/jira/browse/HADOOP-13223 > Project: Hadoop Common > Issue Type: Improvement > Components: bin >Affects Versions: 2.6.0 > Environment: Microsoft Windows, all versions >Reporter: john lilley > > winutils.exe was apparently created as a stopgap measure to allow Hadoop to > "work" on Windows platforms, because the NativeIO libraries aren't > implemented there (edit: even NativeIO probably doesn't cover the operations > that winutils.exe is used for). Rather than building a DLL that makes native > OS calls, the creators of winutils.exe must have decided that it would be > more expedient to create an EXE to carry out file system operations in a > linux-like fashion. Unfortunately, like many stopgap measures in software, > this one has persisted well beyond its expected lifetime and usefulness. My > team creates software that runs on Windows and Linux, and winutils.exe is > probably responsible for 20% of all issues we encounter, both during > development and in the field. > Problem #1 with winutils.exe is that it is simply missing from many popular > distros and/or the client-side software installation for said distros, when > supplied, fails to install winutils.exe. Thus, as software developers, we > are forced to pick one version and distribute and install it with our > software. > Which leads to problem #2: winutils.exe are not always compatible. In > particular, MapR MUST have its winutils.exe in the system path, but doing so > breaks the Hadoop distro for every other Hadoop vendor. This makes creating > and maintaining test environments that work with all of the Hadoop distros we > want to test unnecessarily tedious and error-prone. > Problem #3 is that the mechanism by which you inform the Hadoop client > software where to find winutils.exe is poorly documented and fragile. First, > it can be in the PATH. If it is in the PATH, that is where it is found. > However, the documentation, such as it is, makes no mention of this, and > instead says that you should set the HADOOP_HOME environment variable, which > does NOT override the winutils.exe found in your system PATH. > Which leads to problem #4: There is no logging that says where winutils.exe > was actually found and loaded. Because of this, fixing problems of finding > the wrong winutils.exe are extremely difficult. > Problem #5 is that most of the time, such as when accessing straight up HDFS > and YARN, one does not *need* winutils.exe. But if it is missing, the log > messages complain about its absence. When we are trying to diagnose an > obscure issue in Hadoop (of which there are many), the presence of this red > herring leads to all sorts of time wasted until someone on the team points > out that winutils.exe is not the problem, at least not this time. > Problem #6 is that errors and stack traces from issues involving winutils.exe > are not helpful. The Java stack trace ends at the ProcessBuilder call. 
Only > through bitter experience is one able to connect the dots from > "ProcessBuilder is the last thing on the stack" to "something is wrong with > winutils.exe". > Note that none of these involve running Hadoop on Windows. They are only > encountered when using Hadoop client libraries to access a cluster from > Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13137) TraceAdmin should support Kerberized cluster
[ https://issues.apache.org/jira/browse/HADOOP-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13137: -- Resolution: Fixed Fix Version/s: 2.8.0 Target Version/s: 2.8.0 Status: Resolved (was: Patch Available) > TraceAdmin should support Kerberized cluster > > > Key: HADOOP-13137 > URL: https://issues.apache.org/jira/browse/HADOOP-13137 > Project: Hadoop Common > Issue Type: Bug > Components: tracing >Affects Versions: 2.6.0, 3.0.0-alpha1 > Environment: CDH5.5.1 cluster with Kerberos >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang > Labels: Kerberos > Fix For: 2.8.0 > > Attachments: HADOOP-13137.001.patch, HADOOP-13137.002.patch, > HADOOP-13137.003.patch, HADOOP-13137.004.patch, HADOOP-13137.005.patch > > > When I run {{hadoop trace}} command for a Kerberized NameNode, it failed with > the following error: > [hdfs@weichiu-encryption-1 root]$ hadoop trace -list -host > weichiu-encryption-1.vpc.cloudera.com:802216/05/12 00:02:13 WARN ipc.Client: > Exception encountered while connecting to the server : > java.lang.IllegalArgumentException: Failed to specify server's Kerberos > principal name > 16/05/12 00:02:13 WARN security.UserGroupInformation: > PriviledgedActionException as:h...@vpc.cloudera.com (auth:KERBEROS) > cause:java.io.IOException: java.lang.IllegalArgumentException: Failed to > specify server's Kerberos principal name > Exception in thread "main" java.io.IOException: Failed on local exception: > java.io.IOException: java.lang.IllegalArgumentException: Failed to specify > server's Kerberos principal name; Host Details : local host is: > "weichiu-encryption-1.vpc.cloudera.com/172.26.8.185"; destination host is: > "weichiu-encryption-1.vpc.cloudera.com":8022; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1470) > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at com.sun.proxy.$Proxy11.listSpanReceivers(Unknown Source) > at > org.apache.hadoop.tracing.TraceAdminProtocolTranslatorPB.listSpanReceivers(TraceAdminProtocolTranslatorPB.java:58) > at > org.apache.hadoop.tracing.TraceAdmin.listSpanReceivers(TraceAdmin.java:68) > at org.apache.hadoop.tracing.TraceAdmin.run(TraceAdmin.java:177) > at org.apache.hadoop.tracing.TraceAdmin.main(TraceAdmin.java:195) > Caused by: java.io.IOException: java.lang.IllegalArgumentException: Failed to > specify server's Kerberos principal name > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733) > at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1519) > at org.apache.hadoop.ipc.Client.call(Client.java:1442) > ... 
7 more > Caused by: java.lang.IllegalArgumentException: Failed to specify server's > Kerberos principal name > at > org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:322) > at > org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:231) > at > org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555) > at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720) > ... 10 more > It is failing because {{TraceAdmin}} does not set up the property > {{CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY}} > Fixing it may require some restructuring, as the NameNode principal > {{dfs.namenode.kerberos.principal
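The shape of the fix follows directly from the diagnosis above: before opening the RPC connection, the client must be told which Kerberos principal the target server runs as. A sketch of setting the property named in the report; how the principal reaches the tool (for example, a command-line option) is an assumption here:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeys;

/** Sketch: set the server principal the SASL client should expect. */
public class TraceAdminPrincipalSketch {
  static Configuration withServerPrincipal(Configuration conf,
      String serverPrincipal) {
    // Without this, SaslRpcClient#getServerPrincipal fails with
    // "Failed to specify server's Kerberos principal name".
    conf.set(
        CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY,
        serverPrincipal);
    return conf;
  }
}
{code}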
[jira] [Commented] (HADOOP-13137) TraceAdmin should support Kerberized cluster
[ https://issues.apache.org/jira/browse/HADOOP-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308999#comment-15308999 ] Colin Patrick McCabe commented on HADOOP-13137: --- bq. The test failures look unrelated. I agree-- I ran them locally, and they passed. Thanks, [~jojochuang] and [~steve_l]. +1. > TraceAdmin should support Kerberized cluster > > > Key: HADOOP-13137 > URL: https://issues.apache.org/jira/browse/HADOOP-13137 > Project: Hadoop Common > Issue Type: Bug > Components: tracing >Affects Versions: 2.6.0, 3.0.0-alpha1 > Environment: CDH5.5.1 cluster with Kerberos >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang > Labels: Kerberos > Attachments: HADOOP-13137.001.patch, HADOOP-13137.002.patch, > HADOOP-13137.003.patch, HADOOP-13137.004.patch, HADOOP-13137.005.patch > > > When I run {{hadoop trace}} command for a Kerberized NameNode, it failed with > the following error: > [hdfs@weichiu-encryption-1 root]$ hadoop trace -list -host > weichiu-encryption-1.vpc.cloudera.com:802216/05/12 00:02:13 WARN ipc.Client: > Exception encountered while connecting to the server : > java.lang.IllegalArgumentException: Failed to specify server's Kerberos > principal name > 16/05/12 00:02:13 WARN security.UserGroupInformation: > PriviledgedActionException as:h...@vpc.cloudera.com (auth:KERBEROS) > cause:java.io.IOException: java.lang.IllegalArgumentException: Failed to > specify server's Kerberos principal name > Exception in thread "main" java.io.IOException: Failed on local exception: > java.io.IOException: java.lang.IllegalArgumentException: Failed to specify > server's Kerberos principal name; Host Details : local host is: > "weichiu-encryption-1.vpc.cloudera.com/172.26.8.185"; destination host is: > "weichiu-encryption-1.vpc.cloudera.com":8022; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1470) > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at com.sun.proxy.$Proxy11.listSpanReceivers(Unknown Source) > at > org.apache.hadoop.tracing.TraceAdminProtocolTranslatorPB.listSpanReceivers(TraceAdminProtocolTranslatorPB.java:58) > at > org.apache.hadoop.tracing.TraceAdmin.listSpanReceivers(TraceAdmin.java:68) > at org.apache.hadoop.tracing.TraceAdmin.run(TraceAdmin.java:177) > at org.apache.hadoop.tracing.TraceAdmin.main(TraceAdmin.java:195) > Caused by: java.io.IOException: java.lang.IllegalArgumentException: Failed to > specify server's Kerberos principal name > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733) > at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1519) > at org.apache.hadoop.ipc.Client.call(Client.java:1442) > ... 
7 more > Caused by: java.lang.IllegalArgumentException: Failed to specify server's > Kerberos principal name > at > org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:322) > at > org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:231) > at > org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555) > at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720) > ... 10 more > It is failing because {{TraceAdmin}} does not set up the property > {{CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY}} > Fixing it may require some restructuring, as the NameNode principal > {{dfs.namenode.
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15301319#comment-15301319 ] Colin Patrick McCabe commented on HADOOP-13010: --- Thanks for your work on this, [~drankye]. +1. Let's continue the discussion on the follow-on JIRAs. > Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 3.0.0-alpha1 >Reporter: Kai Zheng >Assignee: Kai Zheng > Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch, > HADOOP-13010-v3.patch, HADOOP-13010-v4.patch, HADOOP-13010-v5.patch, > HADOOP-13010-v6.patch, HADOOP-13010-v7.patch > > > This will refactor raw erasure coders according to some comments received so > far. > * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to > rely class inheritance to reuse the codes, instead they can be moved to some > utility. > * Suggested by [~jingzhao] somewhere quite some time ago, better to have a > state holder to keep some checking results for later reuse during an > encode/decode call. > This would not get rid of some inheritance levels as doing so isn't clear yet > for the moment and also incurs big impact. I do wish the end result by this > refactoring will make all the levels more clear and easier to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298600#comment-15298600 ] Colin Patrick McCabe commented on HADOOP-13010: --- {{TestCodecRawCoderMapping}} fails for me: {code} testRSDefaultRawCoder(org.apache.hadoop.io.erasurecode.TestCodecRawCoderMapping) Time elapsed: 0.015 sec <<< FAILURE! java.lang.AssertionError: null at org.junit.Assert.fail(Assert.java:86) at org.junit.Assert.assertTrue(Assert.java:41) at org.junit.Assert.assertTrue(Assert.java:52) at org.apache.hadoop.io.erasurecode.TestCodecRawCoderMapping.testRSDefaultRawCoder(TestCodecRawCoderMapping.java:54) {code} > Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch, > HADOOP-13010-v3.patch, HADOOP-13010-v4.patch, HADOOP-13010-v5.patch, > HADOOP-13010-v6.patch > > > This will refactor raw erasure coders according to some comments received so > far. > * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to > rely class inheritance to reuse the codes, instead they can be moved to some > utility. > * Suggested by [~jingzhao] somewhere quite some time ago, better to have a > state holder to keep some checking results for later reuse during an > encode/decode call. > This would not get rid of some inheritance levels as doing so isn't clear yet > for the moment and also incurs big impact. I do wish the end result by this > refactoring will make all the levels more clear and easier to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298587#comment-15298587 ] Colin Patrick McCabe commented on HADOOP-13010: --- It was nice talking to you, [~drankye]. It's too bad that we didn't have more time (it was a busy week because I was going out of town). bq. As I explained as above, \[the configuration-based\] approach might not work in all cases, because: there are more than one codecs to be configured and for each of these codecs there may be more than one coder implementation to be configured, and it's not easy to flatten the two layers into one dimension (here you used algorithm). I think these are really configuration questions, not questions about how the code should be structured. What does the user actually need to configure? If the user just configures a coder implementation, does that fully determine the codec which is being used? If so, we should have only one configuration knob-- coder. If a coder could be used for multiple codecs, then we need to have at least two knobs that the user can configure-- one for codec, and another for coder. Once we know what the configuration knobs are, we probably only need one or two functions to create the objects we need based on a {{Configuration}} object, not a whole mess of factory objects. Anyway, let's talk about refactoring codec configuration and factories in a follow-on JIRA. I think we've made a lot of good progress here and it will helpful to get this patch committed. > Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch, > HADOOP-13010-v3.patch, HADOOP-13010-v4.patch, HADOOP-13010-v5.patch, > HADOOP-13010-v6.patch > > > This will refactor raw erasure coders according to some comments received so > far. > * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to > rely class inheritance to reuse the codes, instead they can be moved to some > utility. > * Suggested by [~jingzhao] somewhere quite some time ago, better to have a > state holder to keep some checking results for later reuse during an > encode/decode call. > This would not get rid of some inheritance levels as doing so isn't clear yet > for the moment and also incurs big impact. I do wish the end result by this > refactoring will make all the levels more clear and easier to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
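For reference, a compilable restatement of the sketch in the comment above, with the reflection generics and checked exceptions spelled out; the configuration key and default class name remain hypothetical:

{code}
import java.lang.reflect.Constructor;
import org.apache.hadoop.conf.Configuration;

/** Configuration-driven coder creation, as sketched above. */
public final class RawCoderFactory {
  static final String CODER_KEY = "erasure.coder.algorithm";   // hypothetical
  static final String CODER_DEFAULT = "org.example.ec.RSRaw";  // hypothetical

  /** suffix is "Encoder" or "Decoder"; the caller casts to the coder type. */
  public static Object createRawCoder(Configuration conf, String suffix)
      throws ReflectiveOperationException {
    String className = conf.get(CODER_KEY, CODER_DEFAULT) + suffix;
    Class<?> clazz = Thread.currentThread()
        .getContextClassLoader().loadClass(className);
    Constructor<?> ctor = clazz.getConstructor(Configuration.class);
    return ctor.newInstance(conf);
  }
}
{code}

With this, one configuration knob selects the algorithm, and the encoder/decoder pair is derived by name, so no per-codec factory classes are needed.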
[jira] [Commented] (HADOOP-7352) Contracts of LocalFileSystem and DistributedFileSystem should require FileSystem::listStatus throw IOException not return null upon access error
[ https://issues.apache.org/jira/browse/HADOOP-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296947#comment-15296947 ] Colin Patrick McCabe commented on HADOOP-7352: -- This should be easier with the new jdk7 changes. We now have access to directory listing APIs like DirectoryStream that throw IOEs on problems instead of returning null. > Contracts of LocalFileSystem and DistributedFileSystem should require > FileSystem::listStatus throw IOException not return null upon access error > > > Key: HADOOP-7352 > URL: https://issues.apache.org/jira/browse/HADOOP-7352 > Project: Hadoop Common > Issue Type: Improvement > Components: fs, fs/s3 >Reporter: Matt Foley >Assignee: Matt Foley > > In HADOOP-6201 and HDFS-538 it was agreed that FileSystem::listStatus should > throw FileNotFoundException instead of returning null, when the target > directory did not exist. > However, in LocalFileSystem implementation today, FileSystem::listStatus > still may return null, when the target directory exists but does not grant > read permission. This causes NPE in many callers, for all the reasons cited > in HADOOP-6201 and HDFS-538. See HADOOP-7327 and its linked issues for > examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
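A minimal example of the contrast: the NIO.2 listing API surfaces an unreadable or missing directory as an exception (for example java.nio.file.AccessDeniedException) instead of a null result:

{code}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListDirNio {
  public static void main(String[] args) throws IOException {
    Path dir = Paths.get(args.length > 0 ? args[0] : ".");
    // Throws NoSuchFileException / AccessDeniedException rather than
    // returning null, so callers cannot forget the error case.
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path entry : stream) {
        System.out.println(entry.getFileName());
      }
    }
  }
}
{code}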
[jira] [Commented] (HADOOP-13137) TraceAdmin should support Kerberized cluster
[ https://issues.apache.org/jira/browse/HADOOP-13137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293773#comment-15293773 ] Colin Patrick McCabe commented on HADOOP-13137: --- The patch looks good to me. I think there are a few other commands that might need to get an argument like this, if it's necessary when communicating directly with a kerberized Hadoop server. I do wonder why we need a new file, TestKerberizedTraceAdmin.java, when it could have been a test in TestTraceAdmin.java, but I don't feel that strongly about it. Thanks, [~jojochuang] and [~steve_l]. > TraceAdmin should support Kerberized cluster > > > Key: HADOOP-13137 > URL: https://issues.apache.org/jira/browse/HADOOP-13137 > Project: Hadoop Common > Issue Type: Bug > Components: tracing >Affects Versions: 2.6.0, 3.0.0-alpha1 > Environment: CDH5.5.1 cluster with Kerberos >Reporter: Wei-Chiu Chuang >Assignee: Wei-Chiu Chuang > Labels: Kerberos > Attachments: HADOOP-13137.001.patch, HADOOP-13137.002.patch > > > When I run {{hadoop trace}} command for a Kerberized NameNode, it failed with > the following error: > [hdfs@weichiu-encryption-1 root]$ hadoop trace -list -host > weichiu-encryption-1.vpc.cloudera.com:802216/05/12 00:02:13 WARN ipc.Client: > Exception encountered while connecting to the server : > java.lang.IllegalArgumentException: Failed to specify server's Kerberos > principal name > 16/05/12 00:02:13 WARN security.UserGroupInformation: > PriviledgedActionException as:h...@vpc.cloudera.com (auth:KERBEROS) > cause:java.io.IOException: java.lang.IllegalArgumentException: Failed to > specify server's Kerberos principal name > Exception in thread "main" java.io.IOException: Failed on local exception: > java.io.IOException: java.lang.IllegalArgumentException: Failed to specify > server's Kerberos principal name; Host Details : local host is: > "weichiu-encryption-1.vpc.cloudera.com/172.26.8.185"; destination host is: > "weichiu-encryption-1.vpc.cloudera.com":8022; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1470) > at org.apache.hadoop.ipc.Client.call(Client.java:1403) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at com.sun.proxy.$Proxy11.listSpanReceivers(Unknown Source) > at > org.apache.hadoop.tracing.TraceAdminProtocolTranslatorPB.listSpanReceivers(TraceAdminProtocolTranslatorPB.java:58) > at > org.apache.hadoop.tracing.TraceAdmin.listSpanReceivers(TraceAdmin.java:68) > at org.apache.hadoop.tracing.TraceAdmin.run(TraceAdmin.java:177) > at org.apache.hadoop.tracing.TraceAdmin.main(TraceAdmin.java:195) > Caused by: java.io.IOException: java.lang.IllegalArgumentException: Failed to > specify server's Kerberos principal name > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733) > at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1519) > at org.apache.hadoop.ipc.Client.call(Client.java:1442) > ... 
7 more > Caused by: java.lang.IllegalArgumentException: Failed to specify server's > Kerberos principal name > at > org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:322) > at > org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:231) > at > org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555) > at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720) > ... 10 more > It is failing because {{Tra
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289361#comment-15289361 ] Colin Patrick McCabe commented on HADOOP-13010: --- bq. Not sure if it's good to have something like isXOR or isRS, because we'll have more coder algorithms other than the current two. That's a fair point. It seems unlikely that we need an isXOR or isRS method. bq. OK, we can have \[createRawEncoder\]. Maybe it wouldn't be bad to have two shortcut methods additionally, createRSRawEncoder and createXORRawEncoder, because both are the primitive, essential and most used ones in implementing advanced coders and on the HDFS side. I want both to be prominent and easily used. It seems better just to have one function, {{createRawEncoder}}, than to have lots of functions for every type of encoder. bq. I discussed this with Uma Maheswara Rao G quite some time ago when introducing these factories. There isn't a clear way to compose or reduce the full class name of a raw coder because it should be plugin-able and configurable. In the current approach, for each codec, there could be some coder implementations, and for each, the corresponding coder factory can be configured. We discussed this offline and I think the conclusion is that we don't need the factories for anything. We can just have a configuration key like {{erasure.coder.algorithm}} and then code like this:
{code}
RawErasureEncoder createRawEncoder(Configuration conf) throws Exception {
  // Load "<prefix>Encoder" reflectively and invoke its
  // Configuration-based constructor.
  String classPrefix = conf.get("erasure.coder.algorithm",
      DEFAULT_ERASURE_CODER_ALGORITHM);
  String name = classPrefix + "Encoder";
  Constructor<?> ctor =
      classLoader.loadClass(name).getConstructor(Configuration.class);
  return (RawErasureEncoder) ctor.newInstance(conf);
}

RawErasureDecoder createRawDecoder(Configuration conf) throws Exception {
  // Same pattern for the "<prefix>Decoder" counterpart.
  String classPrefix = conf.get("erasure.coder.algorithm",
      DEFAULT_ERASURE_CODER_ALGORITHM);
  String name = classPrefix + "Decoder";
  Constructor<?> ctor =
      classLoader.loadClass(name).getConstructor(Configuration.class);
  return (RawErasureDecoder) ctor.newInstance(conf);
}
{code}
bq. It seems this can simplify the related functions, but I am not sure it would make the code more readable. The mentioned variables are very specific to encode/decode related calls using on-heap bytebuffer or byte array buffers. Maybe DecodingState could be kept simple, not holding too many intermediate variables, because the code using them is not suitable to be moved to the class. Reducing the number of function parameters from 8 or 9 to 1 or 2 seems like it would make the code much more readable. I don't understand what the rationale is for keeping these parameters out of DecodingState. Perhaps we could discuss this in a follow-on JIRA, though. > Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch, > HADOOP-13010-v3.patch, HADOOP-13010-v4.patch, HADOOP-13010-v5.patch > > > This will refactor raw erasure coders according to some comments received so > far. > * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to > rely on class inheritance to reuse the code; instead it can be moved to some > utility. > * Suggested by [~jingzhao] somewhere quite some time ago, better to have a > state holder to keep some checking results for later reuse during an > encode/decode call. 
> This will not get rid of some inheritance levels, as how to do that isn't clear yet and it would also have a big impact. I do hope the end result of this refactoring will make all the levels clearer and easier to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
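To make the configuration-key approach from the comment above concrete, here is a hedged usage sketch (the key name and the {{createRawEncoder}}/{{createRawDecoder}} helpers come from that comment; the class-name prefix shown is an assumed example, not a verified Hadoop class):
{code}
Configuration conf = new Configuration();
// The prefix selects the coder family; the factory methods append the
// "Encoder"/"Decoder" suffix. The prefix value below is assumed.
conf.set("erasure.coder.algorithm",
    "org.apache.hadoop.io.erasurecode.rawcoder.RSRaw");
RawErasureEncoder encoder = createRawEncoder(conf);
RawErasureDecoder decoder = createRawDecoder(conf);
{code}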
[jira] [Updated] (HADOOP-13140) FileSystem#initialize must not attempt to create StorageStatistics objects with null or empty schemes
[ https://issues.apache.org/jira/browse/HADOOP-13140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13140: -- Resolution: Fixed Fix Version/s: 2.8.0 Target Version/s: 2.8.0 Status: Resolved (was: Patch Available) > FileSystem#initialize must not attempt to create StorageStatistics objects > with null or empty schemes > - > > Key: HADOOP-13140 > URL: https://issues.apache.org/jira/browse/HADOOP-13140 > Project: Hadoop Common > Issue Type: Bug > Components: fs >Affects Versions: 2.8.0 >Reporter: Brahma Reddy Battula >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13140.000.patch, HADOOP-13140.001.patch, > HADOOP-13140.002.patch > > > {{org.apache.hadoop.fs.GlobalStorageStatistics#put}} does not check for a null > scheme, and the internal map will throw an NPE. This was reported by a flaky > test {{TestFileSystemApplicationHistoryStore}}. Thanks [~brahmareddy] for > reporting. > To address this, > # Fix the test by providing a valid URI, e.g. {{file:///}} > # Guard against a null scheme in {{GlobalStorageStatistics#put}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
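A rough sketch of the guard in the second item of the description ({{put}} and {{StorageStatisticsProvider}} appear elsewhere in this thread; the map field, the precondition utilities, and the message wording are assumptions):
{code}
public synchronized StorageStatistics put(String name,
    StorageStatisticsProvider provider) {
  // Reject null/empty schemes up front with a meaningful message,
  // instead of letting the backing map throw a bare NPE.
  Preconditions.checkArgument(!StringUtils.isEmpty(name),
      "null or empty storage statistics name: " + name);
  StorageStatistics stats = map.get(name);
  if (stats == null) {
    stats = provider.provide();
    map.put(name, stats);
  }
  return stats;
}
{code}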
[jira] [Updated] (HADOOP-13140) FileSystem#initialize must not attempt to create StorageStatistics objects with null or empty schemes
[ https://issues.apache.org/jira/browse/HADOOP-13140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13140: -- Summary: FileSystem#initialize must not attempt to create StorageStatistics objects with null or empty schemes (was: GlobalStorageStatistics should check null FileSystem scheme to avoid NPE) > FileSystem#initialize must not attempt to create StorageStatistics objects > with null or empty schemes > - > > Key: HADOOP-13140 > URL: https://issues.apache.org/jira/browse/HADOOP-13140 > Project: Hadoop Common > Issue Type: Bug > Components: fs >Affects Versions: 2.8.0 >Reporter: Brahma Reddy Battula >Assignee: Mingliang Liu > Attachments: HADOOP-13140.000.patch, HADOOP-13140.001.patch, > HADOOP-13140.002.patch > > > {{org.apache.hadoop.fs.GlobalStorageStatistics#put}} does not check for a null > scheme, and the internal map will throw an NPE. This was reported by a flaky > test {{TestFileSystemApplicationHistoryStore}}. Thanks [~brahmareddy] for > reporting. > To address this, > # Fix the test by providing a valid URI, e.g. {{file:///}} > # Guard against a null scheme in {{GlobalStorageStatistics#put}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13140) GlobalStorageStatistics should check null FileSystem scheme to avoid NPE
[ https://issues.apache.org/jira/browse/HADOOP-13140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289221#comment-15289221 ] Colin Patrick McCabe commented on HADOOP-13140: --- +1. Thanks, [~liuml07]. > GlobalStorageStatistics should check null FileSystem scheme to avoid NPE > > > Key: HADOOP-13140 > URL: https://issues.apache.org/jira/browse/HADOOP-13140 > Project: Hadoop Common > Issue Type: Bug > Components: fs >Affects Versions: 2.8.0 >Reporter: Brahma Reddy Battula >Assignee: Mingliang Liu > Attachments: HADOOP-13140.000.patch, HADOOP-13140.001.patch, > HADOOP-13140.002.patch > > > {{org.apache.hadoop.fs.GlobalStorageStatistics#put}} does not check for a null > scheme, and the internal map will throw an NPE. This was reported by a flaky > test {{TestFileSystemApplicationHistoryStore}}. Thanks [~brahmareddy] for > reporting. > To address this, > # Fix the test by providing a valid URI, e.g. {{file:///}} > # Guard against a null scheme in {{GlobalStorageStatistics#put}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13140) GlobalStorageStatistics should check null FileSystem scheme to avoid NPE
[ https://issues.apache.org/jira/browse/HADOOP-13140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15287297#comment-15287297 ] Colin Patrick McCabe commented on HADOOP-13140: ---
{code}
/**
 * Called after a new FileSystem instance is constructed.
 * @param name a uri whose authority section names the host, port, etc.
 *             for this FileSystem
 * @param conf the configuration
 */
public void initialize(URI name, Configuration conf) throws IOException {
  statistics = getStatistics(name.getScheme(), getClass());
  resolveSymlinks = conf.getBoolean(
      CommonConfigurationKeys.FS_CLIENT_RESOLVE_REMOTE_SYMLINKS_KEY,
      CommonConfigurationKeys.FS_CLIENT_RESOLVE_REMOTE_SYMLINKS_DEFAULT);
}
{code}
If {{name#getScheme()}} is empty or null here, we can use {{FileSystem#getDefaultUri#getScheme}} to pass a non-null scheme. That should cover almost all the cases where a null scheme would be passed. If the user intentionally passes a null or empty scheme directly to {{FileSystem#getStatistics}}, we should throw an exception. > GlobalStorageStatistics should check null FileSystem scheme to avoid NPE > > > Key: HADOOP-13140 > URL: https://issues.apache.org/jira/browse/HADOOP-13140 > Project: Hadoop Common > Issue Type: Bug > Components: fs >Affects Versions: 2.8.0 >Reporter: Brahma Reddy Battula >Assignee: Mingliang Liu > Attachments: HADOOP-13140.000.patch > > > {{org.apache.hadoop.fs.GlobalStorageStatistics#put}} does not check for a null > scheme, and the internal map will throw an NPE. This was reported by a flaky > test {{TestFileSystemApplicationHistoryStore}}. Thanks [~brahmareddy] for > reporting. > To address this, > # Fix the test by providing a valid URI, e.g. {{file:///}} > # Guard against a null scheme in {{GlobalStorageStatistics#put}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
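A minimal sketch of the proposed fallback (only the guard is new; the rest is from the snippet above, and the exact placement is an assumption):
{code}
public void initialize(URI name, Configuration conf) throws IOException {
  String scheme = name.getScheme();
  if (scheme == null || scheme.isEmpty()) {
    // Fall back to the default filesystem's scheme, mirroring how
    // FileSystem#get resolves a URI that has no scheme.
    scheme = FileSystem.getDefaultUri(conf).getScheme();
  }
  statistics = getStatistics(scheme, getClass());
  resolveSymlinks = conf.getBoolean(
      CommonConfigurationKeys.FS_CLIENT_RESOLVE_REMOTE_SYMLINKS_KEY,
      CommonConfigurationKeys.FS_CLIENT_RESOLVE_REMOTE_SYMLINKS_DEFAULT);
}
{code}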
[jira] [Commented] (HADOOP-13065) Add a new interface for retrieving FS and FC Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285257#comment-15285257 ] Colin Patrick McCabe commented on HADOOP-13065: --- bq. I filed HADOOP-13140 to track the effort. Thanks, [~liuml07]. bq. [~mingma] wrote: BTW, is the network distance metric something general for all file systems, or is it more specific to HDFS? For example, the local file system doesn't need that. If it is more HDFS-specific, I wonder if we should move it to HDFS-specific metrics. I agree that it's more conceptually consistent to put the distance-related metrics in HDFS-specific code. However, we would have to develop an optimized thread-local mechanism to do this, to avoid causing a performance regression in HDFS streams. Perhaps it would be better to simply move this to HDFS's existing per-stream ReadStatistics for now. > Add a new interface for retrieving FS and FC Statistics > --- > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13065-007.patch, HADOOP-13065.008.patch, > HADOOP-13065.009.patch, HADOOP-13065.010.patch, HADOOP-13065.011.patch, > HADOOP-13065.012.patch, HADOOP-13065.013.patch, HDFS-10175.000.patch, > HDFS-10175.001.patch, HDFS-10175.002.patch, HDFS-10175.003.patch, > HDFS-10175.004.patch, HDFS-10175.005.patch, HDFS-10175.006.patch, > TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. > Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
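For illustration, a hedged sketch of the kind of optimized thread-local mechanism the comment above refers to (all names and the bucket count are assumptions, not from any patch):
{code}
class DistanceStatistics {
  private static final int DISTANCE_BUCKETS = 5; // assumed bucket count

  // One counter array per reader thread: the hot read path touches no
  // locks or atomics, which is the performance concern raised above.
  private static final ThreadLocal<long[]> BYTES_BY_DISTANCE =
      new ThreadLocal<long[]>() {
        @Override
        protected long[] initialValue() {
          return new long[DISTANCE_BUCKETS];
        }
      };

  static void addBytesRead(int networkDistance, long bytes) {
    BYTES_BY_DISTANCE.get()[networkDistance] += bytes;
  }
}
{code}
The hard part, and presumably why the comment treats this as real work, is aggregating the per-thread arrays into a global view without reintroducing synchronization on the read path.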
[jira] [Commented] (HADOOP-13140) GlobalStorageStatistics should check null FileSystem scheme to avoid NPE
[ https://issues.apache.org/jira/browse/HADOOP-13140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285252#comment-15285252 ] Colin Patrick McCabe commented on HADOOP-13140: --- Thanks for looking at this. I don't think it makes sense to use a null, empty string, or otherwise invalid string to identify a statistics object. It has no meaning to the user. I think if the user passes a URL with a null scheme, we should call {{FileSystem#getDefaultUri}} and use the default scheme, similar to how {{FileSystem#get}} functions. > GlobalStorageStatistics should check null FileSystem scheme to avoid NPE > > > Key: HADOOP-13140 > URL: https://issues.apache.org/jira/browse/HADOOP-13140 > Project: Hadoop Common > Issue Type: Bug > Components: fs >Affects Versions: 2.8.0 >Reporter: Brahma Reddy Battula >Assignee: Mingliang Liu > Attachments: HADOOP-13140.000.patch > > > {{org.apache.hadoop.fs.GlobalStorageStatistics#put}} does not check for a null > scheme, and the internal map will throw an NPE. This was reported by a flaky > test {{TestFileSystemApplicationHistoryStore}}. Thanks [~brahmareddy] for > reporting. > To address this, > # Fix the test by providing a valid URI, e.g. {{file:///}} > # Guard against a null scheme in {{GlobalStorageStatistics#put}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-11505) Various native parts use bswap incorrectly and unportably
[ https://issues.apache.org/jira/browse/HADOOP-11505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285233#comment-15285233 ] Colin Patrick McCabe commented on HADOOP-11505: --- I think adding a PPC nightly build would be a step in the right direction. Of course, people interested in making Hadoop work well on PPC would still have to fix occasional breakages and performance regressions. Apache is a do-ocracy, so if people want to put in the work to do it, it will get done. > Various native parts use bswap incorrectly and unportably > - > > Key: HADOOP-11505 > URL: https://issues.apache.org/jira/browse/HADOOP-11505 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.0.0-alpha1 >Reporter: Colin Patrick McCabe >Assignee: Alan Burlison >Priority: Critical > Attachments: HADOOP-11505.001.patch, HADOOP-11505.003.patch, > HADOOP-11505.004.patch, HADOOP-11505.005.patch, HADOOP-11505.006.patch, > HADOOP-11505.007.patch, HADOOP-11505.008.patch > > > hadoop-mapreduce-client-nativetask fails to use x86 optimizations in some > cases. Also, on some alternate, non-x86, non-ARM architectures the generated > code is incorrect. Thanks to Steve Loughran and Edward Nevill for finding > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HADOOP-10942) Globbing optimizations and regression fix
[ https://issues.apache.org/jira/browse/HADOOP-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283108#comment-15283108 ] Colin Patrick McCabe edited comment on HADOOP-10942 at 5/13/16 8:31 PM: The regression referred to here was fixed in HADOOP-10957. The optimizations are already implemented (we don't perform an RPC on each path component, only when we need to do so to implement a wildcard.) was (Author: cmccabe): The regression fix referred to here was fixed in HADOOP-10957. The optimizations are already implemented (we don't perform an RPC on each path component, only when we need to do so to implement a wildcard.) > Globbing optimizations and regression fix > - > > Key: HADOOP-10942 > URL: https://issues.apache.org/jira/browse/HADOOP-10942 > Project: Hadoop Common > Issue Type: Bug > Components: fs >Affects Versions: 2.1.0-beta, 3.0.0-alpha1 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Labels: BB2015-05-TBR > Attachments: HADOOP-10942.patch > > > When globbing was commonized to support both filesystem and filecontext, it > regressed a fix that prevents an intermediate glob that matches a file from > throwing a confusing permissions exception. The hdfs traverse check requires > the exec bit which a file does not have. > Additional optimizations to reduce rpcs actually increases them if > directories contain 1 item. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-10942) Globbing optimizations and regression fix
[ https://issues.apache.org/jira/browse/HADOOP-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-10942: -- Resolution: Fixed Status: Resolved (was: Patch Available) The regression fix referred to here was fixed in HADOOP-10957. The optimizations are already implemented (we don't perform an RPC on each path component, only when we need to do so to implement a wildcard.) > Globbing optimizations and regression fix > - > > Key: HADOOP-10942 > URL: https://issues.apache.org/jira/browse/HADOOP-10942 > Project: Hadoop Common > Issue Type: Bug > Components: fs >Affects Versions: 2.1.0-beta, 3.0.0-alpha1 >Reporter: Daryn Sharp >Assignee: Daryn Sharp >Priority: Critical > Labels: BB2015-05-TBR > Attachments: HADOOP-10942.patch > > > When globbing was commonized to support both filesystem and filecontext, it > regressed a fix that prevents an intermediate glob that matches a file from > throwing a confusing permissions exception. The hdfs traverse check requires > the exec bit which a file does not have. > Additional optimizations to reduce rpcs actually increases them if > directories contain 1 item. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13142) Change project version from 3.0.0 to 3.0.0-alpha1
[ https://issues.apache.org/jira/browse/HADOOP-13142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282323#comment-15282323 ] Colin Patrick McCabe commented on HADOOP-13142: --- +1. Thanks, [~andrew.wang]. > Change project version from 3.0.0 to 3.0.0-alpha1 > - > > Key: HADOOP-13142 > URL: https://issues.apache.org/jira/browse/HADOOP-13142 > Project: Hadoop Common > Issue Type: Improvement >Affects Versions: 3.0.0-alpha1 >Reporter: Andrew Wang >Assignee: Andrew Wang >Priority: Blocker > Attachments: hadoop-13142.001.patch > > > We want to rename 3.0.0 to 3.0.0-alpha1 for the first alpha release. However, > the version number is also encoded outside of the pom.xml's, so we need to > update these too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281731#comment-15281731 ] Colin Patrick McCabe commented on HADOOP-12975: --- Thanks, [~eclark]. +1 pending jenkins > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.9.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch, HADOOP-12975v3.patch, HADOOP-12975v4.patch, > HADOOP-12975v5.patch, HADOOP-12975v6.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281704#comment-15281704 ] Colin Patrick McCabe commented on HADOOP-13028: --- bq. This is something to bring up on the dev list, as it is something we essentially missed. Colin Patrick McCabe: would you care for the honour? Sure. I started a thread on common-dev. bq. Steve has \[added the stability comment\] in patch v011. Great. Here is my +1 as well. Thanks again, guys. > add low level counter metrics for S3A; use in read performance tests > > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, > HADOOP-13028-004.patch, HADOOP-13028-005.patch, HADOOP-13028-006.patch, > HADOOP-13028-007.patch, HADOOP-13028-008.patch, HADOOP-13028-009.patch, > HADOOP-13028-012.patch, HADOOP-13028-013.patch, > HADOOP-13028-branch-2-008.patch, HADOOP-13028-branch-2-009.patch, > HADOOP-13028-branch-2-010.patch, HADOOP-13028-branch-2-011.patch, > HADOOP-13028-branch-2-012.patch, HADOOP-13028-branch-2-013.patch, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt > > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281102#comment-15281102 ] Colin Patrick McCabe commented on HADOOP-13028: --- That's a good point, [~cnauroth]. I guess as long as people don't start treating this output as a stable API, it's reasonable to have debugging information there. Can we add a comment to toString stating that this output is not a stable API and should not be parsed? +1 once that is done. Thanks for working on this, [~steve_l]... it's going to be very helpful for running Hadoop on S3. > add low level counter metrics for S3A; use in read performance tests > > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, > HADOOP-13028-004.patch, HADOOP-13028-005.patch, HADOOP-13028-006.patch, > HADOOP-13028-007.patch, HADOOP-13028-008.patch, HADOOP-13028-009.patch, > HADOOP-13028-branch-2-008.patch, HADOOP-13028-branch-2-009.patch, > HADOOP-13028-branch-2-010.patch, HADOOP-13028-branch-2-011.patch, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt > > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
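Something like the following would capture the requested caveat (a sketch only; the field name and the exact wording are assumptions):
{code}
/**
 * String value includes low-level stream statistics for logging and
 * diagnostics. Note: the format of this output is NOT a stable API and
 * must not be parsed programmatically.
 */
@Override
public String toString() {
  return super.toString() + " " + streamStatistics; // assumed field
}
{code}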
[jira] [Commented] (HADOOP-11505) Various native parts use bswap incorrectly and unportably
[ https://issues.apache.org/jira/browse/HADOOP-11505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281046#comment-15281046 ] Colin Patrick McCabe commented on HADOOP-11505: --- I think it would be great to see build slaves with alternate architectures. Maybe a good place to start is by emailing the hadoop development list and talking to the infrastructure team. > Various native parts use bswap incorrectly and unportably > - > > Key: HADOOP-11505 > URL: https://issues.apache.org/jira/browse/HADOOP-11505 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Colin Patrick McCabe >Assignee: Alan Burlison >Priority: Critical > Fix For: 3.0.0 > > Attachments: HADOOP-11505.001.patch, HADOOP-11505.003.patch, > HADOOP-11505.004.patch, HADOOP-11505.005.patch, HADOOP-11505.006.patch, > HADOOP-11505.007.patch, HADOOP-11505.008.patch > > > hadoop-mapreduce-client-nativetask fails to use x86 optimizations in some > cases. Also, on some alternate, non-x86, non-ARM architectures the generated > code is incorrect. Thanks to Steve Loughran and Edward Nevill for finding > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-13065) Add a new interface for retrieving FS and FC Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13065: -- Resolution: Fixed Fix Version/s: 2.8.0 Target Version/s: 2.8.0 Status: Resolved (was: Patch Available) Committed to 2.8. Thanks, [~liuml07]. > Add a new interface for retrieving FS and FC Statistics > --- > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Fix For: 2.8.0 > > Attachments: HADOOP-13065-007.patch, HADOOP-13065.008.patch, > HADOOP-13065.009.patch, HADOOP-13065.010.patch, HADOOP-13065.011.patch, > HADOOP-13065.012.patch, HADOOP-13065.013.patch, HDFS-10175.000.patch, > HDFS-10175.001.patch, HDFS-10175.002.patch, HDFS-10175.003.patch, > HDFS-10175.004.patch, HDFS-10175.005.patch, HDFS-10175.006.patch, > TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. > Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Resolved] (HADOOP-12749) Create a threadpoolexecutor that overrides afterExecute to log uncaught exceptions/errors
[ https://issues.apache.org/jira/browse/HADOOP-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe resolved HADOOP-12749. --- Resolution: Fixed Fix Version/s: (was: 2.9.0) 2.8.0 Target Version/s: 2.8.0 Backported to 2.8 > Create a threadpoolexecutor that overrides afterExecute to log uncaught > exceptions/errors > - > > Key: HADOOP-12749 > URL: https://issues.apache.org/jira/browse/HADOOP-12749 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Fix For: 2.8.0 > > Attachments: HADOOP-12749.001.patch, HADOOP-12749.002.patch, > HADOOP-12749.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Reopened] (HADOOP-12749) Create a threadpoolexecutor that overrides afterExecute to log uncaught exceptions/errors
[ https://issues.apache.org/jira/browse/HADOOP-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe reopened HADOOP-12749: --- > Create a threadpoolexecutor that overrides afterExecute to log uncaught > exceptions/errors > - > > Key: HADOOP-12749 > URL: https://issues.apache.org/jira/browse/HADOOP-12749 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Sidharta Seethana >Assignee: Sidharta Seethana > Fix For: 2.9.0 > > Attachments: HADOOP-12749.001.patch, HADOOP-12749.002.patch, > HADOOP-12749.003.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280782#comment-15280782 ] Colin Patrick McCabe edited comment on HADOOP-13028 at 5/11/16 8:43 PM: In the past I've written code for Spark that used reflection to make use of APIs that may or may not be present in Hadoop. HBase often does this as well, so that it can use multiple versions of Hadoop. It seems like this wouldn't be a lot of code. Is that feasible in this case? I just find the argument that we should overload an existing unrelated API to output statistics very off-putting. I guess you could argue that the statistics is part of the stream state, and toString is intended to reflect stream state. But it will result in very long output from toString, which probably isn't what most existing callers want. And it's not consistent with the way any other Hadoop streams work, including other s3 ones like s3n. [~andrew.wang], [~cnauroth], [~liuml07], what do you think about this? Is it acceptable to overload {{toString}} in this way, to output statistics? The argument seems to be that this is easier than using reflection to get the actual stream statistics object. was (Author: cmccabe): In the past I've written code for Spark that used reflection to make use of APIs that may or may not be present in Hadoop. HBase often does this as well, so that it can use multiple versions of Hadoop. It seems like this wouldn't be a lot of code. Is that feasible in this case? I just find the argument that we should overload an existing unrelated API to output statistics very off-putting. It's like saying we should override hashCode to output the number of times the user called {{seek()}} on the stream. I guess you could argue that the statistics is part of the stream state, and toString is intended to reflect stream state. But it will result in very long output from toString, which probably isn't what most existing callers want. And it's not consistent with the way any other Hadoop streams work, including other s3 ones like s3n. > add low level counter metrics for S3A; use in read performance tests > > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, > HADOOP-13028-004.patch, HADOOP-13028-005.patch, HADOOP-13028-006.patch, > HADOOP-13028-007.patch, HADOOP-13028-008.patch, HADOOP-13028-009.patch, > HADOOP-13028-branch-2-008.patch, HADOOP-13028-branch-2-009.patch, > HADOOP-13028-branch-2-010.patch, HADOOP-13028-branch-2-011.patch, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt > > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
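As an illustration of the reflection approach mentioned, a hedged sketch (the {{getStreamStatistics}} name is quoted from this patch elsewhere in the thread; the probing harness itself is assumed):
{code}
static Object probeForStreamStatistics(FSDataInputStream in) {
  try {
    // Probe for the method at runtime so the same code runs against
    // Hadoop versions that lack the API.
    java.lang.reflect.Method m =
        in.getWrappedStream().getClass().getMethod("getStreamStatistics");
    return m.invoke(in.getWrappedStream());
  } catch (NoSuchMethodException e) {
    return null; // API not present in this Hadoop version
  } catch (ReflectiveOperationException e) {
    return null; // inaccessible or threw; treat as unavailable
  }
}
{code}
Returning {{Object}} keeps the caller compilable against Hadoop versions where the statistics class does not exist; a caller would then read values reflectively as well.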
[jira] [Commented] (HADOOP-13065) Add a new interface for retrieving FS and FC Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280790#comment-15280790 ] Colin Patrick McCabe commented on HADOOP-13065: --- +1 for version 13. Thanks, [~liuml07]. > Add a new interface for retrieving FS and FC Statistics > --- > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Attachments: HADOOP-13065-007.patch, HADOOP-13065.008.patch, > HADOOP-13065.009.patch, HADOOP-13065.010.patch, HADOOP-13065.011.patch, > HADOOP-13065.012.patch, HADOOP-13065.013.patch, HDFS-10175.000.patch, > HDFS-10175.001.patch, HDFS-10175.002.patch, HDFS-10175.003.patch, > HDFS-10175.004.patch, HDFS-10175.005.patch, HDFS-10175.006.patch, > TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. > Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280782#comment-15280782 ] Colin Patrick McCabe edited comment on HADOOP-13028 at 5/11/16 8:39 PM: In the past I've written code for Spark that used reflection to make use of APIs that may or may not be present in Hadoop. HBase often does this as well, so that it can use multiple versions of Hadoop. It seems like this wouldn't be a lot of code. Is that feasible in this case? I just find the argument that we should overload an existing unrelated API to output statistics very off-putting. It's like saying we should override hashCode to output the number of times the user called {{seek()}} on the stream. I guess you could argue that the statistics is part of the stream state, and toString is intended to reflect stream state. But it will result in very long output from toString which probably isn't what most existing callers want. And it's not consistent with the way any other hadoop streams work, including other s3 ones like s3n. was (Author: cmccabe): In the past I've written code for Spark that used reflection to make use of APIs that may or may not be present in Hadoop. HBase often does this as well, so that it can use multiple versions of Hadoop. It seems like this wouldn't be a lot of code. Is that feasible in this case? I just find the argument that we should overload an existing unrelated API to output statistics very off-putting. It's like saying we should override hashCode to output the number of times the user called {{seek()}} on the stream. I also find it concerning that this would be something unique to s3a and not present in the toString methods of any other filesystem (including the other s3 ones). It feels like a gross hack. > add low level counter metrics for S3A; use in read performance tests > > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, > HADOOP-13028-004.patch, HADOOP-13028-005.patch, HADOOP-13028-006.patch, > HADOOP-13028-007.patch, HADOOP-13028-008.patch, HADOOP-13028-009.patch, > HADOOP-13028-branch-2-008.patch, HADOOP-13028-branch-2-009.patch, > HADOOP-13028-branch-2-010.patch, HADOOP-13028-branch-2-011.patch, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt > > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280782#comment-15280782 ] Colin Patrick McCabe commented on HADOOP-13028: --- In the past I've written code for Spark that used reflection to make use of APIs that may or may not be present in Hadoop. HBase often does this as well, so that it can use multiple versions of Hadoop. It seems like this wouldn't be a lot of code. Is that feasible in this case? I just find the argument that we should overload an existing unrelated API to output statistics very off-putting. It's like saying we should override hashCode to output the number of times the user called {{seek()}} on the stream. I also find it concerning that this would be something unique to s3a and not present in the toString methods of any other filesystem (including the other s3 ones). It feels like a gross hack. > add low level counter metrics for S3A; use in read performance tests > > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, > HADOOP-13028-004.patch, HADOOP-13028-005.patch, HADOOP-13028-006.patch, > HADOOP-13028-007.patch, HADOOP-13028-008.patch, HADOOP-13028-009.patch, > HADOOP-13028-branch-2-008.patch, HADOOP-13028-branch-2-009.patch, > HADOOP-13028-branch-2-010.patch, HADOOP-13028-branch-2-011.patch, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt > > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-11505) Various native parts use bswap incorrectly and unportably
[ https://issues.apache.org/jira/browse/HADOOP-11505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280588#comment-15280588 ] Colin Patrick McCabe commented on HADOOP-11505: --- The problematic part of this change was making all the subprojects depend on hadoop-common. It seems like you could avoid doing that by putting all the le32to_h, etc. definitions in a standalone header file and having the other projects include that file. > Various native parts use bswap incorrectly and unportably > - > > Key: HADOOP-11505 > URL: https://issues.apache.org/jira/browse/HADOOP-11505 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Colin Patrick McCabe >Assignee: Alan Burlison >Priority: Critical > Fix For: 3.0.0 > > Attachments: HADOOP-11505.001.patch, HADOOP-11505.003.patch, > HADOOP-11505.004.patch, HADOOP-11505.005.patch, HADOOP-11505.006.patch, > HADOOP-11505.007.patch, HADOOP-11505.008.patch > > > hadoop-mapreduce-client-nativetask fails to use x86 optimizations in some > cases. Also, on some alternate, non-x86, non-ARM architectures the generated > code is incorrect. Thanks to Steve Loughran and Edward Nevill for finding > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Updated] (HADOOP-11505) Various native parts use bswap incorrectly and unportably
[ https://issues.apache.org/jira/browse/HADOOP-11505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-11505: -- Priority: Major (was: Blocker) This isn't a blocker, because the affected architectures can fall back on the non-native code for accomplishing the same things. > Various native parts use bswap incorrectly and unportably > - > > Key: HADOOP-11505 > URL: https://issues.apache.org/jira/browse/HADOOP-11505 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Colin Patrick McCabe >Assignee: Alan Burlison > Fix For: 3.0.0 > > Attachments: HADOOP-11505.001.patch, HADOOP-11505.003.patch, > HADOOP-11505.004.patch, HADOOP-11505.005.patch, HADOOP-11505.006.patch, > HADOOP-11505.007.patch, HADOOP-11505.008.patch > > > hadoop-mapreduce-client-nativetask fails to use x86 optimizations in some > cases. Also, on some alternate, non-x86, non-ARM architectures the generated > code is incorrect. Thanks to Steve Loughran and Edward Nevill for finding > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278854#comment-15278854 ] Colin Patrick McCabe commented on HADOOP-12975: --- Thanks, [~eclark].
{code}
169      // add/subtract the jitter.
170      refreshInterval +=
171          ThreadLocalRandom.current()
172              .nextLong(jitter, jitter);
{code}
Hmm, is this a typo? It seems like this is always going to return exactly 'jitter' since the 'least' and the 'bound' arguments are the same? That seems to defeat the point of randomization. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadLocalRandom.html#nextLong(long,%20long)
{code}
126     if (configuration == null) {
127       return DEFAULT_JITTER;
128     }
{code}
Can we throw an exception in {{GetSpaceUsed#build}} if {{conf == null}}? It's a weird special case to have no {{Configuration}} object, and I'm not sure why we'd ever want to do that. Then this function could just be {{return this.conf.getLong(JITTER_KEY, DEFAULT_JITTER);}}. > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.9.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch, HADOOP-12975v3.patch, HADOOP-12975v4.patch, > HADOOP-12975v5.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
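For reference, a sketch of what the jitter computation presumably intends, i.e. a uniform offset drawn from {{[-jitter, jitter)}}. This is an assumption about intent, not the patch itself:
{code}
if (jitter > 0) {
  // nextLong(least, bound) requires least < bound; with a positive
  // jitter this draws uniformly from [-jitter, jitter).
  refreshInterval += ThreadLocalRandom.current().nextLong(-jitter, jitter);
}
{code}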
[jira] [Comment Edited] (HADOOP-13065) Add a new interface for retrieving FS and FC Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278833#comment-15278833 ] Colin Patrick McCabe edited comment on HADOOP-13065 at 5/10/16 8:34 PM: Thanks, [~liuml07]. {{DFSOpsCountStatistics}} is a nice implementation. It's also nice to have this for webhdfs.
{code}
156   @Override
157   public Long getLong(String key) {
158     final OpType type = OpType.fromSymbol(key);
159     return type == null ? 0L : opsCount.get(type).get();
160   }
{code}
I think this should return null in the case where type == null, right? Indicating that there is no such statistic.
{code}
159     storageStatistics = (DFSOpsCountStatistics) GlobalStorageStatistics.INSTANCE
160         .put(DFSOpsCountStatistics.NAME,
161             new StorageStatisticsProvider() {
162               @Override
163               public StorageStatistics provide() {
164                 return new DFSOpsCountStatistics();
165               }
166             });
{code}
Hmm, I wonder if these StorageStatistics objects should be per-FS-instance rather than per-class? I guess let's do that in a follow-on, though, after this gets committed. +1 for HADOOP-13065.012.patch once the null thing is fixed was (Author: cmccabe): Thanks, [~liuml07]. {{DFSOpsCountStatistics}} is a nice implementation. It's also nice to have this for webhdfs.
{code}
156   @Override
157   public Long getLong(String key) {
158     final OpType type = OpType.fromSymbol(key);
159     return type == null ? 0L : opsCount.get(type).get();
160   }
{code}
I think this should return null in the case where type == null, right? Indicating that there is no such statistic.
{code}
159     storageStatistics = (DFSOpsCountStatistics) GlobalStorageStatistics.INSTANCE
160         .put(DFSOpsCountStatistics.NAME,
161             new StorageStatisticsProvider() {
162               @Override
163               public StorageStatistics provide() {
164                 return new DFSOpsCountStatistics();
165               }
166             });
{code}
Hmm, I wonder if these StorageStatistics objects should be per-FS-instance rather than per-class? I guess let's do that in a follow-on, though, after this gets committed. +1 once the null thing is fixed > Add a new interface for retrieving FS and FC Statistics > --- > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Attachments: HADOOP-13065-007.patch, HADOOP-13065.008.patch, > HADOOP-13065.009.patch, HADOOP-13065.010.patch, HADOOP-13065.011.patch, > HADOOP-13065.012.patch, HDFS-10175.000.patch, HDFS-10175.001.patch, > HDFS-10175.002.patch, HDFS-10175.003.patch, HDFS-10175.004.patch, > HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. 
> Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
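Applied to the snippet quoted in the comment above, the suggested change is simply (a sketch of the reviewed method with the null return applied):
{code}
@Override
public Long getLong(String key) {
  final OpType type = OpType.fromSymbol(key);
  // null (rather than 0L) signals "no statistic with this name".
  return type == null ? null : opsCount.get(type).get();
}
{code}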
[jira] [Commented] (HADOOP-13065) Add a new interface for retrieving FS and FC Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278833#comment-15278833 ] Colin Patrick McCabe commented on HADOOP-13065: --- Thanks, [~liuml07]. {{DFSOpsCountStatistics}} is a nice implementation. It's also nice to have this for webhdfs.
{code}
156   @Override
157   public Long getLong(String key) {
158     final OpType type = OpType.fromSymbol(key);
159     return type == null ? 0L : opsCount.get(type).get();
160   }
{code}
I think this should return null in the case where type == null, right? Indicating that there is no such statistic.
{code}
159     storageStatistics = (DFSOpsCountStatistics) GlobalStorageStatistics.INSTANCE
160         .put(DFSOpsCountStatistics.NAME,
161             new StorageStatisticsProvider() {
162               @Override
163               public StorageStatistics provide() {
164                 return new DFSOpsCountStatistics();
165               }
166             });
{code}
Hmm, I wonder if these StorageStatistics objects should be per-FS-instance rather than per-class? I guess let's do that in a follow-on, though, after this gets committed. +1 once the null thing is fixed > Add a new interface for retrieving FS and FC Statistics > --- > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Attachments: HADOOP-13065-007.patch, HADOOP-13065.008.patch, > HADOOP-13065.009.patch, HADOOP-13065.010.patch, HADOOP-13065.011.patch, > HADOOP-13065.012.patch, HDFS-10175.000.patch, HDFS-10175.001.patch, > HDFS-10175.002.patch, HDFS-10175.003.patch, HDFS-10175.004.patch, > HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. > Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278800#comment-15278800 ] Colin Patrick McCabe commented on HADOOP-13028: --- bq. Patrick: regarding fs.s3a.readahead.range versus calling it fs.s3a.readahead.default, I think "default" could be a bit confusing too. How about I make it clear that if setReadahead() is set, then it supersedes any previous value? Sure. bq. I absolutely need that printing in there, otherwise the value of this patch is significantly reduced. If you want me to add a line like "WARNING: UNSTABLE" or something to that string value, I'm happy to do so. Or the output is published in a way that is deliberately hard to parse by machine but which we humans can read. But without that information, we can't so easily tell which Perhaps I'm missing something, but why not just do this in {{S3AInstrumentation#InputStreamStatistics#toString}}? I don't see why this is "absolutely needed" in {{S3AInputStream#toString}}. > add low level counter metrics for S3A; use in read performance tests > > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, > HADOOP-13028-004.patch, HADOOP-13028-005.patch, HADOOP-13028-006.patch, > HADOOP-13028-007.patch, HADOOP-13028-008.patch, HADOOP-13028-009.patch, > HADOOP-13028-branch-2-008.patch, HADOOP-13028-branch-2-009.patch, > HADOOP-13028-branch-2-010.patch, HADOOP-13028-branch-2-011.patch, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt > > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276778#comment-15276778 ] Colin Patrick McCabe commented on HADOOP-13028: ---
bq. I think this is OK. The whole close method is synchronized, so we won't have two threads concurrently doing the actual close. Almost all other accesses of closed are within synchronized methods too. It's marked volatile to help with one unsynchronized access from readFully, calling into checkNotClosed. That's only a read, not an update, so volatile is sufficient.
Thanks for the explanation. I missed the interaction between {{synchronized}} and the assignment. Suggest adding a comment to the assignment in {{close()}} explaining why this is atomic, or simply using AtomicBoolean to future-proof this against later code changes.
bq. I'd like to keep \[the toString changes\]. It's very convenient for logging. TestS3AInputStreamPerformance uses it for both logging output and detailed assertion messages.
It's poor practice to rely on a Java object's toString output as a stable, parseable format. This is something that I'd like to see clarified in our compatibility documentation.
The problem is, this is not consistent with how {{toString}} operates in other FS streams. We also don't have anything in our compatibility documentation stating that the output of {{toString}} is not a stable, parseable format. We've had many, many JIRAs to "make toString act like some previous behavior" for various Hadoop classes. I think we need to accept that currently the stream's {{toString}} method is viewed as a public, stable API whether we like it or not.
How about just adding this information to the {{toString}} method of the stream statistics object? That makes more sense anyway.
> add low level counter metrics for S3A; use in read performance tests > > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, > HADOOP-13028-004.patch, HADOOP-13028-005.patch, HADOOP-13028-006.patch, > HADOOP-13028-007.patch, HADOOP-13028-008.patch, HADOOP-13028-009.patch, > HADOOP-13028-branch-2-008.patch, HADOOP-13028-branch-2-009.patch, > HADOOP-13028-branch-2-010.patch, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt > > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
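A minimal sketch of the AtomicBoolean variant being suggested, using the method names from the earlier patch excerpt as assumptions about the surrounding class:
{code}
import java.util.concurrent.atomic.AtomicBoolean;

private final AtomicBoolean closed = new AtomicBoolean(false);

@Override
public void close() throws IOException {
  // compareAndSet guarantees the close logic runs exactly once, even if
  // a later refactoring removes the method-level synchronization.
  if (closed.compareAndSet(false, true)) {
    super.close();
    closeStream("close() operation", this.contentLength);
    streamStatistics.close();
  }
}
{code}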
[jira] [Commented] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274641#comment-15274641 ] Colin Patrick McCabe commented on HADOOP-13028: ---
{code}
<property>
  <name>fs.s3a.readahead.range</name>
  <value>65536</value>
  <description>Bytes to read ahead during a seek() before closing and
    re-opening the S3 HTTP connection.</description>
</property>
{code}
Hmm, should this be {{fs.s3a.readahead.default}}? It seems like this is the default if the user doesn't call {{FSDataInputStream#setReadahead}}.
{{S3AInputStream#closed}}: it seems like this should be an {{AtomicBoolean}}. Otherwise two threads could both enter this code block, right?
{code}
if (!closed) {
  closed = true;
  super.close();
  closeStream("close() operation", this.contentLength);
  streamStatistics.close();
}
{code}
{code}
public S3AInstrumentation.InputStreamStatistics getStreamStatistics() {
{code}
Maybe this should be called {{getS3StreamStatistics}}, reflecting the fact that this API is s3-specific?
Is it really necessary to put statistics information into the {{toString}} methods of the streams? It seems like this could lead to compatibility woes, and we have the API described above to provide this information anyway.
> add low level counter metrics for S3A; use in read performance tests > > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, > HADOOP-13028-004.patch, HADOOP-13028-005.patch, HADOOP-13028-006.patch, > HADOOP-13028-007.patch, HADOOP-13028-008.patch, HADOOP-13028-009.patch, > HADOOP-13028-branch-2-008.patch, HADOOP-13028-branch-2-009.patch, > HADOOP-13028-branch-2-010.patch, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt > > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
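As a usage sketch of the interplay being discussed (the path is hypothetical): the configured value acts as the fallback, and an explicit call would supersede it:
{code}
// Sketch only; setReadahead throws IOException, handle it in real code.
FSDataInputStream in = fs.open(new Path("s3a://bucket/key"));
// With no call, the stream uses the configured fs.s3a.readahead.range.
// An explicit setReadahead is expected to supersede the configured value.
in.setReadahead(128 * 1024L);
{code}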
[jira] [Commented] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274625#comment-15274625 ] Colin Patrick McCabe commented on HADOOP-12975: --- Does this need a rebase? > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.9.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch, HADOOP-12975v3.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13065) Add a new interface for retrieving FS and FC Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15274617#comment-15274617 ] Colin Patrick McCabe commented on HADOOP-13065: ---
bq. One quick question is that, some of the storage statistics classes (e.g. GlobalStorageStatistics) are annotated as Stable; do we have to be a bit more conservative by making them Unstable before ultimately removing the Statistics?
Good question. I think that what would happen is that the old API would become deprecated in branch-2, and removed in branch-3. There isn't any need to change the annotation since we don't plan to modify the interface, just remove it.
bq. As follow-on work, 1. We can move the rack-awareness read bytes to a separate storage statistics as it's only used by HDFS, and 2. We can remove Statistics API, but keep the thread local implementation in FileSystemStorageStatistics class.
That makes sense. One thing that we've talked about doing in the past is moving these statistics to a separate java file, so that they could be used in both FileContext and FileSystem. Maybe we could call them something like ThreadLocalFsStatistics?
> Add a new interface for retrieving FS and FC Statistics > --- > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Attachments: HADOOP-13065-007.patch, HADOOP-13065.008.patch, > HADOOP-13065.009.patch, HADOOP-13065.010.patch, HDFS-10175.000.patch, > HDFS-10175.001.patch, HDFS-10175.002.patch, HDFS-10175.003.patch, > HDFS-10175.004.patch, HDFS-10175.005.patch, HDFS-10175.006.patch, > TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in-turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. > Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13065) Add a new interface for retrieving FS and FC Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273605#comment-15273605 ] Colin Patrick McCabe commented on HADOOP-13065: --- Thanks for the reviews.
bq. in FileSystem.getStatistics(), for performance, you could try using ConcurrentMap for the map, and only if it is not present create the objects and call putIfAbsent() (or use a synchronized block to create and update the maps, with a second lookup there to eliminate the small race condition). This will eliminate the sync point on a simple lookup when the entry exists.
Hmm. I don't think that we really need to optimize this function. When using the new API, the only time this function gets called is when a new FileSystem object is created, which should be very rare.
bq. For testing a way to reset/remove an entry could be handy.
We do have some tests that zero out the existing statistics objects. I'm not sure if removing the entry really gets us more coverage than we have now, since we know that it was created by this code path (therefore the code path was tested).
bq. That said, we can firstly deprecate the FileSystem#getStatistics()?
Agree.
> Add a new interface for retrieving FS and FC Statistics > --- > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Attachments: HADOOP-13065-007.patch, HADOOP-13065.008.patch, > HADOOP-13065.009.patch, HDFS-10175.000.patch, HDFS-10175.001.patch, > HDFS-10175.002.patch, HDFS-10175.003.patch, HDFS-10175.004.patch, > HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in-turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. > Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
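For reference, a sketch of the lock-free-fast-path pattern being proposed in the quoted review comment, with hypothetical class and field names rather than the actual FileSystem internals:
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class StatisticsRegistry {
  static final class Stats { /* per-scheme counters would live here */ }

  private final ConcurrentMap<String, Stats> byScheme =
      new ConcurrentHashMap<>();

  Stats get(String scheme) {
    Stats s = byScheme.get(scheme);          // lock-free fast path
    if (s == null) {
      Stats fresh = new Stats();
      Stats prior = byScheme.putIfAbsent(scheme, fresh);
      s = (prior != null) ? prior : fresh;   // losing thread discards its copy
    }
    return s;
  }
}
{code}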
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272936#comment-15272936 ] Colin Patrick McCabe commented on HADOOP-13010: --- Thanks, guys. Sorry for the delay in reviewing. We've been busy.
{{CodecUtil.java}}: there are a LOT of functions here for creating {{RawErasureEncoder}} objects. We've got:
{code}
createRSRawEncoder(Configuration conf, int numDataUnits, int numParityUnits, String codec)
createRSRawEncoder(Configuration conf, int numDataUnit, int numParityUnit)
createRSRawEncoder(Configuration conf, String codec, ErasureCoderOptions coderOptions)
createRSRawEncoder(Configuration conf, ErasureCoderOptions coderOptions)
createXORRawEncoder(Configuration conf, ErasureCoderOptions coderOptions)
createXORRawEncoder(Configuration conf, int numDataUnits, int numParityUnits)
createRawEncoder(Configuration conf, String rawCoderFactoryKey, ErasureCoderOptions coderOptions)
{code}
Plus a similar number of functions for creating decoders. Why do we have to have so many functions? Surely the codec, numParityUnits, numDataUnits, whether it is XOR or not, etc. etc. should just be included in ErasureCoderOptions. Then we could just have one function:
{code}
createRawEncoder(Configuration conf, ErasureCoderOptions coderOptions)
{code}
On a related note, why does each particular type of encoder need its own factory? It seems like we just need a static function for each encoder type that takes a Configuration and ErasureCoderOptions, and we're good to go. We can locate these static functions via reflection.
{code}
protected void doDecode(DecodingState decodingState, byte[][] inputs,
    int[] inputOffsets, int[] erasedIndexes, byte[][] outputs,
    int[] outputOffsets) {
{code}
Can we just include the inputs, inputOffsets, erasedIndexes, outputs, outputOffsets in {{DecodingState}}?
> Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch, > HADOOP-13010-v3.patch, HADOOP-13010-v4.patch, HADOOP-13010-v5.patch > > > This will refactor raw erasure coders according to some comments received so > far. > * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to > rely class inheritance to reuse the codes, instead they can be moved to some > utility. > * Suggested by [~jingzhao] somewhere quite some time ago, better to have a > state holder to keep some checking results for later reuse during an > encode/decode call. > This would not get rid of some inheritance levels as doing so isn't clear yet > for the moment and also incurs big impact. I do wish the end result by this > refactoring will make all the levels more clear and easier to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
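To illustrate the consolidation, a hedged sketch of a single factory entry point. The configuration key pattern and the options accessors are assumptions made for this example, not the existing CodecUtil API:
{code}
public static RawErasureEncoder createRawEncoder(Configuration conf,
    ErasureCoderOptions opts) throws IOException {
  // Hypothetical key pattern: one configurable raw coder per codec name.
  String key = "io.erasurecode.codec." + opts.getCodecName() + ".rawcoder";
  Class<? extends RawErasureEncoder> clazz =
      conf.getClass(key, null, RawErasureEncoder.class);
  if (clazz == null) {
    throw new IOException("No raw encoder configured for codec " +
        opts.getCodecName());
  }
  try {
    // One reflective constructor shape shared by every encoder type.
    return clazz.getConstructor(Configuration.class,
        ErasureCoderOptions.class).newInstance(conf, opts);
  } catch (ReflectiveOperationException e) {
    throw new IOException("Failed to instantiate " + clazz.getName(), e);
  }
}
{code}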
[jira] [Commented] (HADOOP-13079) dfs -ls -q prints non-printable characters
[ https://issues.apache.org/jira/browse/HADOOP-13079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272802#comment-15272802 ] Colin Patrick McCabe commented on HADOOP-13079: ---
bq. It's not a security bug for the reasons you think it's a security bug. After all, wc, find, du, ... tons of other UNIX commands will happily print out terminal escape sequences with no option to turn them off. It is, however, problematic for traditional ftpd implementations since it's a great way to inject buffer overflows and then get root on a remote server.
This behavior is exploitable. That makes it a security bug, even if lots of traditional UNIX commands have it. Just because a behavior is traditional doesn't mean it's right. There was a time when UNIX programs used {{gets()}} everywhere. When the world became a less trusting place, they had to be fixed not to do that. We should understand the motivations behind historical decisions before blindly copying them.
bq. ... and my answer is the same as it was almost a decade ago, in some HDFS JIRA somewhere, where a related topic came up before: HDFS would be better served by having a limit on what consists of a legal file and directory name. With an unlimited namespace, it's impossible to test against and impossible to protect every scenario in which oddball characters show up. What's legal in one locale may not be legal in another.
That's a very good suggestion. I think we should tackle that for Hadoop 3.
bq. Also, are you prepared to file a CVE for every single time Hadoop prints out a directory or file name to the screen? There are probably hundreds if not thousands of places, obvious ones like 'fs -count' and less obvious ones like 'yarn logs'. This is a 'tilting at windmills' problem. It is MUCH better to have ls blow up than be taken by surprise by something else later on.
The problem is, {{ls}} isn't necessarily going to "blow up," just display something odd, or even cause your xterm to run arbitrary code by abusing escape sequences.
> dfs -ls -q prints non-printable characters > -- > > Key: HADOOP-13079 > URL: https://issues.apache.org/jira/browse/HADOOP-13079 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: John Zhuge >Assignee: John Zhuge > > Add option {{-q}} to "hdfs dfs -ls" to print non-printable characters as "?". > Non-printable characters are defined by > [isprint(3)|http://linux.die.net/man/3/isprint] according to the current > locale. > Default to {{-q}} behavior on terminal; otherwise, print raw characters. See > the difference in these 2 command lines: > * {{hadoop fs -ls /dir}} > * {{hadoop fs -ls /dir | od -c}} > In C, {{isatty(STDOUT_FILENO)}} is used to find out whether the output is a > terminal. Since Java doesn't have {{isatty}}, I will use JNI to call C > {{isatty()}} because the closest test {{System.console() == null}} does not > work in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13079) dfs -ls -q prints non-printable characters
[ https://issues.apache.org/jira/browse/HADOOP-13079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271848#comment-15271848 ] Colin Patrick McCabe commented on HADOOP-13079: ---
bq. a) It's not standardized behavior amongst all of the platforms that Apache Hadoop runs on
Linux, OpenBSD, FreeBSD, and OS X pick the behavior of hiding control characters in {{ls}} by default. That may not be "all the platforms that Apache Hadoop runs on," but it's certainly the vast majority of real-world deployments. The remaining important platform, Windows, doesn't deal with terminals and control characters in quite the same way, so is probably not vulnerable in any case. In any case, the fact that the behavior isn't standardized is not a valid argument either way. Clearly Hadoop needs to pick one behavior or the other. Lack of standardization doesn't dictate that we have to pick one behavior or the other. And certainly it doesn't dictate that we should pick an unpopular and surprising behavior that almost nobody has experience with.
bq. b) It's not expected behavior relative to the rest of Apache Hadoop
The fact that one component has a security bug doesn't dictate that the other components also need to have the same security bug. This is like arguing that we can't fix a buffer overflow in one component because then it wouldn't match all the other buffer-overflowable components.
bq. c) It's not feasible to actually make it expected behavior compared to the rest of Apache Hadoop given the proliferation of places where raw file and directory names are printed to the console
The only places we've discussed here are ls and fsck. Perhaps there are more, but it hardly seems infeasible to change them based on what we've talked about so far. Perhaps log files are also an issue, but only for people who tail the log file of the server. And to reiterate, a security flaw in X doesn't mean we should reproduce the same security flaw in Y.
At the end of the day, this is a security vulnerability and it needs to be fixed. I asked you before: "Should the filename be able to use control characters to hijack the admin's GNU screen session and execute arbitrary code? I would say no, what do you say?" I would repeat the same question again. I understand that you have a personal preference for running without {{\-q}}. However, it is not constructive to -1 a patch fixing a security vulnerability without suggesting an alternate way of fixing that vulnerability. If this stays unfixed, it will probably get a CVE number.
> dfs -ls -q prints non-printable characters > -- > > Key: HADOOP-13079 > URL: https://issues.apache.org/jira/browse/HADOOP-13079 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: John Zhuge >Assignee: John Zhuge > > Add option {{-q}} to "hdfs dfs -ls" to print non-printable characters as "?". > Non-printable characters are defined by > [isprint(3)|http://linux.die.net/man/3/isprint] according to the current > locale. > Default to {{-q}} behavior on terminal; otherwise, print raw characters. See > the difference in these 2 command lines: > * {{hadoop fs -ls /dir}} > * {{hadoop fs -ls /dir | od -c}} > In C, {{isatty(STDOUT_FILENO)}} is used to find out whether the output is a > terminal. Since Java doesn't have {{isatty}}, I will use JNI to call C > {{isatty()}} because the closest test {{System.console() == null}} does not > work in some cases.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13079) Add -q to fs -ls to print non-printable characters
[ https://issues.apache.org/jira/browse/HADOOP-13079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269527#comment-15269527 ] Colin Patrick McCabe commented on HADOOP-13079: ---
bq. Yup, I can't think of why -q should be the default either... but more importantly, neither could POSIX to the point that it demanded the standard have -q be the default.
Please do not misquote what I said. I was arguing that echoing control characters to the terminal should not be the default behavior. You are arguing the opposite.
bq. ... until such a point that they print the filename to the screen to show what files are being processed. At which point this change has accomplished absolutely nothing. Changing ls is security theater.
There are a lot of scripts that interact with HDFS via FsShell. These scripts will never "print the filename to the screen" or if they do, it will be a filename that they got from {{ls}} itself which does not contain control characters. I could come up with examples of how this is helpful all day if needed. Here's another one: Some sysadmin logs in and does a {{hadoop fs -ls}} of a directory created by {{\$BADGUY}}. Should the filename be able to use control characters to hijack the admin's GNU screen session and execute arbitrary code? I would say no, what do you say?
bq. Are we going to change cat too?
Most system administrators will not {{cat}} a file without checking what type it is. It is well-known that catting an unknown file could mess up the terminal. On the other hand, most system administrators do not think that running {{ls}} on a directory could be a security risk. Linux and other well known operating systems also do not protect users from this, so there are no pre-existing expectations of protection.
bq. Then stop bringing up (traditional) UNIX if you feel it isn't relevant and especially when you've used the term incorrectly.
There are a huge number of sysadmins who grew up with the GNU tools, which do have the behavior we're describing here. It's a powerful argument for implementing that behavior. When you add the fact that it fixes security vulnerabilities, it's an extremely compelling argument. I think it's clear that this change does have a big positive effect in many scenarios, does fix real-world security flaws, and does accord with the expectations of most system administrators. That's three powerful reasons to do it. I can find no valid counter-argument for any of these reasons anywhere in these comments.
> Add -q to fs -ls to print non-printable characters > -- > > Key: HADOOP-13079 > URL: https://issues.apache.org/jira/browse/HADOOP-13079 > Project: Hadoop Common > Issue Type: Bug >Reporter: John Zhuge >Assignee: John Zhuge > Labels: supportability > > Add option {{-q}} to "hdfs dfs -ls" to print non-printable characters as "?". > Non-printable characters are defined by > [isprint(3)|http://linux.die.net/man/3/isprint] according to the current > locale. > Default to {{-q}} behavior on terminal; otherwise, print raw characters. See > the difference in these 2 command lines: > * {{hadoop fs -ls /dir}} > * {{hadoop fs -ls /dir | od -c}} > In C, {{isatty(STDOUT_FILENO)}} is used to find out whether the output is a > terminal. Since Java doesn't have {{isatty}}, I will use JNI to call C > {{isatty()}} because the closest test {{System.console() == null}} does not > work in some cases.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13079) Add -q to fs -ls to print non-printable characters
[ https://issues.apache.org/jira/browse/HADOOP-13079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269375#comment-15269375 ] Colin Patrick McCabe commented on HADOOP-13079: --- OK, so Linux is technically a UNIX-like system rather than a licensee of the UNIX trademark. I don't feel that this is relevant to the discussion here. I feel like you are just being pedantic. Linux's behavior is still the one that most people compare our behavior to, whether we like it or not. And Linux's behavior is to hide control characters by default in ls. More importantly, Linux's behavior makes more sense than the other behavior you are suggesting. Dumping control characters out on an interactive terminal is a security vulnerability as well as a giant annoyance. I can't think of a single reason why we would want this to be the default. > Add -q to fs -ls to print non-printable characters > -- > > Key: HADOOP-13079 > URL: https://issues.apache.org/jira/browse/HADOOP-13079 > Project: Hadoop Common > Issue Type: Bug >Reporter: John Zhuge >Assignee: John Zhuge > Labels: supportability > > Add option {{-q}} to "hdfs dfs -ls" to print non-printable characters as "?". > Non-printable characters are defined by > [isprint(3)|http://linux.die.net/man/3/isprint] according to the current > locale. > Default to {{-q}} behavior on terminal; otherwise, print raw characters. See > the difference in these 2 command lines: > * {{hadoop fs -ls /dir}} > * {{hadoop fs -ls /dir | od -c}} > In C, {{isatty(STDOUT_FILENO)}} is used to find out whether the output is a > terminal. Since Java doesn't have {{isatty}}, I will use JNI to call C > {{isatty()}} because the closest test {{System.console() == null}} does not > work in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13079) Add -q to fs -ls to print non-printable characters
[ https://issues.apache.org/jira/browse/HADOOP-13079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268981#comment-15268981 ] Colin Patrick McCabe commented on HADOOP-13079: --- Thank you for the background information. I wasn't aware that the default of suppressing non-printing characters was "optional" according to POSIX. I think the important thing is that we've established that:
* Suppressing non-printing characters by default fixes several serious security vulnerabilities, including some that have CVEs,
* This suppression behavior is explicitly allowed by POSIX,
* The most popular UNIX system on Earth, Linux, implements this behavior, so nobody will be surprised by it.
bq. Essentially interactive sessions with stdin redirected \[falsely show up as non-interactive from Java\]
I guess my concern about adding a JNI dependency here is that it will make things too nondeterministic. I've seen too many clusters where JNI was improperly configured.
> Add -q to fs -ls to print non-printable characters > -- > > Key: HADOOP-13079 > URL: https://issues.apache.org/jira/browse/HADOOP-13079 > Project: Hadoop Common > Issue Type: Bug >Reporter: John Zhuge >Assignee: John Zhuge > Labels: supportability > > Add option {{-q}} to "hdfs dfs -ls" to print non-printable characters as "?". > Non-printable characters are defined by > [isprint(3)|http://linux.die.net/man/3/isprint] according to the current > locale. > Default to {{-q}} behavior on terminal; otherwise, print raw characters. See > the difference in these 2 command lines: > * {{hadoop fs -ls /dir}} > * {{hadoop fs -ls /dir | od -c}} > In C, {{isatty(STDOUT_FILENO)}} is used to find out whether the output is a > terminal. Since Java doesn't have {{isatty}}, I will use JNI to call C > {{isatty()}} because the closest test {{System.console() == null}} does not > work in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13065) Add a new interface for retrieving FS and FC Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268961#comment-15268961 ] Colin Patrick McCabe commented on HADOOP-13065: --- Interesting post... I wasn't aware that AtomicLong etc. had performance issues. However, I don't think we need an API for updating metrics. We only need an API for _reading_ metrics. The current read API in this patch supports reading primitive longs, which should work well with {{AtomicLongFieldUpdater}}, or whatever else we want to use. > Add a new interface for retrieving FS and FC Statistics > --- > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Attachments: HADOOP-13065-007.patch, HDFS-10175.000.patch, > HDFS-10175.001.patch, HDFS-10175.002.patch, HDFS-10175.003.patch, > HDFS-10175.004.patch, HDFS-10175.005.patch, HDFS-10175.006.patch, > TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in-turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. > Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
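For background, a minimal sketch of the {{AtomicLongFieldUpdater}} pattern alluded to above (class and field names are illustrative only): updates go through one shared static updater instead of allocating an AtomicLong per counter, while reads remain plain volatile loads that satisfy a getLong-style read API:
{code}
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

class OpCounters {
  private static final AtomicLongFieldUpdater<OpCounters> READ_OPS_UPDATER =
      AtomicLongFieldUpdater.newUpdater(OpCounters.class, "readOps");

  // The field must be a volatile long for the updater to bind to it.
  private volatile long readOps;

  void incrementReadOps() {
    READ_OPS_UPDATER.incrementAndGet(this);
  }

  long getReadOps() {
    return readOps;   // a read is just a volatile load, no extra object
  }
}
{code}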
[jira] [Commented] (HADOOP-13079) Add -q to fs -ls to print non-printable characters
[ https://issues.apache.org/jira/browse/HADOOP-13079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266989#comment-15266989 ] Colin Patrick McCabe commented on HADOOP-13079: ---
bq. No way should -q be the default under any circumstances. That is extremely surprising behavior that will definitely break stuff.
It's not surprising, because it matches the traditional UNIX / Linux behavior. In Linux, {{/bin/ls}} will not print control characters by default; you must pass the {{--show-control-chars}} option in order to see them. From the man page:
{code}
--show-control-chars
       show non graphic characters as-is (default unless program is 'ls'
       and output is a terminal)
{code}
{{ls}} blasting raw control characters into an interactive terminal is a very bad idea. It leads to some very serious security vulnerabilities because commonly used software like {{xterm}}, {{GNU screen}}, {{tmux}} and so forth interpret control characters. Using control characters, you can convince these pieces of software to execute arbitrary code. See http://marc.info/?l=bugtraq&m=104612710031920&q=p3 and https://www.proteansec.com/linux/blast-past-executing-code-terminal-emulators-via-escape-sequences/ There are even CVEs for some of these issues. We should make the default opt-in for printing control characters in our next compatibility-breaking release (Hadoop 3.x).
bq. In C, isatty(STDOUT_FILENO) is used to find out whether the output is a terminal. Since Java doesn't have isatty, I will use JNI to call C isatty() because the closest test System.console() == null does not work in some cases.
It would really be nice if we could determine this without using JNI, because it's often not available. Under what conditions does the {{System.console() == null}} check not work? The only case I was able to find in a quick Google search was inside an eclipse console. That seems like a case where the security issues would not be a concern, because it's a debugging environment. Are there other cases where the non-JNI check would fail?
> Add -q to fs -ls to print non-printable characters > -- > > Key: HADOOP-13079 > URL: https://issues.apache.org/jira/browse/HADOOP-13079 > Project: Hadoop Common > Issue Type: Bug >Reporter: John Zhuge >Assignee: John Zhuge > Labels: supportability > > Add option {{-q}} to "hdfs dfs -ls" to print non-printable characters as "?". > Non-printable characters are defined by > [isprint(3)|http://linux.die.net/man/3/isprint] according to the current > locale. > Default to {{-q}} behavior on terminal; otherwise, print raw characters. See > the difference in these 2 command lines: > * {{hadoop fs -ls /dir}} > * {{hadoop fs -ls /dir | od -c}} > In C, {{isatty(STDOUT_FILENO)}} is used to find out whether the output is a > terminal. Since Java doesn't have {{isatty}}, I will use JNI to call C > {{isatty()}} because the closest test {{System.console() == null}} does not > work in some cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
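For context, a tiny self-contained illustration of the Java-only check under discussion and why it can misreport:
{code}
public class TtyCheck {
  public static void main(String[] args) {
    // System.console() returns non-null only when the JVM has an
    // interactive console attached; it effectively requires both stdin
    // and stdout to be a terminal. An interactive session with stdin
    // redirected is therefore misreported as non-interactive -- the gap
    // a JNI isatty(STDOUT_FILENO) call would close.
    System.out.println("terminal? " + (System.console() != null));
  }
}
{code}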
[jira] [Updated] (HADOOP-13072) WindowsGetSpaceUsed constructor should be public
[ https://issues.apache.org/jira/browse/HADOOP-13072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13072: -- Resolution: Fixed Fix Version/s: 2.8.0 Status: Resolved (was: Patch Available)
> WindowsGetSpaceUsed constructor should be public > > > Key: HADOOP-13072 > URL: https://issues.apache.org/jira/browse/HADOOP-13072 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Labels: windows > Fix For: 2.8.0 > > Attachments: HADOOP-13072-01.patch, HADOOP-13072-02.patch > > > WindowsGetSpaceUsed constructor should be made public. > Otherwise building using builder will not work.
> {noformat}
> 2016-04-29 12:49:37,455 [Thread-108] WARN fs.GetSpaceUsed$Builder (GetSpaceUsed.java:build(127)) - Doesn't look like the class class org.apache.hadoop.fs.WindowsGetSpaceUsed have the needed constructor
> java.lang.NoSuchMethodException: org.apache.hadoop.fs.WindowsGetSpaceUsed.<init>(org.apache.hadoop.fs.GetSpaceUsed$Builder)
>         at java.lang.Class.getConstructor0(Unknown Source)
>         at java.lang.Class.getConstructor(Unknown Source)
>         at org.apache.hadoop.fs.GetSpaceUsed$Builder.build(GetSpaceUsed.java:118)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:165)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:915)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:907)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:413)
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13072) WindowsGetSpaceUsed constructor should be public
[ https://issues.apache.org/jira/browse/HADOOP-13072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266874#comment-15266874 ] Colin Patrick McCabe commented on HADOOP-13072: --- +1. Thanks, [~vinayrpet].
> WindowsGetSpaceUsed constructor should be public > > > Key: HADOOP-13072 > URL: https://issues.apache.org/jira/browse/HADOOP-13072 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Labels: windows > Attachments: HADOOP-13072-01.patch, HADOOP-13072-02.patch > > > WindowsGetSpaceUsed constructor should be made public. > Otherwise building using builder will not work.
> {noformat}
> 2016-04-29 12:49:37,455 [Thread-108] WARN fs.GetSpaceUsed$Builder (GetSpaceUsed.java:build(127)) - Doesn't look like the class class org.apache.hadoop.fs.WindowsGetSpaceUsed have the needed constructor
> java.lang.NoSuchMethodException: org.apache.hadoop.fs.WindowsGetSpaceUsed.<init>(org.apache.hadoop.fs.GetSpaceUsed$Builder)
>         at java.lang.Class.getConstructor0(Unknown Source)
>         at java.lang.Class.getConstructor(Unknown Source)
>         at org.apache.hadoop.fs.GetSpaceUsed$Builder.build(GetSpaceUsed.java:118)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:165)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:915)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:907)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:413)
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13072) WindowsGetSpaceUsed constructor should be public
[ https://issues.apache.org/jira/browse/HADOOP-13072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264909#comment-15264909 ] Colin Patrick McCabe commented on HADOOP-13072: --- Thanks, [~vinayrpet] and [~steve_l]. +1 once the line is trimmed to 80 characters and jenkins has run.
> WindowsGetSpaceUsed constructor should be public > > > Key: HADOOP-13072 > URL: https://issues.apache.org/jira/browse/HADOOP-13072 > Project: Hadoop Common > Issue Type: Bug >Affects Versions: 2.8.0 >Reporter: Vinayakumar B >Assignee: Vinayakumar B > Labels: windows > Attachments: HADOOP-13072-01.patch > > > WindowsGetSpaceUsed constructor should be made public. > Otherwise building using builder will not work.
> {noformat}
> 2016-04-29 12:49:37,455 [Thread-108] WARN fs.GetSpaceUsed$Builder (GetSpaceUsed.java:build(127)) - Doesn't look like the class class org.apache.hadoop.fs.WindowsGetSpaceUsed have the needed constructor
> java.lang.NoSuchMethodException: org.apache.hadoop.fs.WindowsGetSpaceUsed.<init>(org.apache.hadoop.fs.GetSpaceUsed$Builder)
>         at java.lang.Class.getConstructor0(Unknown Source)
>         at java.lang.Class.getConstructor(Unknown Source)
>         at org.apache.hadoop.fs.GetSpaceUsed$Builder.build(GetSpaceUsed.java:118)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:165)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:915)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:907)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:413)
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261354#comment-15261354 ] Colin Patrick McCabe commented on HADOOP-13028: --- Thanks, [~steve_l]. I withdraw my -1, provided we don't add any new public APIs in this patch. I'm out tomorrow and Friday but hopefully I'll have a chance to review it next week (if someone doesn't review it first). > add low level counter metrics for S3A; use in read performance tests > > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, > HADOOP-13028-004.patch, HADOOP-13028-005.patch, HADOOP-13028-006.patch, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt > > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HADOOP-13065) Add a new interface for retrieving FS and FC Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13065: -- Attachment: HADOOP-13065-007.patch > Add a new interface for retrieving FS and FC Statistics > --- > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Attachments: HADOOP-13065-007.patch, HDFS-10175.000.patch, > HDFS-10175.001.patch, HDFS-10175.002.patch, HDFS-10175.003.patch, > HDFS-10175.004.patch, HDFS-10175.005.patch, HDFS-10175.006.patch, > TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in-turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. > Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HADOOP-13065) Add a new interface for retrieving FS and FC Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-13065: -- Summary: Add a new interface for retrieving FS and FC Statistics (was: add per-operation stats to FileSystem.Statistics) > Add a new interface for retrieving FS and FC Statistics > --- > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Attachments: HDFS-10175.000.patch, HDFS-10175.001.patch, > HDFS-10175.002.patch, HDFS-10175.003.patch, HDFS-10175.004.patch, > HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in-turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. > Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-13065) add per-operation stats to FileSystem.Statistics
[ https://issues.apache.org/jira/browse/HADOOP-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261118#comment-15261118 ] Colin Patrick McCabe commented on HADOOP-13065: --- Thanks, [~liuml07]. Based on the discussion today, it sounds like we would like to have both global statistics per FS class, and per-instance statistics for an individual FS or FC instance. The rationale for this is that in some cases we might want to differentiate between, say, the stats when talking to one s3 bucket, and another s3 bucket. Or another example is the stats talking to one HDFS FS versus another HDFS FS (if we are using federation, or just multiple HDFS instances). We talked a bit about metrics2, but there were several things that made it not a good fit for this statistics interface. One issue is that metrics2 assumes that statistics are permanent once created. Effectively, it keeps them around until the JVM terminates. metrics2 also tends to use a fair amount of memory and require a fair amount of boilerplate code compared to other solutions. Finally, because it is global, it can't do per-instance stats very effectively. It would be nice for the new statistics interface to provide the same stats which are currently provided by FileSystem#Statistics. This would allow us to deprecate and eventually remove FileSystem#Statistics as a public interface (although we might keep the implementation). This could be done only in a new release of Hadoop, of course. We also talked about the benefits of providing an iterator over all statistics rather than a map of all statistics. Relatedly, we talked about the desire to have a new interface that was abstract enough to accommodate new, more efficient implementations in the future. For now, the new interface will deal with per-FS stats, but not per-stream ones. We should revisit per-stream statistics later. > add per-operation stats to FileSystem.Statistics > > > Key: HADOOP-13065 > URL: https://issues.apache.org/jira/browse/HADOOP-13065 > Project: Hadoop Common > Issue Type: Improvement > Components: fs >Reporter: Ram Venkatesh >Assignee: Mingliang Liu > Attachments: HDFS-10175.000.patch, HDFS-10175.001.patch, > HDFS-10175.002.patch, HDFS-10175.003.patch, HDFS-10175.004.patch, > HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java > > > Currently FileSystem.Statistics exposes the following statistics: > BytesRead > BytesWritten > ReadOps > LargeReadOps > WriteOps > These are in-turn exposed as job counters by MapReduce and other frameworks. > There is logic within DfsClient to map operations to these counters that can > be confusing, for instance, mkdirs counts as a writeOp. > Proposed enhancement: > Add a statistic for each DfsClient operation including create, append, > createSymlink, delete, exists, mkdirs, rename and expose them as new > properties on the Statistics object. The operation-specific counters can be > used for analyzing the load imposed by a particular job on HDFS. > For example, we can use them to identify jobs that end up creating a large > number of files. > Once this information is available in the Statistics object, the app > frameworks like MapReduce can expose them as additional counters to be > aggregated and recorded as part of job summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
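A sketch of the iterator-oriented shape discussed above; the class and method names are illustrative of the direction, not a quotation of any committed interface:
{code}
import java.util.Iterator;

public abstract class PerInstanceStorageStatistics {
  /** One named long counter. */
  public static final class LongStatistic {
    private final String name;
    private final long value;
    public LongStatistic(String name, long value) {
      this.name = name;
      this.value = value;
    }
    public String getName() { return name; }
    public long getValue() { return value; }
  }

  /**
   * Iterating, rather than returning a Map snapshot, leaves room for
   * implementations that generate values lazily or more efficiently.
   */
  public abstract Iterator<LongStatistic> getLongStatistics();

  /** @return the named statistic, or null if no such statistic exists. */
  public abstract Long getLong(String name);
}
{code}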
[jira] [Commented] (HADOOP-13028) add low level counter metrics for S3A; use in read performance tests
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258713#comment-15258713 ] Colin Patrick McCabe commented on HADOOP-13028: --- It looks really good, [~steve_l]. Just to avoid misunderstandings, I'll drop a -1 here until we finish discussing what the interface should be... I look forward to giving this a review as soon as we figure that out. > add low level counter metrics for S3A; use in read performance tests > > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Attachments: HADOOP-13028-001.patch, HADOOP-13028-002.patch, > HADOOP-13028-004.patch, HADOOP-13028-005.patch, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt, > org.apache.hadoop.fs.s3a.scale.TestS3AInputStreamPerformance-output.txt > > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-12949) Add metrics and HTrace to the s3a connector
[ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257083#comment-15257083 ] Colin Patrick McCabe commented on HADOOP-12949: --- Since HADOOP-13028 is focusing on metrics for s3a, let's focus this JIRA on just HTrace integration. It's a good idea to read up on HDFS-10175 as well, since we've been discussing what interface(s) we'd like the FS metrics to have in the future there. > Add metrics and HTrace to the s3a connector > --- > > Key: HADOOP-12949 > URL: https://issues.apache.org/jira/browse/HADOOP-12949 > Project: Hadoop Common > Issue Type: Improvement > Components: fs/s3 >Reporter: Madhawa Gunasekara >Assignee: Madhawa Gunasekara > > Hi All, > s3, GCS, WASB, and other cloud blob stores are becoming increasingly > important in Hadoop. But we don't have distributed tracing for these yet. It > would be interesting to add distributed tracing here. It would enable > collecting really interesting data like probability distributions of PUT and > GET requests to s3 and their impact on MR jobs, etc. > I would like to implement this feature, Please shed some light on this > Thanks, > Madhawa -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HADOOP-12949) Add HTrace to the s3a connector
[ https://issues.apache.org/jira/browse/HADOOP-12949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-12949: -- Summary: Add HTrace to the s3a connector (was: Add metrics and HTrace to the s3a connector) > Add HTrace to the s3a connector > --- > > Key: HADOOP-12949 > URL: https://issues.apache.org/jira/browse/HADOOP-12949 > Project: Hadoop Common > Issue Type: Improvement > Components: fs/s3 >Reporter: Madhawa Gunasekara >Assignee: Madhawa Gunasekara > > Hi All, > s3, GCS, WASB, and other cloud blob stores are becoming increasingly > important in Hadoop. But we don't have distributed tracing for these yet. It > would be interesting to add distributed tracing here. It would enable > collecting really interesting data like probability distributions of PUT and > GET requests to s3 and their impact on MR jobs, etc. > I would like to implement this feature, Please shed some light on this > Thanks, > Madhawa -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-13028) add counter and timer metrics for S3A HTTP & low-level operations
[ https://issues.apache.org/jira/browse/HADOOP-13028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254358#comment-15254358 ] Colin Patrick McCabe commented on HADOOP-13028: --- Hi [~steve_l], This is a really interesting idea. I think this ties in with some of the discussions we've been having on HDFS-10175 with adding a way to fetch arbitrary statistics from FileSystem (and FileContext) instances. Basically, HDFS-10175 provides a way for MR to enumerate all the statistics and their values. It also provides interfaces for finding just one statistic, of course. This would also enable the use of those statistics in unit tests, since the stats could be per-FS rather than global per type. > add counter and timer metrics for S3A HTTP & low-level operations > - > > Key: HADOOP-13028 > URL: https://issues.apache.org/jira/browse/HADOOP-13028 > Project: Hadoop Common > Issue Type: Sub-task > Components: fs/s3, metrics >Affects Versions: 2.8.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > > against S3 (and other object stores), opening connections can be expensive, > closing connections may be expensive (a sign of a regression). > S3A FS and individual input streams should have counters of the # of > open/close/failure+reconnect operations, timers of how long things take. This > can be used downstream to measure efficiency of the code (how often > connections are being made), connection reliability, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250781#comment-15250781 ] Colin Patrick McCabe commented on HADOOP-13010: ---
bq. Thanks for the discussions Colin / Kai. The "dummy" coders are for tests only. Either name sounds fine to me. No-op is a more accurate technical description, while dummy better states the purpose (and therefore prevents users from actually using it). Maybe we can leave that one open and move this refactor forward.
Yeah, I don't have a strong opinion on "Dummy" versus "NoOp." Either name could work. It also seems reasonable to let users configure this to diagnose issues in the field. So it makes sense to keep it in src/ rather than test/.
bq. The suggested way and sample codes look great. It consolidates configurations and coder options together and has an advantage that the coder options will also be configurable. I will use it.
Great! Looking forward to the next revision. Thanks again, [~drankye] and [~zhz].
> Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Fix For: 3.0.0 > > Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch, > HADOOP-13010-v3.patch > > > This will refactor raw erasure coders according to some comments received so > far. > * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to > rely class inheritance to reuse the codes, instead they can be moved to some > utility. > * Suggested by [~jingzhao] somewhere quite some time ago, better to have a > state holder to keep some checking results for later reuse during an > encode/decode call. > This would not get rid of some inheritance levels as doing so isn't clear yet > for the moment and also incurs big impact. I do wish the end result by this > refactoring will make all the levels more clear and easier to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246922#comment-15246922 ] Colin Patrick McCabe commented on HADOOP-13010: --- {{ErasureCoderConf#setCoderOption}} / {{ErasureCoderConf#getCoderOption}}: I don't see why we need to have these. If these options are generic to all erasure encoders, then they can just go as "regular java fields" like {{ErasureCoderConf#numDataUnits}}, etc. On the other hand, if these options only apply to one type of Coder, then they should be stored in the particular type of coder they apply to. The usual way to do this is to have your Encoder / Decoder class take a Configuration object as an argument, and pull out whatever values it needs. For example, you might have code like this:
{code}
FoobarEncoder(Configuration conf) {
  this.coderConf = new ErasureCoderConf(conf);
  this.foobarity = conf.getLong("foobarity", 123);
}
{code}
The idea is that things that are specific to a class go in that class, rather than trying to handle it with casts to and from Object. Also, mutable configuration is unpleasant (what happens if you call {{ErasureCoderConf#setCoderOption}} when the Encoder / Decoder has already been created?). It seems like what we actually want to do in this case is not modify the configuration, but build a new Encoder / Decoder with a new configuration. > Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Fix For: 3.0.0 > > Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch, > HADOOP-13010-v3.patch > > > This will refactor raw erasure coders according to some comments received so > far. > * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to > rely class inheritance to reuse the codes, instead they can be moved to some > utility. > * Suggested by [~jingzhao] somewhere quite some time ago, better to have a > state holder to keep some checking results for later reuse during an > encode/decode call. > This would not get rid of some inheritance levels as doing so isn't clear yet > for the moment and also incurs big impact. I do wish the end result by this > refactoring will make all the levels more clear and easier to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
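To make the immutability point above concrete, here is a minimal sketch of what an immutable {{ErasureCoderConf}} could look like; the field set and names are illustrative assumptions, not the patch itself:

{code}
// Hedged sketch only: an immutable coder configuration, so "changing" an
// option means constructing a new coder rather than mutating a live one.
public final class ErasureCoderConf {
  private final int numDataUnits;
  private final int numParityUnits;

  public ErasureCoderConf(int numDataUnits, int numParityUnits) {
    this.numDataUnits = numDataUnits;
    this.numParityUnits = numParityUnits;
  }

  public int getNumDataUnits() { return numDataUnits; }
  public int getNumParityUnits() { return numParityUnits; }

  @Override
  public String toString() { // the conf object can print itself
    return "ErasureCoderConf(" + numDataUnits + ", " + numParityUnits + ")";
  }
}
{code}

With a conf like this, reconfiguring a coder is simply {{new FoobarEncoder(newConf)}}; there is no window in which an existing Encoder / Decoder can observe a half-updated option.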
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246913#comment-15246913 ] Colin Patrick McCabe commented on HADOOP-13010: --- bq. Yeah, it's good to document this stateless property in JavaDoc. Note by the way it doesn't mean these encoder/decoder are to support concurrency though it's possible. I would leave this for future consideration. Sure. In that case, we should document that these objects are not guaranteed to be thread-safe, so that there is no confusion. bq. Ah yes the names (EncoderState/DecoderState) are bad, actually I meant them to be EncodingState/DecodingState. OK. bq. Ok, I'm probably convinced by you. Thanks for the lots of insights. I got rid of the base class anyway, and introduced ErasureCoderConf for the variables and methods in it. As you might check the updated patch, there are some duplicate of small shortcuts between the encoder base class and decoder base class as they now lack a common base. I suppose it's acceptable. Great. > Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Fix For: 3.0.0 > > Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch, > HADOOP-13010-v3.patch > > > This will refactor raw erasure coders according to some comments received so > far. > * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to > rely class inheritance to reuse the codes, instead they can be moved to some > utility. > * Suggested by [~jingzhao] somewhere quite some time ago, better to have a > state holder to keep some checking results for later reuse during an > encode/decode call. > This would not get rid of some inheritance levels as doing so isn't clear yet > for the moment and also incurs big impact. I do wish the end result by this > refactoring will make all the levels more clear and easier to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243883#comment-15243883 ] Colin Patrick McCabe commented on HADOOP-13010: --- bq. Yes you're right I meant some data to be shared between multiple concurrent encode or decode operations. The data only makes sense for a coder instance (binds a schema) so it's not suitable to be static; on the other hand it's also decode call specific so it's also not suitable to reside in the coder instance. Thanks for the explanation. It sounds like {{Encoder}} and {{Decoder}} will be stateless once they're created. Basically, they just reflect an algorithm and some data required to implement that algorithm. That is reasonable. We should document this in the JavaDoc for the classes. I also agree that we should add a new class to represent the state which the Decoder / Encoder is manipulating. The problem with calling this class {{DecoderState}} is that this name suggests that it is the state of the coder, rather than state manipulated by the coder. Perhaps calling this {{DecoderData}} or {{DecoderStream}} is more appropriate. Having this new class will also avoid the need to manually pass around so many arrays and indices. bq. AbstractRawErasureCoder maintains conf, schema related info like numDataUnits, and coderOptions. It provides public methods (9 ones) to access these fields. All of these are essentials to a erasure coder and common to both encoders and decoders. If we move the variables and methods to a utility class, it wouldn't look better, and we have to duplicate the methods across encoder and decoder. These methods and fields are configuration methods and configuration fields. They belong in a class named something like "ErasureEncodingConfiguration" or something like that. I also feel that configuration should be immutable once it is created, since otherwise things get very messy. We use this pattern in many other cases in Hadoop: for example in {{DfsClientConf}} and {{DNConf.java}}. Why is having the configuration in a separate object better than having the configuration in a base class? A few reasons: * It's easier to follow the flow of control. You don't have to jump from file to file to figure out which method is actually getting called (any subclass could override the base class methods we have now). * It's obvious what a CoderConfiguration class does. It manages the configuration. It's not obvious what the base class does without reading all of the source. * The configuration class can have a way of printing itself (object-orientation) Many of these are reasons why the gang of four recommended "*favor composition over inheritance*." bq. Yes it's interesting. I just thought of an exact match for the current codes. In JRE sasl framework, it has interfaces SaslClient and SaslSever, abstract classes AbstractSaslImpl and GssKrb5Base, class GssKrb5Client extends GssKrb5Base implements SaslClient, and class GssKrb5Server extends GssKrb5Base implements SaslServer. I'm not sure we followed the style but I guess it could be a common pattern for a bit of complex situation. I thought that's why when it initially went in this way people understood the codes and I heard no other voice then. We have to understand the reasons behind using a pattern. The reason to separate interfaces from Abstract classes is that some implementations of the interface may not want to use the code in the Abstract class. 
Since that is not the case here, it's not a good idea to copy this pattern. bq. Generally and often, I have to admit that I'm more a OOP guy and prefer to have clear construct over concept and abstract, rather than mixed utilities. We can see many example utilities in the codebase that are rather lengthy and messy, which intends to break modularity. That's probably why I'm not feeling so well to get rid of the level and replace it with utilities here. I agree with you that sometimes composition is good to reuse some codes to avoid complex inheritance relationships, but here we do have a coder concept and the construct for it wouldn't be bad to have. Using composition doesn't mean putting everything into utilities. Often, it means grouping related things into objects. Instead of having 20 fields sprayed into a base class that other classes inherit from, you have a small number of utility classes such as Configuration, Log, etc. that other classes reuse by composition (owning an instance of them). This also makes it easy to change the code later. Changing inheritance relationships often breaks backwards compatibility. Removing a field or adding a new one almost never does. > Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 >
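A minimal sketch of the composition shape described above (class names are assumptions for illustration; {{ErasureCoderConf}} refers to the immutable conf sketched earlier): the coder owns a conf instance rather than inheriting a spray of fields from a base class.

{code}
// Hedged sketch: reuse by composition. The encoder *owns* its configuration
// object; nothing here inherits from an abstract coder base class.
public final class FoobarEncoder {
  private final ErasureCoderConf coderConf; // owned, not inherited

  public FoobarEncoder(ErasureCoderConf coderConf) {
    this.coderConf = coderConf;
  }

  public void encode(byte[][] inputs, byte[][] outputs) {
    // The flow of control stays in this file: no jumping to a base class
    // to discover which overridden accessor is actually being called.
    int d = coderConf.getNumDataUnits();
    int p = coderConf.getNumParityUnits();
    // ... actual Reed-Solomon math elided for brevity ...
  }
}
{code}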
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243509#comment-15243509 ] Colin Patrick McCabe commented on HADOOP-13010: --- bq. The problem is, a decoder associates expensive coding buffers and computed coding matrices, which would be good to stay in CPU core near enough caches for the performance. The cached data is per decoder, not only schema specific, but also erasure index specific in decode call, so it's not good to keep the cache out of decoder, but still makes sense to cache it because in HDFS side it's repeatedly called in a loop for a large block size (64k cell size -> 256mb block size). You might have a check about the native codes for native coders about the expensive buffers and data cached in every decode call. We had benchmarked the coders and showed this optimization obtained great speedup. Java InputStreams are similar to here, but not exactly because it's pure view-only and leverages OS/IO level caches for file reading stuffs. If I understand correctly, you're making the case that there is data (such as matrices) which should be shared between multiple concurrent encode or decode operations. If that's the case, then let's make that data static and share it between all instances. But I still think that Encoder/Decoder should manage its own buffers rather than having them passed in on every call. bq. Having the common base class would allow encoder and decoder to share common properties, not just configurations, but also schema info and some options. We can also say that encoder and decoder are also coders, which allows to write some common behaviors to deal with coders, not encoder or decoder specific. I understand it should also work by composition, but right now I don't see very much benefits to switch this from one style to the other, or troubles if we don't. Hmm. The only state in {{AbstractRawErasureCoder.java}} is configuration state. I don't see why we need this class. Everything in there could and should be a utility function. The benefit of getting rid of this class is that with a shallower inheritance hierarchy, it's easier to understand what's going on. To continue the analogy with Java, InputStream and OutputStream don't share a common base class. bq. It sounds better not to have the interfaces since the benefit is obvious. So in summary how about having these classes (no interface) now: still AbstractRawErasureCoder, RawErasureEncoder/Decoder (no Abstract prefix now, with the original interface combined), and all kinds of concrete inherent encoders/decoders. All client codes will declare RawErasureEncoder/Decoder type when creating instances. It seems reasonable, but I don't see the need for AbstractRawErasureCoder. > Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Fix For: 3.0.0 > > Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch > > > This will refactor raw erasure coders according to some comments received so > far. > * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to > rely class inheritance to reuse the codes, instead they can be moved to some > utility. > * Suggested by [~jingzhao] somewhere quite some time ago, better to have a > state holder to keep some checking results for later reuse during an > encode/decode call. 
> This would not get rid of some inheritance levels as doing so isn't clear yet > for the moment and also incurs big impact. I do wish the end result by this > refactoring will make all the levels more clear and easier to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
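For the "make that data static and share it between all instances" suggestion above, one possible shape is a schema-keyed static cache; this is a hedged sketch with placeholder names and a placeholder matrix computation, not Hadoop code:

{code}
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch: expensive schema-specific tables are computed once and
// shared by all coder instances, instead of being passed in on every call.
// Callers must treat the returned array as read-only.
final class CoderMatrixCache {
  private static final ConcurrentHashMap<String, byte[]> CACHE =
      new ConcurrentHashMap<>();

  static byte[] encodeMatrix(int numDataUnits, int numParityUnits) {
    String key = numDataUnits + ":" + numParityUnits;
    return CACHE.computeIfAbsent(key,
        k -> computeEncodeMatrix(numDataUnits, numParityUnits));
  }

  private static byte[] computeEncodeMatrix(int d, int p) {
    // Placeholder: a real coder would build its generator matrix here.
    return new byte[(d + p) * d];
  }
}
{code}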
[jira] [Comment Edited] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242157#comment-15242157 ] Colin Patrick McCabe edited comment on HADOOP-13010 at 4/14/16 11:53 PM: - bq. The underlying buffer for the empty trunk is assumed read-only and will only be used to zero coding buffers. Making the entire function safe and also private is a good idea because in practice that level should be good enough. Right. The arrays themselves are read-only. But we still have to control access to the pointer to the array, which is not read-only and which needs to be accessed in a thread-safe fashion. bq. For pure Java coders that use byte array and on-heap bytebuffer, this way to zero buffers is efficient (perhaps the most one but I'm not totally sure); to zero direct bytebuffer the more efficient way would be to use an empty direct bytebuffer. I don't optimize this because pure Java coder is better not to use direct bytebuffer overall. Note native coders will prefer direct bytebuffer but won't need to bump into this, as we discussed in HADOOP-11540. Yeah, the JNI encoders can be more efficient, so we don't have to worry about optimizing this. I was just commenting that it's unfortunate that we have to keep around the empty array. bq. Ok. Comment can be made here to tell the null indexes include erased units and the units that's not to read. The function just finds null array entries. What these entries mean is up to the caller. bq. Because I want \[the first element\] to return fast considering it's the most often case. I don't see any evidence that adding a special case makes this faster than just running the loop. The loop starts at the first element anyway. If the loop usually stops after the first iteration, I would expect the just-in-time compiler to optimize this code. Let's get rid of the special case, unless we have some benchmarks showing that it helps. bq. \[Decoders are\] intended not to be stateful, thus many threads can use the same decoder instance. I'm not sure all the existing coders are already good in this aspect, but effort will be made to achieve so if necessary, not sure all be done here. Part of the appeal of object-oriented programming is to combine the data with the methods used to operate on that data. I'm not sure why we would want to keep the decoder state separate from the decoder functions. If we want to do multiple decode operations in parallel, we can just create multiple Decoder objects, right? Java InputStreams don't have an InputStreamState that you have to pass in to every function. Instead, if you want multiple views of the same file, you just create multiple streams. It seems like we can take the same approach here. was (Author: cmccabe): bq. The underlying buffer for the empty trunk is assumed read-only and will only be used to zero coding buffers. Making the entire function safe and also private is a good idea because in practice that level should be good enough. Right. The arrays themselves are read-only. But we still have to control access to the pointer to the array, which is not read-only and which needs to be accessed in a thread-safe fashion. bq. For pure Java coders that use byte array and on-heap bytebuffer, this way to zero buffers is efficient (perhaps the most one but I'm not totally sure); to zero direct bytebuffer the more efficient way would be to use an empty direct bytebuffer. I don't optimize this because pure Java coder is better not to use direct bytebuffer overall. 
Note native coders will prefer direct bytebuffer but won't need to bump into this, as we discussed in HADOOP-11540. Yeah, the JNI encoders can be more efficient, so we don't have to worry about optimizing this. I was just commenting that it's unfortunate that we have to keep around the empty array. bq. Ok. Comment can be made here to tell the null indexes include erased units and the units that's not to read. The function just finds null array entries. What these entries mean is up to the caller. bq. Because I want \[the first element\] to return fast considering it's the most often case. I don't see any evidence that adding a special case makes this faster than just running the loop. The loop starts at the first element anyway. If the loop usually stops after the first iteration, I would expect the just-in-time compiler to optimize this code. Let's get rid of the special case, unless we have some benchmarks showing that it helps. bq. \[Decoders are\] intended not to be stateful, thus many threads can use the same decoder instance. I'm not sure all the existing coders are already good in this aspect, but effort will be made to achieve so if necessary, not sure all be done here. Part of the appeal of object-oriented programming is to combine the data with the methods used to operate on that data. I'm not sure why we would want to keep the decoder state separate from the decoder functions. If we want to do multiple decode operations in parallel, we can just create multiple Decoder objects, right?
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242157#comment-15242157 ] Colin Patrick McCabe commented on HADOOP-13010: --- bq. The underlying buffer for the empty trunk is assumed read-only and will only be used to zero coding buffers. Making the entire function safe and also private is a good idea because in practice that level should be good enough. Right. The arrays themselves are read-only. But we still have to control access to the pointer to the array, which is not read-only and which needs to be accessed in a thread-safe fashion. bq. For pure Java coders that use byte array and on-heap bytebuffer, this way to zero buffers is efficient (perhaps the most one but I'm not totally sure); to zero direct bytebuffer the more efficient way would be to use an empty direct bytebuffer. I don't optimize this because pure Java coder is better not to use direct bytebuffer overall. Note native coders will prefer direct bytebuffer but won't need to bump into this, as we discussed in HADOOP-11540. Yeah, the JNI encoders can be more efficient, so we don't have to worry about optimizing this. I was just commenting that it's unfortunate that we have to keep around the empty array. bq. Ok. Comment can be made here to tell the null indexes include erased units and the units that's not to read. The function just finds null array entries. What these entries mean is up to the caller. bq. Because I want \[the first element\] to return fast considering it's the most often case. I don't see any evidence that adding a special case makes this faster than just running the loop. The loop starts at the first element anyway. If the loop usually stops after the first iteration, I would expect the just-in-time compiler to optimize this code. Let's get rid of the special case, unless we have some benchmarks showing that it helps. bq. \[Decoders are\] intended not to be stateful, thus many threads can use the same decoder instance. I'm not sure all the existing coders are already good in this aspect, but effort will be made to achieve so if necessary, not sure all be done here. Part of the appeal of object-oriented programming is to combine the data with the methods used to operate on that data. I'm not sure why we would want to keep the decoder state separate from the decoder functions. If we want to do multiple decode operations in parallel, we can just create multiple Decoder objects, right? > Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Fix For: 3.0.0 > > Attachments: HADOOP-13010-v1.patch, HADOOP-13010-v2.patch > > > This will refactor raw erasure coders according to some comments received so > far. > * As discussed in HADOOP-11540 and suggested by [~cmccabe], better not to > rely class inheritance to reuse the codes, instead they can be moved to some > utility. > * Suggested by [~jingzhao] somewhere quite some time ago, better to have a > state holder to keep some checking results for later reuse during an > encode/decode call. > This would not get rid of some inheritance levels as doing so isn't clear yet > for the moment and also incurs big impact. I do wish the end result by this > refactoring will make all the levels more clear and easier to follow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
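The "multiple Decoder objects" idea above, sketched under assumed names (the simplified {{RawErasureDecoder}} shape here is illustrative, not the real interface):

{code}
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hedged sketch: each parallel decode task constructs its own decoder,
// analogous to opening one InputStream per reader rather than sharing one.
final class ParallelDecode {
  interface RawErasureDecoder {
    void decode(ByteBuffer[] inputs, int[] erasedIndexes, ByteBuffer[] outputs);
  }

  static void decodeAll(Supplier<RawErasureDecoder> factory,
                        ByteBuffer[][] stripes, int[] erasedIndexes,
                        ByteBuffer[][] outputs) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (int i = 0; i < stripes.length; i++) {
      final int idx = i;
      pool.submit(() -> {
        RawErasureDecoder decoder = factory.get(); // one instance per task
        decoder.decode(stripes[idx], erasedIndexes, outputs[idx]);
      });
    }
    pool.shutdown();
  }
}
{code}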
[jira] [Comment Edited] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241805#comment-15241805 ] Colin Patrick McCabe edited comment on HADOOP-12975 at 4/14/16 8:08 PM: bq. But a percentage is chosen as it makes the jitter scale with anyone who changes du periods. If it's a set number then someone with a refresh period of days won't get any benefit from the jitter. Hmm. It seems like a fixed amount of jitter still provides a benefit, even to someone with a longer refresh interval. Let's say my refresh period is 7 days. At the end of that, I would still appreciate having my DU processes launch at slightly different times on the 7th day, rather than all launching at once. My concern with varying based on a percentage is that there will be enormous variations in how long different volumes go between DU operations, when longer refresh intervals are in use. Like if I have a 7 day period and one volume refreshes after 3.5 days, and the other waits for the full 7 days, that's quite a variation. Similarly, if our period is short -- like 1 hour-- having some datanodes refresh after only 30 minutes seems unwelcome. That's why I suggested a fixed jitter amount, to be configured by the sysadmin. I don't feel very strongly about this, though, so if you want to make it percentage-based, that's fine too. As long as it's configurable and the defaults are reasonable. I definitely think that a maximum jitter percentage of 0.15 or 0.20 seems more reasonable than 0.5. was (Author: cmccabe): bq. But a percentage is chosen as it makes the jitter scale with anyone who changes du periods. If it's a set number then someone with a refresh period of days won't get any benefit from the jitter. Hmm. It seems like a fixed amount of jitter still provides a benefit, even to someone with a longer refresh interval. Let's say my refresh period is 7 days. At the end of that, I would still appreciate having my DU processes launch at slightly different times on the 7th day, rather than all launching at once. My concern with varying based on a percentage is that there will be enormous variations in how long different volumes go between DU operations, when longer refresh intervals are in use. Like if I have a 7 day period and one volume refreshes after 3.5 days, and the other waits for the full 7 days, that's quite a variation. Similarly, if our period is short -- like 1 hour-- having some datanodes refresh after only 30 minutes seems unwelcome. That's why I suggested a fixed jitter amount, to be configured by the sysadmin. I don't feel very strongly about this, though, so if you want to make it percentage-based, that's fine too. As long as it's configurable and the defaults are reasonable. > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.9.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241805#comment-15241805 ] Colin Patrick McCabe edited comment on HADOOP-12975 at 4/14/16 8:07 PM: bq. But a percentage is chosen as it makes the jitter scale with anyone who changes du periods. If it's a set number then someone with a refresh period of days won't get any benefit from the jitter. Hmm. It seems like a fixed amount of jitter still provides a benefit, even to someone with a longer refresh interval. Let's say my refresh period is 7 days. At the end of that, I would still appreciate having my DU processes launch at slightly different times on the 7th day, rather than all launching at once. My concern with varying based on a percentage is that there will be enormous variations in how long different volumes go between DU operations, when longer refresh intervals are in use. Like if I have a 7 day period and one volume refreshes after 3.5 days, and the other waits for the full 7 days, that's quite a variation. Similarly, if our period is short -- like 1 hour-- having some datanodes refresh after only 30 minutes seems unwelcome. That's why I suggested a fixed jitter amount, to be configured by the sysadmin. I don't feel very strongly about this, though, so if you want to make it percentage-based, that's fine too. As long as it's configurable and the defaults are reasonable. was (Author: cmccabe): bq. But a percentage is chosen as it makes the jitter scale with anyone who changes du periods. If it's a set number then someone with a refresh period of days won't get any benefit from the jitter. Hmm. It seems like a fixed amount of jitter still provides a benefit, even to someone with a longer refresh interval. Let's say my refresh period is 7 days. At the end of that, I would still appreciate having my DU processes launch at slightly different times on the 7th day, rather than all launching at once. My concern with varying based on a percentage is that there will be enormous variations in how long different volumes go between DU operations, when longer refresh intervals are in use. Like if I have a 7 day period and one volume refreshes after 3.5 days, and the other waits for the full 7 days, that's quite a variation. Similarly, if our period is short -- like 1 hour-- having some datanodes refresh after only 30 minutes seems unwelcome. That's why I suggested a fixed jitter amount, to be configured by the sysadmin. I don't feel very strongly about this, though, so if you want to make it percentage-based, that's fine too. As long as its configurable and the defaults are reasonable. > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.9.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241805#comment-15241805 ] Colin Patrick McCabe edited comment on HADOOP-12975 at 4/14/16 8:07 PM: bq. But a percentage is chosen as it makes the jitter scale with anyone who changes du periods. If it's a set number then someone with a refresh period of days won't get any benefit from the jitter. Hmm. It seems like a fixed amount of jitter still provides a benefit, even to someone with a longer refresh interval. Let's say my refresh period is 7 days. At the end of that, I would still appreciate having my DU processes launch at slightly different times on the 7th day, rather than all launching at once. My concern with varying based on a percentage is that there will be enormous variations in how long different volumes go between DU operations, when longer refresh intervals are in use. Like if I have a 7 day period and one volume refreshes after 3.5 days, and the other waits for the full 7 days, that's quite a variation. Similarly, if our period is short -- like 1 hour-- having some datanodes refresh after only 30 minutes seems unwelcome. That's why I suggested a fixed jitter amount, to be configured by the sysadmin. I don't feel very strongly about this, though, so if you want to make it percentage-based, that's fine too. As long as its configurable and the defaults are reasonable. was (Author: cmccabe): bq. But a percentage is chosen as it makes the jitter scale with anyone who changes du periods. If it's a set number then someone with a refresh period of days won't get any benefit from the jitter. Hmm. It seems like a fixed amount of jitter still provides a benefit, even to someone with a longer refresh interval. Let's say my refresh period is 7 days. At the end of that, I would still appreciate having my DU processes launch at slightly different times on the 7th day, rather than all launching at once. My concern with varying based on a percentage is that there will be enormous variations in how long different volumes go between DU operations, when longer refresh intervals are in use. Like if I have a 7 day period and one volume refreshes after 3.5 days, and the other ways for the full 7 days, that's quite a variation. Similarly, if our period is short -- like 1 hour-- having some datanodes refresh after only 30 minutes seems unwelcome. That's why I suggested a fixed jitter amount, to be configured by the sysadmin. I don't feel very strongly about this, though, so if you want to make it percentage-based, that's fine too. As long as its configurable and the defaults are reasonable. > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.9.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-12975) Add jitter to CachingGetSpaceUsed's thread
[ https://issues.apache.org/jira/browse/HADOOP-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241805#comment-15241805 ] Colin Patrick McCabe commented on HADOOP-12975: --- bq. But a percentage is chosen as it makes the jitter scale with anyone who changes du periods. If it's a set number then someone with a refresh period of days won't get any benefit from the jitter. Hmm. It seems like a fixed amount of jitter still provides a benefit, even to someone with a longer refresh interval. Let's say my refresh period is 7 days. At the end of that, I would still appreciate having my DU processes launch at slightly different times on the 7th day, rather than all launching at once. My concern with varying based on a percentage is that there will be enormous variations in how long different volumes go between DU operations, when longer refresh intervals are in use. Like if I have a 7 day period and one volume refreshes after 3.5 days, and the other waits for the full 7 days, that's quite a variation. Similarly, if our period is short -- like 1 hour-- having some datanodes refresh after only 30 minutes seems unwelcome. That's why I suggested a fixed jitter amount, to be configured by the sysadmin. I don't feel very strongly about this, though, so if you want to make it percentage-based, that's fine too. As long as it's configurable and the defaults are reasonable. > Add jitter to CachingGetSpaceUsed's thread > -- > > Key: HADOOP-12975 > URL: https://issues.apache.org/jira/browse/HADOOP-12975 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: 2.9.0 >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12975v0.patch, HADOOP-12975v1.patch, > HADOOP-12975v2.patch > > > Running DU across lots of disks is very expensive and running all of the > processes at the same time creates a noticeable IO spike. We should add some > jitter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
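A small sketch contrasting the two jitter schemes being weighed above; the method names and clamping policy are illustrative assumptions, not the patch:

{code}
import java.util.concurrent.ThreadLocalRandom;

// Hedged sketch of the two options: fixed jitter window vs. percentage of
// the refresh interval.
final class RefreshJitter {
  // Fixed window: interval +/- jitterMs, independent of the interval length.
  static long withFixedJitter(long intervalMs, long jitterMs) {
    return intervalMs
        + ThreadLocalRandom.current().nextLong(-jitterMs, jitterMs + 1);
  }

  // Percentage window: interval +/- pct * interval. With pct = 0.5 a 7-day
  // period can fire after only 3.5 days -- the variation questioned above.
  static long withPercentJitter(long intervalMs, double pct) {
    long window = Math.max(1, (long) (intervalMs * pct));
    return intervalMs
        + ThreadLocalRandom.current().nextLong(-window, window + 1);
  }
}
{code}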
[jira] [Comment Edited] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241679#comment-15241679 ] Colin Patrick McCabe edited comment on HADOOP-13010 at 4/14/16 6:29 PM: Thanks for this, [~drankye]. Looks good overall! I like the idea of moving some of the utility stuff into {{CoderUtil.java}}.
{code}
static byte[] getEmptyChunk(int leastLength) {
  if (emptyChunk.length >= leastLength) {
    return emptyChunk; // In most time
  }
  synchronized (AbstractRawErasureCoder.class) {
    emptyChunk = new byte[leastLength];
  }
  return emptyChunk;
}
{code}
This isn't safe for multiple threads, since we could be reading {{CoderUtil#emptyChunk}} while it's in the middle of being written. You must either make this {{volatile}} or hold the lock for this entire function. It's unfortunate that we need a function like this-- I was hoping that there would be some more efficient way of zeroing a ByteBuffer. One thing that's a little concerning here is that a caller could modify the array returned by {{getEmptyChunk}}, which would cause problems for other callers. To avoid this, it's probably better to make this {{private}} to {{CoderUtil.java}}.
{code}
static ByteBuffer convertInputBuffer(byte[] input, int offset, int len) {
{code}
Hmm. This name seems a bit confusing. What this function does has nothing to do with whether the buffer is for "input" versus "output"-- it's just copying data from an array to a {{DirectByteBuffer}}. It's also not so much a "conversion" as a "copy". Maybe something like {{cloneAsDirectByteBuffer}} would be a better name?
{code}
static <T> int[] getErasedOrNotToReadIndexes(T[] inputs) {
{code}
Should be named {{getNullIndexes}}?
{code}
static <T> T findFirstValidInput(T[] inputs) {
  if (inputs.length > 0 && inputs[0] != null) {
    return inputs[0];
  }
  for (T input : inputs) {
    if (input != null) {
      return input;
    }
  }
  ...
{code}
Why do we need the special case for the first element here?
{code}
static <T> void makeValidIndexes(T[] inputs, int[] validIndexes) {
{code}
Should be named {{getNonNullIndexes}}? Also, why does this one take an array passed in, whereas {{getNullIndexes}} returns an array? I also don't see how the caller is supposed to know how many of the array slots were used by the function. If the array starts as all zeros, that is identical to the function putting a zero in the first element of the array and then returning, right? Perhaps we could mandate that the caller set all the array slots to a negative value before calling the function, but that seems like an awkward calling convention-- and certainly one that should be documented via JavaDoc.
{code}
@Override
protected void doDecode(DecoderState decoderState, ByteBuffer[] inputs,
    int[] erasedIndexes, ByteBuffer[] outputs) {
{code}
I'm not sure why we wouldn't just store {{DecoderState}} in the {{Decoder}}? These are stateful objects, I assume. Continuing my comments from earlier:
* {{AbstractRawErasureCoder}} -- why do we need this base class? Its function seems to be just storing configuration values. Perhaps we'd be better off just having an {{ErasureEncodingConfiguration}} class which other objects can own (not inherit from). I think of a configuration as something you *own*, not something you *are*, which is why I think composition would make more sense here. Also, is it possible for this to be immutable? Mutable configuration is a huge headache (another reason I dislike {{Configured.java}})
* {{AbstractRawErasureEncoder}} / {{AbstractRawErasureDecoder}} -- why are these classes separate from {{RawErasureEncoder}} / {{RawErasureDecoder}}? Do we expect that any encoders will implement {{RawErasureEncoder}}, but not extend {{AbstractRawErasureEncoder}}? If not, it would be better just to have two base classes here rather than 2 classes and 2 interfaces. Base classes are also easier to extend in the future than interfaces because you can add new methods without breaking backwards compatibility (as long as you have a default in the base).
* {{DummyRawDecoder}} -- {{NoOpRawDecoder}} would be a better name than "Dummy". Is this intended to be used just in unit tests, or is it something the end-user should be able to configure? If it is just unit tests, it should be under a {{test}} path, rather than a {{main}} path... i.e. {{hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/erasurecode/rawcoder/DummyRawDecoder.java}}
was (Author: cmccabe): Thanks for this, [~drankye]. Looks good overall! I like the idea of moving some of the utility stuff into {{CoderUtil.java}}.
{code}
static byte[] getEmptyChunk(int leastLength) {
  if (emptyChunk.length >= leastLength) {
    return emptyChunk; // In most time
{code}
[jira] [Commented] (HADOOP-13010) Refactor raw erasure coders
[ https://issues.apache.org/jira/browse/HADOOP-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241679#comment-15241679 ] Colin Patrick McCabe commented on HADOOP-13010: --- Thanks for this, [~drankye]. Looks good overall! I like the idea of moving some of the utility stuff into {{CoderUtil.java}}.
{code}
static byte[] getEmptyChunk(int leastLength) {
  if (emptyChunk.length >= leastLength) {
    return emptyChunk; // In most time
  }
  synchronized (AbstractRawErasureCoder.class) {
    emptyChunk = new byte[leastLength];
  }
  return emptyChunk;
}
{code}
This isn't safe for multiple threads, since we could be reading {{CoderUtil#emptyChunk}} while it's in the middle of being written. You must either make this {{volatile}} or hold the lock for this entire function. It's unfortunate that we need a function like this-- I was hoping that there would be some more efficient way of zeroing a ByteBuffer. One thing that's a little concerning here is that a caller could modify the array returned by {{getEmptyChunk}}, which would cause problems for other callers. To avoid this, it's probably better to make this {{private}} to {{CoderUtil.java}}.
{code}
static ByteBuffer convertInputBuffer(byte[] input, int offset, int len) {
{code}
Hmm. This name seems a bit confusing. What this function does has nothing to do with whether the buffer is for "input" versus "output"-- it's just copying data from an array to a {{DirectByteBuffer}}. It's also not so much a "conversion" as a "copy". Maybe something like {{cloneAsDirectByteBuffer}} would be a better name?
{code}
static <T> int[] getErasedOrNotToReadIndexes(T[] inputs) {
{code}
Should be named {{getNullIndexes}}?
{code}
static <T> T findFirstValidInput(T[] inputs) {
  if (inputs.length > 0 && inputs[0] != null) {
    return inputs[0];
  }
  for (T input : inputs) {
    if (input != null) {
      return input;
    }
  }
  ...
{code}
Why do we need the special case for the first element here?
{code}
static <T> void makeValidIndexes(T[] inputs, int[] validIndexes) {
{code}
Should be named {{getNonNullIndexes}}? Also, why does this one take an array passed in, whereas {{getNullIndexes}} returns an array? I also don't see how the caller is supposed to know how many of the array slots were used by the function. If the array starts as all zeros, that is identical to the function putting a zero in the first element of the array and then returning, right? Perhaps we could mandate that the caller set all the array slots to a negative value before calling the function, but that seems like an awkward calling convention-- and certainly one that should be documented via JavaDoc.
{code}
@Override
protected void doDecode(DecoderState decoderState, ByteBuffer[] inputs,
    int[] erasedIndexes, ByteBuffer[] outputs) {
{code}
I'm not sure why we wouldn't just store {{DecoderState}} in the {{Decoder}}? These are stateful objects, I assume. Continuing my comments from earlier:
* {{AbstractRawErasureCoder}} -- why do we need this base class? Its function seems to be just storing configuration values. Perhaps we'd be better off just having an {{ErasureEncodingConfiguration}} class which other objects can own (not inherit from). I think of a configuration as something you *own*, not something you *are*, which is why I think composition would make more sense here. Also, is it possible for this to be immutable? Mutable configuration is a huge headache (another reason I dislike {{Configured.java}})
* {{AbstractRawErasure{En,De}coder}} -- why are these classes separate from {{RawErasureEncoder}} / {{RawErasureDecoder}}? Do we expect that any encoders will implement {{RawErasureEncoder}}, but not extend {{AbstractRawErasureEncoder}}? If not, it would be better just to have two base classes here rather than 2 classes and 2 interfaces. Base classes are also easier to extend in the future than interfaces because you can add new methods without breaking backwards compatibility (as long as you have a default in the base).
* {{DummyRawDecoder}} -- {{NoOpRawDecoder}} would be a better name than "Dummy". Is this intended to be used just in unit tests, or is it something the end-user should be able to configure? If it is just unit tests, it should be under a {{test}} path, rather than a {{main}} path... i.e. {{hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/erasurecode/rawcoder/DummyRawDecoder.java}}
> Refactor raw erasure coders > --- > > Key: HADOOP-13010 > URL: https://issues.apache.org/jira/browse/HADOOP-13010 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Kai Zheng >Assignee: Kai Zheng > Fix For: 3.0.0 > > Attachments: HADOOP-
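One way to address the {{getEmptyChunk}} race described above is a {{volatile}} field read once per call, with growth under the lock; a hedged sketch (the field placement in {{CoderUtil}} and the initial size are assumptions):

{code}
// Hedged sketch of the thread-safety fix discussed above: a volatile field
// so readers never observe a half-published array, growth under a lock.
final class CoderUtil {
  private static volatile byte[] emptyChunk = new byte[4096];

  private static byte[] getEmptyChunk(int leastLength) {
    byte[] chunk = emptyChunk;              // single volatile read
    if (chunk.length >= leastLength) {
      return chunk;                         // the common case
    }
    synchronized (CoderUtil.class) {
      if (emptyChunk.length < leastLength) {
        emptyChunk = new byte[leastLength]; // safely published via volatile
      }
      return emptyChunk;
    }
  }
}
{code}

Keeping the method {{private}} also addresses the concern about callers mutating the shared array.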
[jira] [Updated] (HADOOP-12973) make DU pluggable
[ https://issues.apache.org/jira/browse/HADOOP-12973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Patrick McCabe updated HADOOP-12973: -- Resolution: Fixed Fix Version/s: 2.8.0 Target Version/s: 2.8.0 Status: Resolved (was: Patch Available) > make DU pluggable > - > > Key: HADOOP-12973 > URL: https://issues.apache.org/jira/browse/HADOOP-12973 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Elliott Clark >Assignee: Elliott Clark > Fix For: 2.8.0 > > Attachments: HADOOP-12973v0.patch, HADOOP-12973v1.patch, > HADOOP-12973v10.patch, HADOOP-12973v11.patch, HADOOP-12973v12.patch, > HADOOP-12973v13.patch, HADOOP-12973v2.patch, HADOOP-12973v3.patch, > HADOOP-12973v5.patch, HADOOP-12973v6.patch, HADOOP-12973v7.patch, > HADOOP-12973v8.patch, HADOOP-12973v9.patch > > > If people are concerned about replacing the call to DU. Then an easy first > step is to make it pluggable. Then it's possible to replace it with something > while leaving the default alone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-12973) make DU pluggable
[ https://issues.apache.org/jira/browse/HADOOP-12973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238238#comment-15238238 ] Colin Patrick McCabe commented on HADOOP-12973: --- The facts that the tests pass for me locally, that a different subset fails for each JVM, and the error message itself all lead me to conclude that this is a build infrastructure problem, not a patch problem. Committed to trunk, 2.9, and 2.8. Thanks, [~eclark]. > make DU pluggable > - > > Key: HADOOP-12973 > URL: https://issues.apache.org/jira/browse/HADOOP-12973 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12973v0.patch, HADOOP-12973v1.patch, > HADOOP-12973v10.patch, HADOOP-12973v11.patch, HADOOP-12973v12.patch, > HADOOP-12973v13.patch, HADOOP-12973v2.patch, HADOOP-12973v3.patch, > HADOOP-12973v5.patch, HADOOP-12973v6.patch, HADOOP-12973v7.patch, > HADOOP-12973v8.patch, HADOOP-12973v9.patch > > > If people are concerned about replacing the call to DU. Then an easy first > step is to make it pluggable. Then it's possible to replace it with something > while leaving the default alone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HADOOP-12973) make DU pluggable
[ https://issues.apache.org/jira/browse/HADOOP-12973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236389#comment-15236389 ] Colin Patrick McCabe edited comment on HADOOP-12973 at 4/12/16 1:13 AM: Cool. Thanks, [~eclark]. Hmm... TestDU failure looks related. +1 pending fixing that unit test was (Author: cmccabe): Cool. Thanks, [~eclark]. +1 > make DU pluggable > - > > Key: HADOOP-12973 > URL: https://issues.apache.org/jira/browse/HADOOP-12973 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12973v0.patch, HADOOP-12973v1.patch, > HADOOP-12973v10.patch, HADOOP-12973v11.patch, HADOOP-12973v12.patch, > HADOOP-12973v2.patch, HADOOP-12973v3.patch, HADOOP-12973v5.patch, > HADOOP-12973v6.patch, HADOOP-12973v7.patch, HADOOP-12973v8.patch, > HADOOP-12973v9.patch > > > If people are concerned about replacing the call to DU. Then an easy first > step is to make it pluggable. Then it's possible to replace it with something > while leaving the default alone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HADOOP-12973) make DU pluggable
[ https://issues.apache.org/jira/browse/HADOOP-12973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236389#comment-15236389 ] Colin Patrick McCabe edited comment on HADOOP-12973 at 4/12/16 1:11 AM: Cool. Thanks, [~eclark]. +1 was (Author: cmccabe): Cool. Thanks, [~eclark]. +1 pending jenkins. > make DU pluggable > - > > Key: HADOOP-12973 > URL: https://issues.apache.org/jira/browse/HADOOP-12973 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12973v0.patch, HADOOP-12973v1.patch, > HADOOP-12973v10.patch, HADOOP-12973v11.patch, HADOOP-12973v12.patch, > HADOOP-12973v2.patch, HADOOP-12973v3.patch, HADOOP-12973v5.patch, > HADOOP-12973v6.patch, HADOOP-12973v7.patch, HADOOP-12973v8.patch, > HADOOP-12973v9.patch > > > If people are concerned about replacing the call to DU. Then an easy first > step is to make it pluggable. Then it's possible to replace it with something > while leaving the default alone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-12973) make DU pluggable
[ https://issues.apache.org/jira/browse/HADOOP-12973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236389#comment-15236389 ] Colin Patrick McCabe commented on HADOOP-12973: --- Cool. Thanks, [~eclark]. +1 pending jenkins. > make DU pluggable > - > > Key: HADOOP-12973 > URL: https://issues.apache.org/jira/browse/HADOOP-12973 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12973v0.patch, HADOOP-12973v1.patch, > HADOOP-12973v10.patch, HADOOP-12973v11.patch, HADOOP-12973v12.patch, > HADOOP-12973v2.patch, HADOOP-12973v3.patch, HADOOP-12973v5.patch, > HADOOP-12973v6.patch, HADOOP-12973v7.patch, HADOOP-12973v8.patch, > HADOOP-12973v9.patch > > > If people are concerned about replacing the call to DU. Then an easy first > step is to make it pluggable. Then it's possible to replace it with something > while leaving the default alone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-12973) make DU pluggable
[ https://issues.apache.org/jira/browse/HADOOP-12973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233265#comment-15233265 ] Colin Patrick McCabe commented on HADOOP-12973: --- Thanks, [~eclark]. Looks good! The Windows code looks cleaner than before.
{code}
try {
  dfsUsage.close();
} catch (IOException ioe) {
  LOG.warn("Error trying to shutdown GetUsedSpace background thread", ioe);
}
{code}
Can we use {{IOUtils#cleanup}} here? I'm wondering if we could have GetSpaceUsed just be an interface with only one method... {{long getSpace()}} or something like that. A method that just synchronously retrieves the amount of space used, blocking for as long as it takes. Then, we could have another class which does all this thread management and value caching stuff. It seems unrelated to the {{GetSpaceUsed}} interface. Like if I'm implementing {{JNIGetSpaceUsed}}, I don't care about thread management. I just want to implement the method which gets the amount of space used, and leave the thread management the same. I think that's the direction you were going with the {{GetSpaceUsed}} base class, but it feels messy to make the implementation classes reach back up into the base class and play with atomic variables. > make DU pluggable > - > > Key: HADOOP-12973 > URL: https://issues.apache.org/jira/browse/HADOOP-12973 > Project: Hadoop Common > Issue Type: Sub-task >Reporter: Elliott Clark >Assignee: Elliott Clark > Attachments: HADOOP-12973v0.patch, HADOOP-12973v1.patch, > HADOOP-12973v10.patch, HADOOP-12973v11.patch, HADOOP-12973v2.patch, > HADOOP-12973v3.patch, HADOOP-12973v5.patch, HADOOP-12973v6.patch, > HADOOP-12973v7.patch, HADOOP-12973v8.patch, HADOOP-12973v9.patch > > > If people are concerned about replacing the call to DU. Then an easy first > step is to make it pluggable. Then it's possible to replace it with something > while leaving the default alone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
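A sketch of the one-method interface plus composition described above; all names here are illustrative assumptions, not the committed API:

{code}
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch: the interface carries only the blocking measurement; the
// caching and background-refresh machinery lives in a wrapper that owns it.
interface SpaceUsedSource {
  long getSpaceUsed() throws IOException; // synchronous; blocks as long as it takes
}

final class CachingSpaceUsed implements Closeable {
  private final AtomicLong cached = new AtomicLong();
  private final Thread refresher;
  private volatile boolean running = true;

  CachingSpaceUsed(final SpaceUsedSource source, final long refreshMs) {
    refresher = new Thread(() -> {
      while (running) {
        try {
          cached.set(source.getSpaceUsed());
          Thread.sleep(refreshMs);
        } catch (IOException | InterruptedException e) {
          return; // a real implementation would log and retry
        }
      }
    }, "space-used-refresher");
    refresher.setDaemon(true);
    refresher.start(); // started during construction; no separate init()
  }

  long getUsed() {
    return cached.get();
  }

  @Override
  public void close() { // Closeable, so linters flag forgotten shutdowns
    running = false;
    refresher.interrupt();
  }
}
{code}

With this split, something like a JNI-backed implementation only implements {{getSpaceUsed()}} and never touches thread management or atomic variables.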
[jira] [Commented] (HADOOP-11540) Raw Reed-Solomon coder using Intel ISA-L library
[ https://issues.apache.org/jira/browse/HADOOP-11540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233212#comment-15233212 ] Colin Patrick McCabe commented on HADOOP-11540: --- Thanks for your work on this, [~drankye]. It's making a lot of progress, I think. > Raw Reed-Solomon coder using Intel ISA-L library > > > Key: HADOOP-11540 > URL: https://issues.apache.org/jira/browse/HADOOP-11540 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: HDFS-7285 >Reporter: Zhe Zhang >Assignee: Kai Zheng > Attachments: HADOOP-11540-initial.patch, HADOOP-11540-v1.patch, > HADOOP-11540-v10.patch, HADOOP-11540-v2.patch, HADOOP-11540-v4.patch, > HADOOP-11540-v5.patch, HADOOP-11540-v6.patch, HADOOP-11540-v7.patch, > HADOOP-11540-v8.patch, HADOOP-11540-v9.patch, > HADOOP-11540-with-11996-codes.patch, Native Erasure Coder Performance - Intel > ISAL-v1.pdf > > > This is to provide RS codec implementation using Intel ISA-L library for > encoding and decoding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11540) Raw Reed-Solomon coder using Intel ISA-L library
[ https://issues.apache.org/jira/browse/HADOOP-11540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231306#comment-15231306 ] Colin Patrick McCabe commented on HADOOP-11540: --- Thanks, [~drankye]. Good progress here. bq. I agree it will be easier to understand. The only thing I'm not sure about is, there are at least 6 Java coders and 2 x 6 encode/decode functions right now, if adding a loop to reset the list of output buffers to each function, it looks like a major change here. That's why I put the common codes in the abstract class. Hmm. I still think changing the Java coders is the simplest thing to do. It's a tiny amount of code, or should be (calling one function), and simple to understand. bq. How about introducing AbstractJavaRawEncoder/AbstractJavaRawDecoder similar to the native ones for such things, then we can get rid of wantInitOutputs and don't have to change into each Java coders? I don't think this would be a good idea. We need to start thinking about simplifying the inheritance hierarchy and getting rid of some levels. We have too many non-abstract base classes, which makes it difficult to follow. Inheritance should not be used to accomplish code reuse, only to express a genuine is-a relationship. > Raw Reed-Solomon coder using Intel ISA-L library > > > Key: HADOOP-11540 > URL: https://issues.apache.org/jira/browse/HADOOP-11540 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: HDFS-7285 >Reporter: Zhe Zhang >Assignee: Kai Zheng > Attachments: HADOOP-11540-initial.patch, HADOOP-11540-v1.patch, > HADOOP-11540-v10.patch, HADOOP-11540-v2.patch, HADOOP-11540-v4.patch, > HADOOP-11540-v5.patch, HADOOP-11540-v6.patch, HADOOP-11540-v7.patch, > HADOOP-11540-v8.patch, HADOOP-11540-v9.patch, > HADOOP-11540-with-11996-codes.patch, Native Erasure Coder Performance - Intel > ISAL-v1.pdf > > > This is to provide RS codec implementation using Intel ISA-L library for > encoding and decoding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HADOOP-11540) Raw Reed-Solomon coder using Intel ISA-L library
[ https://issues.apache.org/jira/browse/HADOOP-11540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230976#comment-15230976 ] Colin Patrick McCabe commented on HADOOP-11540: --- Thanks, [~drankye].
{code}
+  /**
+   * Convert an output bytes array buffer to direct ByteBuffer.
+   * @param output
+   * @return direct ByteBuffer
+   */
+  protected ByteBuffer convertOutputBuffer(byte[] output, int len) {
+    ByteBuffer directBuffer = ByteBuffer.allocateDirect(len);
+    return directBuffer;
+  }
{code}
Is it intentional that the "output" parameter is ignored here? bq. For initOutputs and resetBuffer, good catch! About this I initially thought as you suggested, instead of having initOutputs, just letting concrete coders to override resetBuffer, which would be most flexible. Then I realized for Java coders, a default behavior can be provided and used; for native coders, we can avoid having it because at the beginning of the encode() call the native coder can memset the output buffers directly. If instead the native coder has to provide resetBuffer, then a JNI function has to be added, which will be called some times to reset output buffers. Considering the overhead in both implementation and extra JNI calls, I used the initOutputs() approach. Thanks for the explanation. Why not just have the encode() function zero the buffer in every case? I don't see why the pure java code benefits from doing this differently-- and it is much simpler to understand if all the coders do it the same way.
{code}
void setCoder(JNIEnv* env, jobject thiz, IsalCoder* pCoder) {
  jclass clazz = (*env)->GetObjectClass(env, thiz);
  jfieldID fid = (*env)->GetFieldID(env, clazz, "nativeCoder", "J");
  (*env)->SetLongField(env, thiz, fid, (jlong) pCoder);
}
{code}
All these functions can fail. You need to check for, and handle their failures. isAllowingChangeInputs, isAllowingVerboseDump: should be {{allowChangeInputs}}, {{allowVerboseDump}} for clarity. > Raw Reed-Solomon coder using Intel ISA-L library > > > Key: HADOOP-11540 > URL: https://issues.apache.org/jira/browse/HADOOP-11540 > Project: Hadoop Common > Issue Type: Sub-task >Affects Versions: HDFS-7285 >Reporter: Zhe Zhang >Assignee: Kai Zheng > Attachments: HADOOP-11540-initial.patch, HADOOP-11540-v1.patch, > HADOOP-11540-v10.patch, HADOOP-11540-v2.patch, HADOOP-11540-v4.patch, > HADOOP-11540-v5.patch, HADOOP-11540-v6.patch, HADOOP-11540-v7.patch, > HADOOP-11540-v8.patch, HADOOP-11540-v9.patch, > HADOOP-11540-with-11996-codes.patch, Native Erasure Coder Performance - Intel > ISAL-v1.pdf > > > This is to provide RS codec implementation using Intel ISA-L library for > encoding and decoding. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
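For the "zero the buffer in every case" suggestion above, a hedged sketch of a single reset step that every {{encode()}} could run up front (the helper name is an assumption):

{code}
import java.nio.ByteBuffer;

// Hedged sketch: encode() zeroes its output buffers the same way for all
// coders, instead of per-coder initOutputs() logic.
final class OutputReset {
  static void zeroOutputs(ByteBuffer[] outputs) {
    for (ByteBuffer b : outputs) {
      int pos = b.position();
      b.put(new byte[b.remaining()]); // a fresh array is already zero-filled
      b.position(pos);                // restore position for the coder to write
    }
  }
}
{code}

This works uniformly for heap and direct buffers, at the cost of a temporary array; the JNI coders can simply memset their buffers instead, as noted above.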
[jira] [Comment Edited] (HADOOP-12973) make DU pluggable
[ https://issues.apache.org/jira/browse/HADOOP-12973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230762#comment-15230762 ] Colin Patrick McCabe edited comment on HADOOP-12973 at 4/7/16 6:32 PM: ---

bq. It makes it more obvious, when someone overrides the class, where things are.

Hmm. How about making the class {{final}} instead?

Re: {{DU}} versus {{WindowsDU}}. If you really want to separate the classes, I don't object, but I don't want {{WindowsDU}} to be a subclass of the Linux {{DU}}. That is just weird.

bq. Shutdown is needed. So it's very strange to have a shutdown without a start.

There is a start-- in {{GetSpaceUsedBuilder}}. Having an "init" method that you have to call after construction is an anti-pattern. There is no reason why the user should have to care whether the implementation contains a thread or not-- many implementations won't need a thread. The fact that not all subclasses need threads is a good sign that thread management doesn't belong in the common interface.

I'm also curious how you feel about the idea of making the interface {{Closeable}}, as we've done with many other interfaces such as {{FailoverProxyProvider}}, {{ServicePlugin}}, {{BlockReader}}, {{Peer}}, {{PeerServer}}, {{FsVolumeReference}}, etc. The compiler and various linters warn about failures to close {{Closeable}} objects in many cases, but not about failures to call custom shutdown functions. (A rough sketch of this shape follows this message.)

was (Author: cmccabe):

bq. It makes it more obvious, when someone overrides the class, where things are.

Hmm. How about making the class {{final}} instead?

Re: {{DU}} versus {{WindowsDU}}. If you really want to separate the classes, I don't object, but I don't want {{WindowsDU}} to be a subclass of the Linux {{DU}}. That is just weird.

bq. Shutdown is needed. So it's very strange to have a shutdown without a start.

There is a start-- in {{GetSpaceUsedBuilder}}. Having an "init" method that you have to call after construction is an anti-pattern. There is no reason why the user should have to care whether the implementation contains a thread or not-- many implementations won't need a thread. The fact that not all subclasses need threads is a good sign that thread management doesn't belong in the common interface.

I'm also curious how you feel about the idea of making the interface {{Closeable}}, as we've done with many other interfaces such as {{FailoverProxyProvider}}, {{ServicePlugin}}, {{BlockReader}}, {{Peer}}, {{PeerServer}}, {{FsVolumeReference}}, etc.

> make DU pluggable
> -----------------
>
>                 Key: HADOOP-12973
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12973
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Elliott Clark
>            Assignee: Elliott Clark
>         Attachments: HADOOP-12973v0.patch, HADOOP-12973v1.patch, HADOOP-12973v10.patch, HADOOP-12973v2.patch, HADOOP-12973v3.patch, HADOOP-12973v5.patch, HADOOP-12973v6.patch, HADOOP-12973v7.patch, HADOOP-12973v8.patch, HADOOP-12973v9.patch
>
>
> If people are concerned about replacing the call to DU, an easy first step is to make it pluggable. Then it's possible to replace it with something else while leaving the default alone.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
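A rough sketch of the shape argued for above: construction via the builder is the "start", and {{Closeable}} is the shutdown. All names and bodies here are illustrative assumptions, not the committed HADOOP-12973 API.

{code}
import java.io.Closeable;
import java.io.File;
import java.io.IOException;

// The interface carries no start/init method; whatever setup an
// implementation needs happens during build().
interface GetSpaceUsed extends Closeable {
  long getUsed() throws IOException;
}

class GetSpaceUsedBuilder {
  private File path;

  GetSpaceUsedBuilder setPath(File path) {
    this.path = path;
    return this;
  }

  GetSpaceUsed build() throws IOException {
    final File p = path;
    // A caching implementation would start its refresh thread here, fully
    // constructed; this threadless stand-in just computes on demand.
    return new GetSpaceUsed() {
      @Override
      public long getUsed() {
        return p.getTotalSpace() - p.getUsableSpace(); // crude stand-in for du
      }

      @Override
      public void close() {
        // Nothing to stop in the threadless variant.
      }
    };
  }
}
{code}

Because the interface extends {{Closeable}}, callers can manage instances with try-with-resources, which is exactly what lets the compiler and linters flag a missed close().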