[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048953#comment-14048953 ] Hudson commented on HDFS-6591: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1818 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1818/]) HDFS-6591. while loop is executed tens of thousands of times in Hedged Read. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1606927) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Fix For: 3.0.0, 2.5.0 > > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591-v4.txt, > HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048859#comment-14048859 ] Hudson commented on HDFS-6591: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1791 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1791/]) HDFS-6591. while loop is executed tens of thousands of times in Hedged Read. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1606927) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Fix For: 3.0.0, 2.5.0 > > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591-v4.txt, > HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048767#comment-14048767 ] Hudson commented on HDFS-6591: -- FAILURE: Integrated in Hadoop-Yarn-trunk #600 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/600/]) HDFS-6591. while loop is executed tens of thousands of times in Hedged Read. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1606927) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Fix For: 3.0.0, 2.5.0 > > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591-v4.txt, > HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048157#comment-14048157 ] Hudson commented on HDFS-6591: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5802 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5802/]) HDFS-6591. while loop is executed tens of thousands of times in Hedged Read. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1606927) * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java * /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestPread.java > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Fix For: 3.0.0, 2.5.0 > > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591-v4.txt, > HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047682#comment-14047682 ] Hadoop QA commented on HDFS-6591: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653132/HDFS-6591-v4.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7250//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7250//console This message is automatically generated. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591-v4.txt, > HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047552#comment-14047552 ] Liang Xie commented on HDFS-6591: - To me, it should not a showstopper for current jira so far. what do you think [~cnauroth] ? > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591-v4.txt, > HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047549#comment-14047549 ] Liang Xie commented on HDFS-6591: - Attached v4 should address [~andrew.purt...@gmail.com]'s comments. [~cnauroth], after a looking, it seems that it's a convention that adding "future.get()" after "CompletionService#poll", even Future, see this example: https://github.com/apache/hbase/blob/e4138a3a9408193e90dd6cc0db6d61320e370479/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/snapshot/RegionServerSnapshotManager.java#L319 I grep lots of open source code, all have this style, even don't need the result for further processing:) I didn't read through JDK FutureTask related impl so far, but my thought is that the JavaDoc for "CompletionService#poll" probably is not correct enough. Also i replaced "future.get" with "Thread.sleep(xx)", still got failed result, so it should not related with: "so this might indicate a race condition that had been masked by the future.get() taking up a few extra cycles", per my understanding so far. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591-v4.txt, > HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046877#comment-14046877 ] Chris Nauroth commented on HDFS-6591: - bq. ...the final result should be same due to that two threads have read out a same source result. I'd think the same thing too, but then there is a question of whether or not the buffer copy would be atomic or if it could somehow put the buffer into an inconsistent state temporarily. Anyway, that was just my initial theory. Maybe you'll find a different root cause when you take a look. :-) > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591.txt, > LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046875#comment-14046875 ] Liang Xie commented on HDFS-6591: - [~andrew.purt...@gmail.com], good idea! [~cnauroth], will dive into it definitely, nice finding! so that memory copy optimization is immature ? :) i will rethink it later. (I assumed even two threads mutate the buffer in a face condition, the final result should be same due to that two threads have read out a same source result.) > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591.txt, > LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046193#comment-14046193 ] Chris Nauroth commented on HDFS-6591: - I agree with Andrew. The problem with publishing a metric is that technically it becomes part of our contract. If tools start querying the metric (even accidentally), then removing it in a later version is backwards-incompatible. I'm holding off on committing anyway, because I may have found another problem. I wanted to suggest removing the {{future.get()}} from here: {code} if (future != null) { future.get(); return; } {code} {{CompletionService#poll}} guarantees that the returned task (if not null) has completed. We don't need the result of the task, so the {{get}} should be unnecessary. However, when I remove that line, I start getting a test failure in {{TestPread#testPreadDFS}}: {code} testPreadDFS(org.apache.hadoop.hdfs.TestPread) Time elapsed: 2.127 sec <<< FAILURE! java.lang.AssertionError: Pread Datanode Restart Test byte 0 differs. expected 47 actual 0 expected:<0> but was:<47> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.hdfs.TestPread.checkAndEraseData(TestPread.java:92) at org.apache.hadoop.hdfs.TestPread.datanodeRestartTest(TestPread.java:225) at org.apache.hadoop.hdfs.TestPread.dfsPreadTest(TestPread.java:439) at org.apache.hadoop.hdfs.TestPread.testPreadDFS(TestPread.java:250) {code} I've isolated the problem to 2 tests running in sequence: {{testMaxOutHedgedReadPool}} followed by {{testPreadDFS}}. Removing {{future.get()}} shouldn't make a difference, so this might indicate a race condition that had been masked by the {{future.get()}} taking up a few extra cycles. Something about the way {{testMaxOutHedgedReadPool}} fills the thread pool seems to set up the timing conditions just right to trigger the test failure. I'm not yet certain what's causing this, but I have a theory. The return value of {{getFromOneDataNode}} gets submitted as a {{Future}}. As you pointed out, that code mutates the buffer that was passed in. If we've already returned to the caller, and then a background task lands late and starts mutating the buffer, then we could see unexpected results. We do cancel the unstarted tasks, but we don't interrupt them if they're already running. Even if we did interrupt them, I don't think we could guarantee interruption before it mutates the buffer. Let me know your thoughts on this. Thanks again for working on this tricky patch! > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591.txt, > LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046146#comment-14046146 ] Andrew Purtell commented on HDFS-6591: -- {quote} bq. Is it actually useful to add the new metric for HedgedReadOpsLoopNum During debugging, i feel it's very convenient to catch/verify some unnecessary loops, after a quick thinking, i didn't find any other easier choice to catch such case. How about keeping it there. when we find it's a hotspot or this feature is stable enough, then remove it ? (I can add a comment in v3 patch, so we'll easy to recall those stuff) {quote} Would it be useful to keep around a @VisibleForTesting counter instead and a unit test that prints the value of that counter periodically during the run? Because the metric 'HedgedReadOpsLoopNum' has no utility for operations. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591.txt, > LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045632#comment-14045632 ] Hadoop QA commented on HDFS-6591: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12652746/HDFS-6591-v3.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7237//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7237//console This message is automatically generated. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591-v3.txt, HDFS-6591.txt, > LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045491#comment-14045491 ] Liang Xie commented on HDFS-6591: - bq. so it appears that we never copy to buf in this code path. I only see a copy to buf inside the else path, when a hedged read actually happens. Did I miss something? Because in the if path, we passed this byte[] bb (ByteBuffer.wrap(buf, offset, len)), it came from the original end-user input parameter. that's why we don't need to do a explicit copy:) bq. Let's set isHedgedRead = false in a JUnit After method instead of at the end of individual methods. seems you read "HDFS-6591.txt" patch, not "HDFS-6591-v2.txt" ?:) i had moved it into a JUnit Before method per [~ajisakaa]'s suggestion, i assume it has a similar effect like moving into a After method, right? Please correct me if i'm wrong. bq. Let's guarantee that output and input get closed. We can put the cleanup in a finally block and call IOUtils#cleanup. yes, let me do it in patch v3. bq. Is it actually useful to add the new metric for HedgedReadOpsLoopNum During debugging, i feel it's very convenient to catch/verify some unnecessary loops, after a quick thinking, i didn't find any other easier choice to catch such case. How about keeping it there. when we find it's a hotspot or this feature is stable enough, then remove it ? (I can add a comment in v3 patch, so we'll easy to recall those stuff) > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591.txt, > LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044929#comment-14044929 ] Chris Nauroth commented on HDFS-6591: - [~liulei.cn], thank you for the bug report. [~xieliang007], nice work on the patch. This is a pretty tricky one! I think the logic is easier to follow now using {{CompletionService}} instead of {{CountDownLatch}}. I have a few comments and questions. # {{DFSInputStream#hedgedFetchBlockByteRange}}: I have a question on the following code, which runs when there is no need for a hedged read. (The initial read completes before the hedged read timeout.) {code} future = hedgedService.poll(dfsClient.getHedgedReadTimeout(), TimeUnit.MILLISECONDS); if (future != null) { future.get(); return; } {code} The return value from {{future.get()}} is discarded, so it appears that we never copy to {{buf}} in this code path. I only see a copy to {{buf}} inside the else path, when a hedged read actually happens. Did I miss something? # {{DFSHedgedReadMetrics}}: Is it actually useful to add the new metric for HedgedReadOpsLoopNum, or was it just helpful for catching this bug? If it's not going to be useful in general, then maybe we should look for a different way to accommodate the test rather than publishing a new metric. # {{TestPread}}: Let's set {{isHedgedRead = false}} in a JUnit {{After}} method instead of at the end of individual methods. That way, if one test throws an exception and leaves the flag on by mistake, it won't disrupt the other tests in the same suite. Also, it seems this was done wrong for {{testMaxOutHedgedReadPool}}, which is setting it to {{true}} at the end. Using an {{After}} method would prevent those kinds of problems. # {{TestPread#testHedgedReadLoopTooManyTimes}}: Let's guarantee that {{output}} and {{input}} get closed. We can put the cleanup in a finally block and call {{IOUtils#cleanup}}. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591.txt, > LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044419#comment-14044419 ] Liang Xie commented on HDFS-6591: - The failed TestCheckpoint case is not related with current issue. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591.txt, > LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044393#comment-14044393 ] Hadoop QA commented on HDFS-6591: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12652545/HDFS-6591-v2.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.TestCheckpoint {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7235//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7235//console This message is automatically generated. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591.txt, > LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044321#comment-14044321 ] Liang Xie commented on HDFS-6591: - Attached v2 should address the above comments. Thank you [~stack] [~ajisakaa] for reviewing ! > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591-v2.txt, HDFS-6591.txt, > LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044172#comment-14044172 ] Akira AJISAKA commented on HDFS-6591: - Nice fix to me. In TestPread.java, {code} } isHedgedRead = true; } {code} Would you please create {{@Before}} class and initialize variables there instead of setting at the last of {{@Test}} class like above? Minor nits: 1. In DFSInputStream.java:1107, {code} Future future = null; {code} Now that {{future}} is not used in the else clause, would you move the declaration into the try-catch clause? 2. There is a trailing white space in {code} +CompletionService hedgedService = {code} > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044093#comment-14044093 ] stack commented on HDFS-6591: - Nice test and added metrics [~xieliang007] Looks good on first pass. Let me give it another pass > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043443#comment-14043443 ] Hadoop QA commented on HDFS-6591: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12652383/HDFS-6591.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7231//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7231//console This message is automatically generated. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043433#comment-14043433 ] Hadoop QA commented on HDFS-6591: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12652383/HDFS-6591.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.balancer.TestBalancer {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7232//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7232//console This message is automatically generated. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043411#comment-14043411 ] Hadoop QA commented on HDFS-6591: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12652381/HDFS-6591.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.TestCacheDirectives {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7230//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7230//console This message is automatically generated. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043285#comment-14043285 ] Liang Xie commented on HDFS-6591: - Retry. After a debugging, showed a rare race. the CountDownLatch is inside Callable, but there's no guarantee: when a countDown happened, then one of tasks has done. see: http://stackoverflow.com/questions/9604713/future-isdone-returns-false-even-if-the-task-is-done . really tricky... I rewrote the sync related code in the latest patch. and passed all the TestPread case in a shell loop:) > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042067#comment-14042067 ] Hadoop QA commented on HDFS-6591: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12652173/HDFS-6591.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestPread {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7223//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/7223//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7223//console This message is automatically generated. > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: HDFS-6591.txt, LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6591) while loop is executed tens of thousands of times in Hedged Read
[ https://issues.apache.org/jira/browse/HDFS-6591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041856#comment-14041856 ] Liang Xie commented on HDFS-6591: - Yes, it's a valid bug report, just verified on my box.[~cnauroth] The fix from HDFS-6231 has a minor side effect(the original behavior is countDown only after a successful actualGetFromOneDataNode call): since we moved the latch to finaly statement. then once a actualGetFromOneDataNode() failed, we still fire "latch.countDown();" , then getFirstToComplete() will pass the "latch.await" after that, then goes to "thorw new InterruptedException", so in hedgedFetchBlockByteRange(), we will continue to try to do getBestNodeDNAddrPair(), and in the test case, the replica is two and due to the previous failed actualGetFromOneDataNode() so one dn had been added into deadnode list:) so it looks like the loop in hedgedFetchBlockByteRange() is: {code} while(true) { getBestNodeDNAddrPair(); getFirstToComplete(); catch exception... } {code} I'll make a patch soon, thank you [~liulei.cn] ! > while loop is executed tens of thousands of times in Hedged Read > -- > > Key: HDFS-6591 > URL: https://issues.apache.org/jira/browse/HDFS-6591 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs-client >Affects Versions: 2.4.0 >Reporter: LiuLei >Assignee: Liang Xie > Attachments: LoopTooManyTimesTestCase.patch > > > I download hadoop-2.4.1-rc1 code from > http://people.apache.org/~acmurthy/hadoop-2.4.1-rc1/, I test the Hedged > Read. I find the while loop in hedgedFetchBlockByteRange method is executed > tens of thousands of times. -- This message was sent by Atlassian JIRA (v6.2#6252)