[jira] [Commented] (HBASE-10570) Allow overrides of Surefire secondPartForkMode and testFailureIgnore
[ https://issues.apache.org/jira/browse/HBASE-10570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906021#comment-13906021 ] Nicolas Liochon commented on HBASE-10570: - +1 Allow overrides of Surefire secondPartForkMode and testFailureIgnore Key: HBASE-10570 URL: https://issues.apache.org/jira/browse/HBASE-10570 Project: HBase Issue Type: Improvement Reporter: Andrew Purtell Assignee: Andrew Purtell Priority: Minor Fix For: 0.96.2, 0.98.1, 0.99.0 Attachments: 10570.patch It can be useful to override the fork and ignore behavior of our Surefire unit testing machinery. For example, this will put only one test into each generated surefire jar and always fork a child JVM to run it: {noformat} mvn test ... -Dsurefire.firstPartForkMode=always -Dsurefire.secondPartForkMode=always {noformat} Useful for isolating test failures and finding zombies. Sometimes it is also useful to ignore test failures: {noformat} mvn test ... -Dsurefire.testFailureIgnore=true {noformat} This will let test suite execution continue even if a test times out. Maven has CLI options for this sort of thing but they don't always work as expected. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10516) Refactor code where Threads.sleep is called within a while/for loop
[ https://issues.apache.org/jira/browse/HBASE-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903913#comment-13903913 ] Nicolas Liochon commented on HBASE-10516: - I'm ok with the new patch. Thanks! I didn't see the DeleteTableHandler / AssignmentManager / LruBlockCache / ZooKeeperWatcher changes yesterday. DeleteTableHandler = I think we can just throw an IOE exception here, it will be as if the loop timeouted. (see the {code} throw new IOException(Waited hbase.master.wait.on.region ( + waitTime + ms) for region to leave region + region.getRegionNameAsString() + in transitions); {code} AssignmentManager = It's not really managed today: the IE is transformed in a RuntimeException. The patch will increase the probability. I would propose to explicitly throw IE here from the method here. In any case, it's a risky patch, but throwing an exception seems clearer. LruBlockCache We don't need to continue to loop here. We can stop looping at the first interrupt (and may be we should restore the status?) ZooKeeperWatcher We can manage the interrupt as a timeout (so just throwing a Runtime: today we throw a NPE (!) on timeout, we can do the same or a little bit cleaner). Sorry for missing this part yesterday. I will have a look at the other patshes as well. Basically, I think it's better to manage the exception when we do the patch, because anyway, we need to think about the impact when we do the default strategy of restoring the status. And throwing and exception is often better, and requires less code, so... Refactor code where Threads.sleep is called within a while/for loop --- Key: HBASE-10516 URL: https://issues.apache.org/jira/browse/HBASE-10516 Project: HBase Issue Type: Bug Components: Client, master, regionserver Reporter: Feng Honghua Assignee: Feng Honghua Attachments: HBASE-10516-trunk_v1.patch, HBASE-10516-trunk_v2.patch Threads.sleep implementation: {code} public static void sleep(long millis) { try { Thread.sleep(millis); } catch (InterruptedException e) { e.printStackTrace(); Thread.currentThread().interrupt(); } } {code} From above implementation, the current thread's interrupt status is restored/reset when InterruptedException is caught and handled. If this method is called within a while/for loop, if a first InterruptedException is thrown during sleep, it will make the Threads.sleep in next loop immediately throw InterruptedException without expected sleep. This behavior breaks the intention for independent sleep in each loop I mentioned above in HBASE-10497 and this jira is created to handle it separately per [~nkeywal]'s suggestion -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Reopened] (HBASE-10539) HRegion.addAndGetGlobalMemstoreSize() is expected to return the new memstore size after added, but actually the previous size before added is returned instead
[ https://issues.apache.org/jira/browse/HBASE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon reopened HBASE-10539: - HRegion.addAndGetGlobalMemstoreSize() is expected to return the new memstore size after added, but actually the previous size before added is returned instead -- Key: HBASE-10539 URL: https://issues.apache.org/jira/browse/HBASE-10539 Project: HBase Issue Type: Bug Components: regionserver Reporter: Feng Honghua Assignee: Feng Honghua Attachments: HBASE-10539-trunk_v1.patch HRegion.addAndGetGlobalMemstoreSize(addedSize) is called once some write succeeds and 'addedSize' is the size of the edits newly put to the memstore, the returned value of HRegion.addAndGetGlobalMemstoreSize(addedSize) is then checked against the flush threshold to determine if a flush for the region should be triggered. By design the returned value should be the updated memstore size after adding 'addedSize', but current implementation uses this.memstoreSize.getAndAdd which returns the previous size before adding, actually 'addAndGet' rather than 'getAndAdd' should be used here. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10539) HRegion.addAndGetGlobalMemstoreSize() is expected to return the new memstore size after added, but actually the previous size before added is returned instead
[ https://issues.apache.org/jira/browse/HBASE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903920#comment-13903920 ] Nicolas Liochon commented on HBASE-10539: - Reopening, per comments above :-) HRegion.addAndGetGlobalMemstoreSize() is expected to return the new memstore size after added, but actually the previous size before added is returned instead -- Key: HBASE-10539 URL: https://issues.apache.org/jira/browse/HBASE-10539 Project: HBase Issue Type: Bug Components: regionserver Reporter: Feng Honghua Assignee: Feng Honghua Attachments: HBASE-10539-trunk_v1.patch HRegion.addAndGetGlobalMemstoreSize(addedSize) is called once some write succeeds and 'addedSize' is the size of the edits newly put to the memstore, the returned value of HRegion.addAndGetGlobalMemstoreSize(addedSize) is then checked against the flush threshold to determine if a flush for the region should be triggered. By design the returned value should be the updated memstore size after adding 'addedSize', but current implementation uses this.memstoreSize.getAndAdd which returns the previous size before adding, actually 'addAndGet' rather than 'getAndAdd' should be used here. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10516) Refactor code where Threads.sleep is called within a while/for loop
[ https://issues.apache.org/jira/browse/HBASE-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903994#comment-13903994 ] Nicolas Liochon commented on HBASE-10516: - bq. Should we treat it case-by-case, or per-class, or uniformly for all occurrences? There is a general strategy, then the exceptions :-). I think we agree on the general strategy, let me formulate it to confirm: - never never never shallow IE - if possible, stop the current task, if not possible, continue, and manage the interruption when the non interruptible task has been done - try to throw a clear exception (IE or IOE), if not restore the status catch and rethrow a RuntimeException instead is generally bad. It means that you do not accept to be interrupted. So we should not add such kind of code. But in such a code void f(){ partOne(); //shallows exception partTwo(); // throw Runtime on Exception } It's acceptable, imho, to change the part one to make it throws the same runtime as the part 2. Refactor code where Threads.sleep is called within a while/for loop --- Key: HBASE-10516 URL: https://issues.apache.org/jira/browse/HBASE-10516 Project: HBase Issue Type: Bug Components: Client, master, regionserver Reporter: Feng Honghua Assignee: Feng Honghua Attachments: HBASE-10516-trunk_v1.patch, HBASE-10516-trunk_v2.patch Threads.sleep implementation: {code} public static void sleep(long millis) { try { Thread.sleep(millis); } catch (InterruptedException e) { e.printStackTrace(); Thread.currentThread().interrupt(); } } {code} From above implementation, the current thread's interrupt status is restored/reset when InterruptedException is caught and handled. If this method is called within a while/for loop, if a first InterruptedException is thrown during sleep, it will make the Threads.sleep in next loop immediately throw InterruptedException without expected sleep. This behavior breaks the intention for independent sleep in each loop I mentioned above in HBASE-10497 and this jira is created to handle it separately per [~nkeywal]'s suggestion -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10516) Refactor code where Threads.sleep is called within a while/for loop
[ https://issues.apache.org/jira/browse/HBASE-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904012#comment-13904012 ] Nicolas Liochon commented on HBASE-10516: - ok :-) Refactor code where Threads.sleep is called within a while/for loop --- Key: HBASE-10516 URL: https://issues.apache.org/jira/browse/HBASE-10516 Project: HBase Issue Type: Bug Components: Client, master, regionserver Reporter: Feng Honghua Assignee: Feng Honghua Attachments: HBASE-10516-trunk_v1.patch, HBASE-10516-trunk_v2.patch Threads.sleep implementation: {code} public static void sleep(long millis) { try { Thread.sleep(millis); } catch (InterruptedException e) { e.printStackTrace(); Thread.currentThread().interrupt(); } } {code} From above implementation, the current thread's interrupt status is restored/reset when InterruptedException is caught and handled. If this method is called within a while/for loop, if a first InterruptedException is thrown during sleep, it will make the Threads.sleep in next loop immediately throw InterruptedException without expected sleep. This behavior breaks the intention for independent sleep in each loop I mentioned above in HBASE-10497 and this jira is created to handle it separately per [~nkeywal]'s suggestion -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10539) HRegion.addAndGetGlobalMemstoreSize() is expected to return the new memstore size after added, but actually the previous size before added is returned instead
[ https://issues.apache.org/jira/browse/HBASE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904028#comment-13904028 ] Nicolas Liochon commented on HBASE-10539: - Yes :-). As well, could you set the versions (affects fix). Fix is especially important if you want the fix to be committed in non trunk versions. HRegion.addAndGetGlobalMemstoreSize() is expected to return the new memstore size after added, but actually the previous size before added is returned instead -- Key: HBASE-10539 URL: https://issues.apache.org/jira/browse/HBASE-10539 Project: HBase Issue Type: Bug Components: regionserver Reporter: Feng Honghua Assignee: Feng Honghua Attachments: HBASE-10539-trunk_v1.patch HRegion.addAndGetGlobalMemstoreSize(addedSize) is called once some write succeeds and 'addedSize' is the size of the edits newly put to the memstore, the returned value of HRegion.addAndGetGlobalMemstoreSize(addedSize) is then checked against the flush threshold to determine if a flush for the region should be triggered. By design the returned value should be the updated memstore size after adding 'addedSize', but current implementation uses this.memstoreSize.getAndAdd which returns the previous size before adding, actually 'addAndGet' rather than 'getAndAdd' should be used here. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10525: Status: Open (was: Patch Available) Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch, HBaseclient-EventualConsistency.pdf This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10525: Attachment: 10525.v2.patch Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch, 10525.v2.patch, HBaseclient-EventualConsistency.pdf This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13904313#comment-13904313 ] Nicolas Liochon commented on HBASE-10525: - v2 comments above taken into account htace hooks added (is that the right way to do that, [~eclark] ?) optimization / race condition fixed in cleanupCalls simplification in waitForWork I will push a v3 with the option set to false to get an hadoop qa run with the two settings Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch, HBaseclient-EventualConsistency.pdf This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10525: Attachment: (was: 10525.v2.patch) Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch, HBaseclient-EventualConsistency.pdf This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10525: Status: Patch Available (was: Open) Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch, HBaseclient-EventualConsistency.pdf This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10548) PerfEval work around wrong runtime dependency version
[ https://issues.apache.org/jira/browse/HBASE-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903239#comment-13903239 ] Nicolas Liochon commented on HBASE-10548: - We can commit the change, but there is something I don't get, at least for trunk (I haven't checked 0.98 0.96): hbase pom seems clean, and does require common:math:2.2, whatever the hadoop version. How comes we're finally installed with the 2.1 version? I mean, we don't care for this specific case, but if HBase requires the 2.2 version, it should not be deployed with an older one. Here the error message is clear and it's in the test code, but it could be less clear and in the production code... PerfEval work around wrong runtime dependency version - Key: HBASE-10548 URL: https://issues.apache.org/jira/browse/HBASE-10548 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.96.2, 0.98.1, 0.99.0 Reporter: Nick Dimiduk Assignee: Nick Dimiduk Priority: Minor Attachments: HBASE-10548.00.patch From my [comment|https://issues.apache.org/jira/browse/HBASE-10511?focusedCommentId=13902238page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13902238] on HBASE-10511: I have hadoop-1.2.1 installed from tgz, which packages commons-math-2.1. This is *different* from the listed maven dependency, 2.2. {noformat} $ tar tvf hadoop-1.2.1.tar.gz | grep commons-math -rw-rw-r-- 0 0 0 832410 Jul 22 2013 hadoop-1.2.1/lib/commons-math-2.1.jar $ mvn -f pom.xml.hadoop1 dependency:tree | grep commons-math [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile (version managed from 2.1) [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile (version managed from 2.1) [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile (version managed from 2.1) [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile (version managed from 2.1) [INFO] +- org.apache.commons:commons-math:jar:2.2:compile [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile [INFO] +- org.apache.commons:commons-math:jar:2.2:compile [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile [INFO] +- org.apache.commons:commons-math:jar:2.2:compile [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile {noformat} This is a problem because the 2.1 version of [DescriptiveStatistics|http://commons.apache.org/proper/commons-math/javadocs/api-2.1/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html] doesn't have a double[] constructor. Running the MR job, mappers fail: {noformat} java.lang.NoSuchMethodError: org.apache.commons.math.stat.descriptive.DescriptiveStatistics.init([D)V at org.apache.hadoop.hbase.PerformanceEvaluation$RandomReadTest.testTakedown(PerformanceEvaluation.java:1163) at org.apache.hadoop.hbase.PerformanceEvaluation$Test.test(PerformanceEvaluation.java:984) at org.apache.hadoop.hbase.PerformanceEvaluation.runOneClient(PerformanceEvaluation.java:1401) at org.apache.hadoop.hbase.PerformanceEvaluation$EvaluationMapTask.map(PerformanceEvaluation.java:522) at org.apache.hadoop.hbase.PerformanceEvaluation$EvaluationMapTask.map(PerformanceEvaluation.java:474) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10548) PerfEval work around wrong runtime dependency version
[ https://issues.apache.org/jira/browse/HBASE-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903292#comment-13903292 ] Nicolas Liochon commented on HBASE-10548: - bq. no sure what it means maven sees that hadoop asks for 2.1 while hbase asks for 2.2, so it builds with 2.2 (the newest version). That's maven doing its job. So we should have the 2.2 when we deploy. If we don't have this, it means that we don't use the expected versions of our third parties when they are different than the hadoop ones. PerfEval work around wrong runtime dependency version - Key: HBASE-10548 URL: https://issues.apache.org/jira/browse/HBASE-10548 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.96.2, 0.98.1, 0.99.0 Reporter: Nick Dimiduk Assignee: Nick Dimiduk Priority: Minor Attachments: HBASE-10548.00.patch From my [comment|https://issues.apache.org/jira/browse/HBASE-10511?focusedCommentId=13902238page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13902238] on HBASE-10511: I have hadoop-1.2.1 installed from tgz, which packages commons-math-2.1. This is *different* from the listed maven dependency, 2.2. {noformat} $ tar tvf hadoop-1.2.1.tar.gz | grep commons-math -rw-rw-r-- 0 0 0 832410 Jul 22 2013 hadoop-1.2.1/lib/commons-math-2.1.jar $ mvn -f pom.xml.hadoop1 dependency:tree | grep commons-math [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile (version managed from 2.1) [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile (version managed from 2.1) [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile (version managed from 2.1) [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile (version managed from 2.1) [INFO] +- org.apache.commons:commons-math:jar:2.2:compile [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile [INFO] +- org.apache.commons:commons-math:jar:2.2:compile [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile [INFO] +- org.apache.commons:commons-math:jar:2.2:compile [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile [INFO] | +- org.apache.commons:commons-math:jar:2.2:compile {noformat} This is a problem because the 2.1 version of [DescriptiveStatistics|http://commons.apache.org/proper/commons-math/javadocs/api-2.1/org/apache/commons/math/stat/descriptive/DescriptiveStatistics.html] doesn't have a double[] constructor. Running the MR job, mappers fail: {noformat} java.lang.NoSuchMethodError: org.apache.commons.math.stat.descriptive.DescriptiveStatistics.init([D)V at org.apache.hadoop.hbase.PerformanceEvaluation$RandomReadTest.testTakedown(PerformanceEvaluation.java:1163) at org.apache.hadoop.hbase.PerformanceEvaluation$Test.test(PerformanceEvaluation.java:984) at org.apache.hadoop.hbase.PerformanceEvaluation.runOneClient(PerformanceEvaluation.java:1401) at org.apache.hadoop.hbase.PerformanceEvaluation$EvaluationMapTask.map(PerformanceEvaluation.java:522) at org.apache.hadoop.hbase.PerformanceEvaluation$EvaluationMapTask.map(PerformanceEvaluation.java:474) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) {noformat} -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10516) Refactor code where Threads.sleep is called within a while/for loop
[ https://issues.apache.org/jira/browse/HBASE-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903302#comment-13903302 ] Nicolas Liochon commented on HBASE-10516: - sleepBeforeRetry = sure createDir = why not thowing an InterrupedIOException? rename = same deleteDir = same HRegionServer#waitOnAllRegionsToClose = I suppose it's the best option For InterrupedIOException, my *feeling* is that we can already receive these exceptions today (as we're going network disk operations, they are likely to check the interrupt status already, or may do it in the future). Refactor code where Threads.sleep is called within a while/for loop --- Key: HBASE-10516 URL: https://issues.apache.org/jira/browse/HBASE-10516 Project: HBase Issue Type: Bug Components: Client, master, regionserver Reporter: Feng Honghua Assignee: Feng Honghua Attachments: HBASE-10516-trunk_v1.patch Threads.sleep implementation: {code} public static void sleep(long millis) { try { Thread.sleep(millis); } catch (InterruptedException e) { e.printStackTrace(); Thread.currentThread().interrupt(); } } {code} From above implementation, the current thread's interrupt status is restored/reset when InterruptedException is caught and handled. If this method is called within a while/for loop, if a first InterruptedException is thrown during sleep, it will make the Threads.sleep in next loop immediately throw InterruptedException without expected sleep. This behavior breaks the intention for independent sleep in each loop I mentioned above in HBASE-10497 and this jira is created to handle it separately per [~nkeywal]'s suggestion -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10355) Failover RPC's from client using region replicas
[ https://issues.apache.org/jira/browse/HBASE-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13903414#comment-13903414 ] Nicolas Liochon commented on HBASE-10355: - v2: - continue when it can't get a region, or if the number of replicas is below expectation - compiles with jdk 1.6 Failover RPC's from client using region replicas Key: HBASE-10355 URL: https://issues.apache.org/jira/browse/HBASE-10355 Project: HBase Issue Type: Sub-task Components: Client Reporter: Enis Soztutar Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10355.v1.patch, 10355.v2.patch -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10355) Failover RPC's from client using region replicas
[ https://issues.apache.org/jira/browse/HBASE-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10355: Attachment: 10355.v2.patch Failover RPC's from client using region replicas Key: HBASE-10355 URL: https://issues.apache.org/jira/browse/HBASE-10355 Project: HBase Issue Type: Sub-task Components: Client Reporter: Enis Soztutar Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10355.v1.patch, 10355.v2.patch -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10511) Add latency percentiles on PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13901211#comment-13901211 ] Nicolas Liochon commented on HBASE-10511: - Thanks, Enis. Add latency percentiles on PerformanceEvaluation Key: HBASE-10511 URL: https://issues.apache.org/jira/browse/HBASE-10511 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.96.2, 0.98.1, 0.99.0, hbase-10070 Attachments: 10511.v1.patch The latency is reported as an array or float. It's easier to deal with percentiles :-) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10516) Refactor code where Threads.sleep is called within a while/for loop
[ https://issues.apache.org/jira/browse/HBASE-10516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13901213#comment-13901213 ] Nicolas Liochon commented on HBASE-10516: - It seems ok to me, but give me some time to review it in details (for today or Monday: I'm off this week end). Refactor code where Threads.sleep is called within a while/for loop --- Key: HBASE-10516 URL: https://issues.apache.org/jira/browse/HBASE-10516 Project: HBase Issue Type: Bug Components: Client, master, regionserver Reporter: Feng Honghua Assignee: Feng Honghua Attachments: HBASE-10516-trunk_v1.patch Threads.sleep implementation: {code} public static void sleep(long millis) { try { Thread.sleep(millis); } catch (InterruptedException e) { e.printStackTrace(); Thread.currentThread().interrupt(); } } {code} From above implementation, the current thread's interrupt status is restored/reset when InterruptedException is caught and handled. If this method is called within a while/for loop, if a first InterruptedException is thrown during sleep, it will make the Threads.sleep in next loop immediately throw InterruptedException without expected sleep. This behavior breaks the intention for independent sleep in each loop I mentioned above in HBASE-10497 and this jira is created to handle it separately per [~nkeywal]'s suggestion -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10355) Failover RPC's from client using region replicas
[ https://issues.apache.org/jira/browse/HBASE-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13901221#comment-13901221 ] Nicolas Liochon commented on HBASE-10355: - bq. the information that he sees would be partial Hum. Yes, it's not well managed in the patch as it is. It's a little bit complex, because the contract of the getLocation is to retry, so the code expects the locations to be good. If we don't do that, we will have two level of retries (one in getLocation, one in the retrying callable). I can either manage the case explicitly, either getLocation should return a full good location set. On the same range, if we change dynamically (without disable / enable) the number of replica, the client we would be surprised. Lastly, as it is today, we stop on the first 'definitive' error: we don't wait for all replica to fails, we stop when one of the replica says 'it does not work'. It's a trade off, doing differently could make sense as well. bq. Maybe if the location is null, the client should hit the meta always? I can do that (force reload). About the jdk 1.7, it's the completion service that is not available on 1.6. It's not a big dependency, but makes the code much simpler and safer. We actually didn't conclude when we discussed 1.7 in this thread, 6 months ago. http://search-hadoop.com/m/X9ivp1d7Xoc2subj=Re+Default+JDK+settings+for+Jenkins+projects Failover RPC's from client using region replicas Key: HBASE-10355 URL: https://issues.apache.org/jira/browse/HBASE-10355 Project: HBase Issue Type: Sub-task Components: Client Reporter: Enis Soztutar Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10355.v1.patch -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10355) Failover RPC's from client using region replicas
[ https://issues.apache.org/jira/browse/HBASE-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900140#comment-13900140 ] Nicolas Liochon commented on HBASE-10355: - Another dependency btw is that I'm using some jdk 1.7 API. This patch is awaiting reviews :-) Failover RPC's from client using region replicas Key: HBASE-10355 URL: https://issues.apache.org/jira/browse/HBASE-10355 Project: HBase Issue Type: Sub-task Components: Client Reporter: Enis Soztutar Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10355.v1.patch -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900141#comment-13900141 ] Nicolas Liochon commented on HBASE-10490: - bq. redundant given that in setException there is a notify already being done yes, exactly. Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10511) Add latency percentiles on PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10511: Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Add latency percentiles on PerformanceEvaluation Key: HBASE-10511 URL: https://issues.apache.org/jira/browse/HBASE-10511 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10511.v1.patch The latency is reported as an array or float. It's easier to deal with percentiles :-) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10511) Add latency percentiles on PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900146#comment-13900146 ] Nicolas Liochon commented on HBASE-10511: - Committed as it is. If someone wants the raw data, we can use JMk/Nic idea of logging as trace/debug (but may be this someone will say that he doesn't want to use trace/debug because it changes the latency, this kind of of things. That's why I prefer to wait here). Thanks for the review. [~andrew.purt...@gmail.com] What's the process to put it in the .98 while a RC is in progress. Do I have to wait for the end of the RC? Add latency percentiles on PerformanceEvaluation Key: HBASE-10511 URL: https://issues.apache.org/jira/browse/HBASE-10511 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10511.v1.patch The latency is reported as an array or float. It's easier to deal with percentiles :-) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Attachment: 10490.v5.patch Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch, 10490.v5.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Status: Open (was: Patch Available) Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch, 10490.v5.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900161#comment-13900161 ] Nicolas Liochon commented on HBASE-10490: - v5 is what I will commit. Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch, 10490.v5.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Status: Patch Available (was: Open) Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch, 10490.v5.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Attachment: 10490.v6.patch Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch, 10490.v5.patch, 10490.v6.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed, thanks for the reviews, all. Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch, 10490.v5.patch, 10490.v6.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900361#comment-13900361 ] Nicolas Liochon commented on HBASE-10490: - The remaining issues are: - the broken rpc timeout - the sometimes strange synchronization But it's already a nice progress. Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch, 10490.v5.patch, 10490.v6.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
Nicolas Liochon created HBASE-10525: --- Summary: Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10525: Status: Patch Available (was: Open) Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10185: Assignee: (was: Nicolas Liochon) HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Fix For: 0.99.0 Attachments: 10185.v1.patch, 10185.v2.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10525: Attachment: 10525.v1.patch Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900418#comment-13900418 ] Nicolas Liochon commented on HBASE-10185: - The patch v2 has been incorporated in HBASE-10490 (trunk). We can use this jira for the 0.94 version. HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Fix For: 0.99.0 Attachments: 10185.v1.patch, 10185.v2.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10525: Attachment: 10525.v2.patch Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900449#comment-13900449 ] Nicolas Liochon commented on HBASE-10525: - TestIPC uses a rpcTimeout of zero. Patch updated. Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10525: Status: Patch Available (was: Open) Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10525: Status: Open (was: Patch Available) Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10524) Correct wrong handling and add proper handling for swallowed InterruptedException thrown by Thread.sleep in regionserver
[ https://issues.apache.org/jira/browse/HBASE-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900465#comment-13900465 ] Nicolas Liochon commented on HBASE-10524: - Yeah, for HRegionServer I think you're right. Correct wrong handling and add proper handling for swallowed InterruptedException thrown by Thread.sleep in regionserver Key: HBASE-10524 URL: https://issues.apache.org/jira/browse/HBASE-10524 Project: HBase Issue Type: Sub-task Components: regionserver Reporter: Feng Honghua Assignee: Feng Honghua Attachments: HBASE-10524-trunk_v1.patch A sub-task of HBASE-10497 # correct wrong handling of InterruptedException where Thread.currentThread.interrupt() is called within while loops # add proper handling for swallowed InterruptedException -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10524) Correct wrong handling and add proper handling for swallowed InterruptedException thrown by Thread.sleep in regionserver
[ https://issues.apache.org/jira/browse/HBASE-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900476#comment-13900476 ] Nicolas Liochon commented on HBASE-10524: - Agreed as well for SplitLogWorker. I would propose not to log the exception however. The existing code was doing it, but it seems wrong. We should log when we receive the interruption, not when we restore it. (note: you should be able to 'submit' a patch, this will trigger a hadoop-qa run). Correct wrong handling and add proper handling for swallowed InterruptedException thrown by Thread.sleep in regionserver Key: HBASE-10524 URL: https://issues.apache.org/jira/browse/HBASE-10524 Project: HBase Issue Type: Sub-task Components: regionserver Reporter: Feng Honghua Assignee: Feng Honghua Attachments: HBASE-10524-trunk_v1.patch A sub-task of HBASE-10497 # correct wrong handling of InterruptedException where Thread.currentThread.interrupt() is called within while loops # add proper handling for swallowed InterruptedException -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13900495#comment-13900495 ] Nicolas Liochon commented on HBASE-10525: - bq. I dont follow the above exactly? It's basically a Java issue. Interrupting a process that is doing some i/o is complex, and difficult to do in the same way on all platforms. So the JVM guys went for the solution that was implementable on all platforms: close the socket when there is an interrupt. bq. What will 'name' be? The same name as the reader, postfixed with 'writer'). bq. yet another thread in client seems crazy Yeah. But I think that we're not far from being able to put a single pool for all the region servers, instead of 1 thread per region server as today (and with this new option on). There is this nasty rpcTimeout, and then we're clean enough to do the change I think. It's something I would like to do with a direct buffer write. Ir's more or less mandatory to scale on large sized cluster imho. bq. ThreadCallSender should be CallRunner or Call Executor I don't know. It does not execute much, it just writes on the socket. It uses the output stream of the connection, so it has to be an instance if I want to limit the impact on the existing code. bq. + callsToWrite.remove(cts); Do you have to cancel the future itself too? setting call.done allows the reading thread to skip this part removing it from the callsToWrite list allows us to not send the call to the remote server at all. So we do both. Now if the call was sent to the server, it's not stopped (at least with this patch :-), just that we will skip the answer. Canceling on the server is interesting, but on the other hand it's another rpc call. bq. Imlement Closeable since it has a close? Not necessary at all but... agreed. Will do. bq. Can you write up a big class comment on the mechanism you are instituting here so it is clear what is going on both for reviewers and for the folks who come along afterward trying to make sense of it all Sure. I'm polishing a doc as well that I will put in this jira. Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10525) Allow the client to use a different thread for writing to ease interrupt
[ https://issues.apache.org/jira/browse/HBASE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10525: Attachment: HBaseclient-EventualConsistency.pdf Allow the client to use a different thread for writing to ease interrupt Key: HBASE-10525 URL: https://issues.apache.org/jira/browse/HBASE-10525 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10525.v1.patch, 10525.v2.patch, HBaseclient-EventualConsistency.pdf This is an issue in the HBASE-10070 context, but as well more generally if you want to interrupt an operation with a limited cost. I will attach a doc with a more detailed explanation. This adds a thread per region server; so it's otional. The first patch activates it by default to see how it behaves on a full hadoop-qa run. The target is to be unset by default. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899132#comment-13899132 ] Nicolas Liochon commented on HBASE-10490: - As well, a rpcTimeout set to 0 might still work, or at least still work as it used to. The code around timeout is still difficult. I kept the ThreadLocal, but this code scares me, and clearly does not work for all cases. It's an issue for me, because when something goes wrong here I'm never sure it does not come from a race condition. As well, for example, we can set very low timeouts. So the cleanup on timeout is not in this jira, but will have to be done a day for sure. Anyway, here a a v2 with the typo fixed and the removal of the closeReason. Thanks for the feedback Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Status: Open (was: Patch Available) Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Attachment: 10490.v2.patch Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899138#comment-13899138 ] Nicolas Liochon commented on HBASE-10490: - btw, The v2 removes ~140 LoC in RpcClient (~10%) Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10506) Fail-fast if client connection is lost before the real call be executed in RPC layer
[ https://issues.apache.org/jira/browse/HBASE-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899145#comment-13899145 ] Nicolas Liochon commented on HBASE-10506: - I think it would be better to log at the debug level, with a nice if LOG.isDebugEnabled, and to say something like client is disconnected, skipping + call. Fail-fast if client connection is lost before the real call be executed in RPC layer Key: HBASE-10506 URL: https://issues.apache.org/jira/browse/HBASE-10506 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.3 Reporter: Liang Xie Assignee: Liang Xie Fix For: 0.98.0, 0.96.2, 0.99.0, 0.94.17 Attachments: HBASE-10506-0.94.txt, HBASE-10506-trunk.txt In current HBase rpc impletement, there is no any connection double-checking just before the call be invoked, considing there's a gc or other OS scheduling or the call queue is full enough(e.g. the server side is slow/hang due to some issues), and if the client side has a small rpc timeout value, it could be possible when this request be taken from call queue, the client connection is lost in that moment. we'd better has some fail-fast code before the reall call be invoked, it just waste the server side resource. Here is a strace trace from our production env: 2014-02-11,18:16:19,525 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call get([B@3eae6c77, {timeRange:[0,9223372036854775807],totalColumns:1,cacheBlocks:true,families:{X:[T]},maxVersions:1,row:0741031-m8997060}), rpc version=1, client version=29, methodsFingerPrint=-241105381 from 10.101.10.181:43252: output error 2014-02-11,18:16:19,526 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 151 on 12600 caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null 2014-02-11,18:16:19,797 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call get([B@3f10ffd2, {timeRange:[0,9223372036854775807],totalColumns:1,cacheBlocks:true,families:{X:[T]},maxVersions:1,row:4245978-m7281526}), rpc version=1, client version=29, methodsFingerPrint=-241105381 from 10.101.10.181:43259 after 0 ms, since caller disconnected at org.apache.hadoop.hbase.ipc.HBaseServer$Call.throwExceptionIfCallerDisconnected(HBaseServer.java:450) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3633) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3590) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3615) at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4414) at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4387) at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2075) at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Server.call(SecureRpcEngine.java:460) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1457) 2014-02-11,18:16:19,802 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call get([B@3f10ffd2, {timeRange:[0,9223372036854775807],totalColumns:1,cacheBlocks:true,families:{X:[T]},maxVersions:1,row:4245978-m7281526}), rpc version=1, client version=29, methodsFingerPrint=-241105381 from 10.101.10.181:43259: output error 2014-02-11,18:16:19,802 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 46 on 12600 caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null With this fix, we can reduce this hit probability at least:) the upstream hadoop has this checking already, see: https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java#L2034-L2036 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Status: Patch Available (was: Open) Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10497) Add standard handling for swallowed InterruptedException thrown by Thread.sleep under HBase-Client/HBase-Server folders systematically
[ https://issues.apache.org/jira/browse/HBASE-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899374#comment-13899374 ] Nicolas Liochon commented on HBASE-10497: - bq. The helper method Threads.sleep is implemented as below: Wow. We should just remove Threads.sleep: it's a sure way to shoot ourselves in the foot... (it doesn't have to be done in this patch) Add standard handling for swallowed InterruptedException thrown by Thread.sleep under HBase-Client/HBase-Server folders systematically -- Key: HBASE-10497 URL: https://issues.apache.org/jira/browse/HBASE-10497 Project: HBase Issue Type: Improvement Components: Client, regionserver Reporter: Feng Honghua Assignee: Feng Honghua Priority: Minor Attachments: HBASE-10497-trunk_v1.patch, HBASE-10497-trunk_v2.patch There are many places where InterruptedException thrown by Thread.sleep are swallowed silently (which are neither declared in the caller method's throws clause nor rethrown immediately) under HBase-Client/HBase-Server folders. It'd be better to add standard 'log and call currentThread.interrupt' for such cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10497) Add standard handling for swallowed InterruptedException thrown by Thread.sleep under HBase-Client/HBase-Server folders systematically
[ https://issues.apache.org/jira/browse/HBASE-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10497: Status: Patch Available (was: Open) Add standard handling for swallowed InterruptedException thrown by Thread.sleep under HBase-Client/HBase-Server folders systematically -- Key: HBASE-10497 URL: https://issues.apache.org/jira/browse/HBASE-10497 Project: HBase Issue Type: Improvement Components: Client, regionserver Reporter: Feng Honghua Assignee: Feng Honghua Priority: Minor Attachments: HBASE-10497-trunk_v1.patch, HBASE-10497-trunk_v2.patch There are many places where InterruptedException thrown by Thread.sleep are swallowed silently (which are neither declared in the caller method's throws clause nor rethrown immediately) under HBase-Client/HBase-Server folders. It'd be better to add standard 'log and call currentThread.interrupt' for such cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10497) Add standard handling for swallowed InterruptedException thrown by Thread.sleep under HBase-Client/HBase-Server folders systematically
[ https://issues.apache.org/jira/browse/HBASE-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899422#comment-13899422 ] Nicolas Liochon commented on HBASE-10497: - I've submitted the patch, to see what hadoop-qa says. The patch is big. It touches some scaring part. I would propose to split it: - some parts are in the mini cluster: even if we do something wrong here it's not a big deal - some parts seems good: RpcClient (I can +1 this) - some parts are more complex: MemStoreFlusher: I'm not sure. It's a runnable, restoring the status seems meaningless; SplitLogWorker seems wrong, it's a runnable ;Leases. I don't know. Could be wrong again; HRegionServer - some parts, imho, should not be managed by restoring the status but by throwing an exception or even explaining why we do nothing. That's the case here. {code} public TableListModel getTableList() throws IOException { StringBuilder path = new StringBuilder(); path.append('/'); if (accessToken != null) { path.append(accessToken); path.append('/'); } int code = 0; for (int i = 0; i maxRetries; i++) { // Response response = client.get(path.toString(), // Constants.MIMETYPE_XML); Response response = client.get(path.toString(), Constants.MIMETYPE_PROTOBUF); code = response.getCode(); switch (code) { case 200: TableListModel t = new TableListModel(); return (TableListModel) t.getObjectFromMessage(response.getBody()); case 404: throw new IOException(Table list not found); case 509: try { Thread.sleep(sleepTime); } catch (InterruptedException e) { thrown new InterruptedIOException(); } break; default: throw new IOException(get request to + path.toString() + request returned + code); } } throw new IOException(get request to + path.toString() + request timed out); } {code} So at the end, we could have a few sub jiras: - the ones with a high level of confidence - the less confident ones bug not very risky - the ones when we don't restore the status but we properly manage the exception - a set of small ones per area for the others. They will be easier to review. Add standard handling for swallowed InterruptedException thrown by Thread.sleep under HBase-Client/HBase-Server folders systematically -- Key: HBASE-10497 URL: https://issues.apache.org/jira/browse/HBASE-10497 Project: HBase Issue Type: Improvement Components: Client, regionserver Reporter: Feng Honghua Assignee: Feng Honghua Priority: Minor Attachments: HBASE-10497-trunk_v1.patch, HBASE-10497-trunk_v2.patch There are many places where InterruptedException thrown by Thread.sleep are swallowed silently (which are neither declared in the caller method's throws clause nor rethrown immediately) under HBase-Client/HBase-Server folders. It'd be better to add standard 'log and call currentThread.interrupt' for such cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899463#comment-13899463 ] Nicolas Liochon commented on HBASE-10490: - I've have 3 +1, but it could more on the idea of removing the ping. If the hadoop-qa is fine, I plan to commit this tomorrow, please tell me if you want to do a deeper review and need more time. Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899463#comment-13899463 ] Nicolas Liochon edited comment on HBASE-10490 at 2/12/14 7:24 PM: -- I have 3 +1, but it could more on the idea of removing the ping. If the hadoop-qa is fine, I plan to commit this tomorrow, please tell me if you want to do a deeper review and need more time. was (Author: nkeywal): I've have 3 +1, but it could more on the idea of removing the ping. If the hadoop-qa is fine, I plan to commit this tomorrow, please tell me if you want to do a deeper review and need more time. Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10354) Add an API for defining consistency per request
[ https://issues.apache.org/jira/browse/HBASE-10354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10354: Attachment: 10354.add.patch Add an API for defining consistency per request --- Key: HBASE-10354 URL: https://issues.apache.org/jira/browse/HBASE-10354 Project: HBase Issue Type: Sub-task Components: Client Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.99.0, hbase-10070 Attachments: 10354.add.patch, hbase-10354_v1.patch, hbase-10354_v2.patch We should add an API to be able to define the expected consistency level per operation. This API should also allow to check whether the results coming from a query (get or scan) is stale or not. The defaults should reflect the current semantics. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10354) Add an API for defining consistency per request
[ https://issues.apache.org/jira/browse/HBASE-10354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899470#comment-13899470 ] Nicolas Liochon commented on HBASE-10354: - An addendum for the serialization of Result. Add an API for defining consistency per request --- Key: HBASE-10354 URL: https://issues.apache.org/jira/browse/HBASE-10354 Project: HBase Issue Type: Sub-task Components: Client Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.99.0, hbase-10070 Attachments: 10354.add.patch, hbase-10354_v1.patch, hbase-10354_v2.patch We should add an API to be able to define the expected consistency level per operation. This API should also allow to check whether the results coming from a query (get or scan) is stale or not. The defaults should reflect the current semantics. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10511) Add latency percentiles on PerformanceEvaluation
Nicolas Liochon created HBASE-10511: --- Summary: Add latency percentiles on PerformanceEvaluation Key: HBASE-10511 URL: https://issues.apache.org/jira/browse/HBASE-10511 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 The latency is reported as an array or float. It's easier to deal with percentiles :-) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10511) Add latency percentiles on PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10511: Status: Patch Available (was: Open) Add latency percentiles on PerformanceEvaluation Key: HBASE-10511 URL: https://issues.apache.org/jira/browse/HBASE-10511 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10511.v1.patch The latency is reported as an array or float. It's easier to deal with percentiles :-) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10511) Add latency percentiles on PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10511: Attachment: 10511.v1.patch Add latency percentiles on PerformanceEvaluation Key: HBASE-10511 URL: https://issues.apache.org/jira/browse/HBASE-10511 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10511.v1.patch The latency is reported as an array or float. It's easier to deal with percentiles :-) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10511) Add latency percentiles on PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899505#comment-13899505 ] Nicolas Liochon commented on HBASE-10511: - bq. I will have kept the LOG.info(randomRead latency log (ms): + Arrays.toString(times)); If you have a lot of measure, it makes the output terrible. We could add another option, but it seems overkill for something which is not 'core'. If someone wants to do it I'm fine, but for myself I would love to keep it as simple as possible... Add latency percentiles on PerformanceEvaluation Key: HBASE-10511 URL: https://issues.apache.org/jira/browse/HBASE-10511 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10511.v1.patch The latency is reported as an array or float. It's easier to deal with percentiles :-) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10511) Add latency percentiles on PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-10511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899681#comment-13899681 ] Nicolas Liochon commented on HBASE-10511: - Any issue if I commit it like it is? Add latency percentiles on PerformanceEvaluation Key: HBASE-10511 URL: https://issues.apache.org/jira/browse/HBASE-10511 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10511.v1.patch The latency is reported as an array or float. It's easier to deal with percentiles :-) -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Status: Open (was: Patch Available) Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Attachment: 10490.v3.patch Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Status: Patch Available (was: Open) Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899689#comment-13899689 ] Nicolas Liochon commented on HBASE-10490: - The issue raised by hadoop-qa was real. the lock order is Connection instance - DataOutputStream. Doing it in the other side leads to dead locks.. Fixed. I put the calls counter in an atomic counter as well. Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899826#comment-13899826 ] Nicolas Liochon commented on HBASE-10490: - The warnings are false positives that can be fixed by with the ad hoc findbugs setting. I will do that on commit/ Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10355) Failover RPC's from client using region replicas
[ https://issues.apache.org/jira/browse/HBASE-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10355: Attachment: 10355.v1.patch Failover RPC's from client using region replicas Key: HBASE-10355 URL: https://issues.apache.org/jira/browse/HBASE-10355 Project: HBase Issue Type: Sub-task Components: Client Reporter: Enis Soztutar Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10355.v1.patch -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10355) Failover RPC's from client using region replicas
[ https://issues.apache.org/jira/browse/HBASE-10355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899917#comment-13899917 ] Nicolas Liochon commented on HBASE-10355: - the patch relies on: - basically everything else in the 10070 scope - interruption like 10472 - cleanup like 10490 - and another patch around interruption in RpcClient that I will submit later on. Failover RPC's from client using region replicas Key: HBASE-10355 URL: https://issues.apache.org/jira/browse/HBASE-10355 Project: HBase Issue Type: Sub-task Components: Client Reporter: Enis Soztutar Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10355.v1.patch -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13899966#comment-13899966 ] Nicolas Liochon commented on HBASE-10490: - It's a typo. It's interesting that it works actually. Thanks for catching this. I will fix this on commit. Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch, 10490.v2.patch, 10490.v3.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13897660#comment-13897660 ] Nicolas Liochon commented on HBASE-10490: - Nice catch. The max idle time is actually used on the server as well. But it does not really work, because the default are different: server: maxIdleTime = 2*conf.getInt(ipc.client.connection.maxidletime, 1000); client: maxIdleTime = conf.getInt(hbase.ipc.client.connection.maxidletime, 1); //10s So it means that the server disconnects any client that have not spoken for 2 seconds; while the client pings every 10 seconds. not as well that one is prefixed by 'hbase.' while the other is not. In 2008 they were sharing the same name then it diverged. I suppose the code on the server doesn't do much because of this: {code} protected int thresholdIdleConnections; // the number of idle // connections after which we // will start cleaning up idle // connections {code} thresholdIdleConnections is defaulted to 4000. Likely it never happens, and if it was happening it would not work because of the difference in default values. I suppose the best way of doing this is: order the connection not used for at least x seconds, kills some of the oldest. But, we can say as well that if we're satisfied with the way the server behaves today, we can remove the ping on the client without changing anything in the server: the behavior won't change. Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10497) Add standard handling for swallowed InterruptedException thrown by Thread.sleep under HBase-Client/HBase-Server folders systematically
[ https://issues.apache.org/jira/browse/HBASE-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13897708#comment-13897708 ] Nicolas Liochon commented on HBASE-10497: - I'm not sure of this one. {code} +++ hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSyncUp.java (working copy) @@ -111,6 +111,7 @@ } } catch (InterruptedException e) { System.err.println(didn't wait long enough: + e); + Thread.currentThread().interrupt(); return (-1); } {code} Because we took care of the interruption already (may be wrongly however), so the thread is not interrupted any more: the -1 means: we stop. It's the same for the one with the split worker likely. This one is likely wrong: {code} @@ -185,6 +185,8 @@ try { Thread.sleep(100); } catch (InterruptedException ignored) { +LOG.warn(Interrupted while sleeping); +Thread.currentThread().interrupt(); } if (System.currentTimeMillis() startTime + 3) { throw new RuntimeException(Master not active after 30 seconds); {code} As it's inside a loop, we're likely to loop again, then the sleep will be interrupted immediately as we restored the interruption status, then we will log again = we will flood the logs. I haven't checked the whole patch, but inside a loop you can't simply restore the status, you need to take a decision (stop the loop) or store the interruption to restore the status later. When we can, it's better to take care of the interruption explicitly by stopping our process and/or rethrowing an exception to the caller. In the case above, may be be we should throw a runtime exception, as in Master not active after 30 seconds? See as well ExceptionUtils (it's new). Add standard handling for swallowed InterruptedException thrown by Thread.sleep under HBase-Client/HBase-Server folders systematically -- Key: HBASE-10497 URL: https://issues.apache.org/jira/browse/HBASE-10497 Project: HBase Issue Type: Improvement Components: Client, regionserver Reporter: Feng Honghua Assignee: Feng Honghua Priority: Minor Attachments: HBASE-10497-trunk_v1.patch There are many places where InterruptedException thrown by Thread.sleep are swallowed silently (which are neither declared in the caller method's throws clause nor rethrown immediately) under HBase-Client/HBase-Server folders. It'd be better to add standard 'log and call currentThread.interrupt' for such cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Attachment: 10490.v1.patch Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10490: Status: Patch Available (was: Open) Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10490) Simplify RpcClient code
Nicolas Liochon created HBASE-10490: --- Summary: Simplify RpcClient code Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10490) Simplify RpcClient code
[ https://issues.apache.org/jira/browse/HBASE-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13896802#comment-13896802 ] Nicolas Liochon commented on HBASE-10490: - bq. Spelling minIddleTimeBeforeClose Sure. bq. i.e. keeping around the exception We're clean now: we now use it only in logs and as initCause in the exceptions we throw. So we can as well fully remove it (basically, between the boolean and the exception, I kept the exception, but keeping only the boolean should be fine as well). Simplify RpcClient code --- Key: HBASE-10490 URL: https://issues.apache.org/jira/browse/HBASE-10490 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10490.v1.patch The code is complex. Here is a set of proposed changes, for trunk: 1) remove PingInputStream. if rpcTimeout 0 it just rethrows the exception. I expect that we always have a rpcTimeout. So we can remove the code. 2) remove the sendPing: instead, just close the connection if it's not used for a while, instead of trying to ping the server. 3) remove maxIddle time: to avoid the confusion if someone has overwritten the conf. 4) remove shouldCloseConnection: it was more or less synchronized with closeException. Having a single variable instead of two avoids the synchro 5) remove lastActivity: instead of trying to have an exact timeout, just kill the connection after some time. lastActivity could be set to wrong values if the server was slow to answer. 6) hopefully, a better management of the exception; we don't use the close exception of someone else as an input for another one. Same goes for interruption. I may have something wrong in the code. I will review it myself again. Feedback welcome, especially on the ping removal: I hope I got all the use cases. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10472) Manage the interruption in ZKUtil#getData
[ https://issues.apache.org/jira/browse/HBASE-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13894331#comment-13894331 ] Nicolas Liochon commented on HBASE-10472: - bq. I guess we need this for checking the meta location from zk right? Yes. bq. Do we need that to be interruptible as well? If the API is synchronous, it needs to be interruptible... bq. I guess we might need it if we ever want to go meta replicas. Actually, we need it already, for example because a split task can be interrupted. Some HBase code relies on being interruptible already. With replicas, or without replica, you may want to rely on this to free the handlers (for example, if we get stuck on a region, we may use all the handlers of the region server today: if the client cancels the query interrupting the call does help). bq. We are not setting the interrupted flag for the thread in ReplicationPeersZKImpl, ZKLeaderManager If we throw an exception, it means that we took care of the interruption, so we don't need to reset the interrupt flag (and should not). If we abort the server, we can consider that we took care of the interruption. Keeping the thread interrupted could mean interrupting the abort, which may not be what was expected. bq. Can we get away with some of the changes, if we keep the original ZKUtil.getData() and have ZKUtil.getDataInterruptible() ? Not really, imho: we need the code to be clean when there are interruptions. ZooKeeper API is clean. Our wrapping is not clean today. Manage the interruption in ZKUtil#getData - Key: HBASE-10472 URL: https://issues.apache.org/jira/browse/HBASE-10472 Project: HBase Issue Type: Bug Components: Client, master, regionserver Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.98.1, 0.99.0 Attachments: 10472.v1.patch, 10472.v2.patch Many, but not all, methods in ZKUtil partly hides the interruption: they return null or something like that. Many times, this will result in a NPE, or something undefined. This jira is limited to the getData to keep things small enough (it's used in hbase-client code). The code is supposed to behave at least 'as well as before', or better (hopefully). It could be included in a later .98 release (98.1) and in .99 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10185: Status: Open (was: Patch Available) HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10185.v1.patch, 10185.v2.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10185: Status: Patch Available (was: Open) HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10185.v1.patch, 10185.v2.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10185: Attachment: 10185.v2.patch HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10185.v1.patch, 10185.v2.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13894391#comment-13894391 ] Nicolas Liochon commented on HBASE-10185: - v2 cleans up the way we manage the 'calls' list. synchronization should be as today everywhere. HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10185.v1.patch, 10185.v2.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10472) Manage the interruption in ZKUtil#getData
[ https://issues.apache.org/jira/browse/HBASE-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13895011#comment-13895011 ] Nicolas Liochon commented on HBASE-10472: - bq. we can make ZkUtil methods interruptible one by one in the long term. Yep. Thanks for the review, I've committed to the .99. Manage the interruption in ZKUtil#getData - Key: HBASE-10472 URL: https://issues.apache.org/jira/browse/HBASE-10472 Project: HBase Issue Type: Bug Components: Client, master, regionserver Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.98.1, 0.99.0 Attachments: 10472.v1.patch, 10472.v2.patch Many, but not all, methods in ZKUtil partly hides the interruption: they return null or something like that. Many times, this will result in a NPE, or something undefined. This jira is limited to the getData to keep things small enough (it's used in hbase-client code). The code is supposed to behave at least 'as well as before', or better (hopefully). It could be included in a later .98 release (98.1) and in .99 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10472) Manage the interruption in ZKUtil#getData
[ https://issues.apache.org/jira/browse/HBASE-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10472: Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Manage the interruption in ZKUtil#getData - Key: HBASE-10472 URL: https://issues.apache.org/jira/browse/HBASE-10472 Project: HBase Issue Type: Bug Components: Client, master, regionserver Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.98.1, 0.99.0 Attachments: 10472.v1.patch, 10472.v2.patch Many, but not all, methods in ZKUtil partly hides the interruption: they return null or something like that. Many times, this will result in a NPE, or something undefined. This jira is limited to the getData to keep things small enough (it's used in hbase-client code). The code is supposed to behave at least 'as well as before', or better (hopefully). It could be included in a later .98 release (98.1) and in .99 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10472) Manage the interruption in ZKUtil#getData
[ https://issues.apache.org/jira/browse/HBASE-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10472: Fix Version/s: (was: 0.98.1) Manage the interruption in ZKUtil#getData - Key: HBASE-10472 URL: https://issues.apache.org/jira/browse/HBASE-10472 Project: HBase Issue Type: Bug Components: Client, master, regionserver Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10472.v1.patch, 10472.v2.patch Many, but not all, methods in ZKUtil partly hides the interruption: they return null or something like that. Many times, this will result in a NPE, or something undefined. This jira is limited to the getData to keep things small enough (it's used in hbase-client code). The code is supposed to behave at least 'as well as before', or better (hopefully). It could be included in a later .98 release (98.1) and in .99 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13895140#comment-13895140 ] Nicolas Liochon commented on HBASE-10185: - There is actually an issue with the v2 (unrelated to the precommit issue): in waitForWork, we have some special conditions on the calls.empty(), and we can now trigger them. I will fix this. HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10185.v1.patch, 10185.v2.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13893195#comment-13893195 ] Nicolas Liochon commented on HBASE-10185: - Yes, if it's an issue in trunk, we should solve it at least there. In addCall, we reusing the exception accross calls, so other calls may receive 'DoNotRetry'. We can't take a decision based on the exception that caused the connection to be closed, as it's shared between the calls. Connection closed should come from connection issue, i.e. we don't expect a DoNotRetryIOException, as a connection error will be retried all the time. May be the issue is that an error (or some errors?) in Writable.write should not close the connection... Or that Writable.write should just not send a DoNotRetryIOException: it's an application level exception, difficult to manage in the network connectivity layer... I suppose that the 0.96+ cannot have this, as the serialization is done with protobuf. HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12 Reporter: Samarth Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10472) Manage the interruption in ZKUtil#getData
[ https://issues.apache.org/jira/browse/HBASE-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10472: Status: Open (was: Patch Available) Manage the interruption in ZKUtil#getData - Key: HBASE-10472 URL: https://issues.apache.org/jira/browse/HBASE-10472 Project: HBase Issue Type: Bug Components: Client, master, regionserver Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.98.1, 0.99.0 Attachments: 10472.v1.patch, 10472.v2.patch Many, but not all, methods in ZKUtil partly hides the interruption: they return null or something like that. Many times, this will result in a NPE, or something undefined. This jira is limited to the getData to keep things small enough (it's used in hbase-client code). The code is supposed to behave at least 'as well as before', or better (hopefully). It could be included in a later .98 release (98.1) and in .99 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10472) Manage the interruption in ZKUtil#getData
[ https://issues.apache.org/jira/browse/HBASE-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10472: Attachment: 10472.v2.patch Manage the interruption in ZKUtil#getData - Key: HBASE-10472 URL: https://issues.apache.org/jira/browse/HBASE-10472 Project: HBase Issue Type: Bug Components: Client, master, regionserver Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.98.1, 0.99.0 Attachments: 10472.v1.patch, 10472.v2.patch Many, but not all, methods in ZKUtil partly hides the interruption: they return null or something like that. Many times, this will result in a NPE, or something undefined. This jira is limited to the getData to keep things small enough (it's used in hbase-client code). The code is supposed to behave at least 'as well as before', or better (hopefully). It could be included in a later .98 release (98.1) and in .99 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10472) Manage the interruption in ZKUtil#getData
[ https://issues.apache.org/jira/browse/HBASE-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10472: Status: Patch Available (was: Open) v2 fixes the findbugs warnings. Manage the interruption in ZKUtil#getData - Key: HBASE-10472 URL: https://issues.apache.org/jira/browse/HBASE-10472 Project: HBase Issue Type: Bug Components: Client, master, regionserver Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Fix For: 0.98.1, 0.99.0 Attachments: 10472.v1.patch, 10472.v2.patch Many, but not all, methods in ZKUtil partly hides the interruption: they return null or something like that. Many times, this will result in a NPE, or something undefined. This jira is limited to the getData to keep things small enough (it's used in hbase-client code). The code is supposed to behave at least 'as well as before', or better (hopefully). It could be included in a later .98 release (98.1) and in .99 -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HBASE-10477) Regression from HBASE-10337
Nicolas Liochon created HBASE-10477: --- Summary: Regression from HBASE-10337 Key: HBASE-10477 URL: https://issues.apache.org/jira/browse/HBASE-10477 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Priority: Critical Fix For: 0.98.0, 0.99.0 a piece of code that should not have make it... ping [~andrew.purt...@gmail.com] -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10477) Regression from HBASE-10337
[ https://issues.apache.org/jira/browse/HBASE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10477: Attachment: 10477.v1.patch Regression from HBASE-10337 --- Key: HBASE-10477 URL: https://issues.apache.org/jira/browse/HBASE-10477 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Priority: Critical Fix For: 0.98.0, 0.99.0 Attachments: 10477.v1.patch a piece of code that should not have make it... ping [~andrew.purt...@gmail.com] -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10477) Regression from HBASE-10337
[ https://issues.apache.org/jira/browse/HBASE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10477: Status: Patch Available (was: Open) Regression from HBASE-10337 --- Key: HBASE-10477 URL: https://issues.apache.org/jira/browse/HBASE-10477 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Priority: Critical Fix For: 0.98.0, 0.99.0 Attachments: 10477.v1.patch a piece of code that should not have make it... ping [~andrew.purt...@gmail.com] -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13893409#comment-13893409 ] Nicolas Liochon commented on HBASE-10185: - looking more at the code, it seems that writeRequest could send an exception if it fails during the write. Then this exception could be thrown by the caller, w/o risking to send an exception created by somebody else. HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12 Reporter: Samarth Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13893452#comment-13893452 ] Nicolas Liochon commented on HBASE-10185: - I'm not far from jira hijacking here ;-) I'm reopening to submit the patch on trunk. I expect the code to be close on .94 actually. We've got something a little bit similar on trunk, because of the codec: we can imagine a specific codec that cannot handle some kind of messages. Then it will need to send a doNotRetry, and we should not need to close the connection, as the tcp connection itself is fine. HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Fix For: 0.99.0 Attachments: 10185.v1.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10185: Fix Version/s: 0.99.0 Affects Version/s: 0.99.0 Status: Patch Available (was: Reopened) HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Fix For: 0.99.0 Attachments: 10185.v1.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10185: Attachment: 10185.v1.patch HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Fix For: 0.99.0 Attachments: 10185.v1.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Reopened] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon reopened HBASE-10185: - HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Fix For: 0.99.0 Attachments: 10185.v1.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HBASE-10477) Regression from HBASE-10337
[ https://issues.apache.org/jira/browse/HBASE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon updated HBASE-10477: Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Regression from HBASE-10337 --- Key: HBASE-10477 URL: https://issues.apache.org/jira/browse/HBASE-10477 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Priority: Critical Fix For: 0.98.0, 0.99.0 Attachments: 10477.v1.patch a piece of code that should not have make it... ping [~andrew.purt...@gmail.com] -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HBASE-10477) Regression from HBASE-10337
[ https://issues.apache.org/jira/browse/HBASE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13893560#comment-13893560 ] Nicolas Liochon commented on HBASE-10477: - Committed, hopefully on time for the current .98 RC. Thanks for the quick review, Nick Regression from HBASE-10337 --- Key: HBASE-10477 URL: https://issues.apache.org/jira/browse/HBASE-10477 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.98.0, 0.99.0 Reporter: Nicolas Liochon Assignee: Nicolas Liochon Priority: Critical Fix For: 0.98.0, 0.99.0 Attachments: 10477.v1.patch a piece of code that should not have make it... ping [~andrew.purt...@gmail.com] -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (HBASE-10185) HBaseClient retries even though a DoNotRetryException was thrown
[ https://issues.apache.org/jira/browse/HBASE-10185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas Liochon reassigned HBASE-10185: --- Assignee: Nicolas Liochon HBaseClient retries even though a DoNotRetryException was thrown Key: HBASE-10185 URL: https://issues.apache.org/jira/browse/HBASE-10185 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.94.12, 0.99.0 Reporter: Samarth Assignee: Nicolas Liochon Fix For: 0.99.0 Attachments: 10185.v1.patch Throwing a DoNotRetryIOException inside Writable.write(Dataoutput) method doesn't prevent HBase from retrying. Debugging the code locally, I figured that the bug lies in the way HBaseClient simply throws an IOException when it sees that a connection has been closed unexpectedly. Method: public Writable call(Writable param, InetSocketAddress addr, Class? extends VersionedProtocol protocol, User ticket, int rpcTimeout) Excerpt of code where the bug is present: while (!call.done) { if (connection.shouldCloseConnection.get()) { throw new IOException(Unexpected closed connection); } Throwing this IOException causes the ServerCallable.translateException(t) to be a no-op resulting in HBase retrying. From my limited view and understanding of the code, one way I could think of handling this is by looking at the closeConnection member variable of a connection to determine what kind of exception should be thrown. Specifically, when a connection is closed, the current code does this: protected synchronized void markClosed(IOException e) { if (shouldCloseConnection.compareAndSet(false, true)) { closeException = e; notifyAll(); } } Within HBaseClient's call method, the code could possibly be modified to: while (!call.done) { if (connection.shouldCloseConnection.get() ) { if(connection.closeException instanceof DoNotRetryIOException) { throw closeException; } throw new IOException(Unexpected closed connection); } -- This message was sent by Atlassian JIRA (v6.1.5#6160)