[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109410#comment-14109410 ]

stack commented on HBASE-11813:
---

The javadoc warning is unrelated. I can fix it on commit though:

[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALCellCodec.java:96: warning - Tag @link: reference not found: cellCodecClsName

> CellScanner#advance may infinitely recurse
> --
>
>                 Key: HBASE-11813
>                 URL: https://issues.apache.org/jira/browse/HBASE-11813
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.99.0, 2.0.0, 0.98.6
>
>         Attachments: 11813.098.txt, 11813.098.txt, 11813.master.txt,
> 11813.master.txt, 11813v2.master.txt, 11813v3.master.txt,
> catch_all_exceptions.txt
>
>
> On user@hbase, johannes.schab...@visual-meta.com reported:
> {quote}
> we face a serious issue with our HBase production cluster for two days now.
> Every couple minutes, a random RegionServer gets stuck and does not process
> any requests. In addition this causes the other RegionServers to freeze
> within a minute which brings down the entire cluster. Stopping the affected
> RegionServer unblocks the cluster and everything comes back to normal.
> {quote}
> Subsequent troubleshooting reveals that RPC is getting stuck because we are
> losing RPC handlers. In the .out files we have this:
> {noformat}
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> [...]
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
> java.lang.StackOverflowError
> {noformat}
> That is the anonymous CellScanner instance we create from
> CellUtil#createCellScanner:
> {code}
>     return new CellScanner() {
>       private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
>       private CellScanner cellScanner = null;
>       @Override
>       public Cell current() {
>         return this.cellScanner != null? this.cellScanner.current(): null;
>       }
>       @Override
>       public boolean advance() throws IOException {
>         if (this.cellScanner == null) {
>           if (!this.iterator.hasNext()) return false;
>           this.cellScanner = this.iterator.next().cellScanner();
>         }
>         if (this.cellScanner.advance()) return true;
>         this.cellScanner = null;
> --->    return advance();
>       }
>     };
> {code}
> That final return statement is the immediate problem.
> We should also fix this so the RegionServer aborts if it loses a handler to
> an Error.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
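For readers following the thread, here is a minimal sketch of how the advance() quoted in the issue description could be rewritten as a loop so that skipping exhausted scanners no longer consumes stack. It is not the committed patch (the attachments are not reproduced in this thread); the nested Cell, CellScanner and CellScannable interfaces are simplified stand-ins for the HBase types of the same names, and scannableOf and countCells are hypothetical helpers added only to exercise the loop.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class IterativeCellScannerSketch {
    // Simplified stand-ins (assumptions) for the HBase interfaces of the same names.
    interface Cell {}
    interface CellScanner {
        Cell current();
        boolean advance() throws IOException;
    }
    interface CellScannable {
        CellScanner cellScanner();
    }

    static CellScanner createCellScanner(final List<? extends CellScannable> cellScannerables) {
        return new CellScanner() {
            private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
            private CellScanner cellScanner = null;

            @Override
            public Cell current() {
                return this.cellScanner != null ? this.cellScanner.current() : null;
            }

            @Override
            public boolean advance() throws IOException {
                // Loop instead of recursing: each exhausted scanner costs one
                // iteration, not one stack frame.
                while (true) {
                    if (this.cellScanner == null) {
                        if (!this.iterator.hasNext()) return false;
                        this.cellScanner = this.iterator.next().cellScanner();
                    }
                    if (this.cellScanner.advance()) return true;
                    this.cellScanner = null;
                }
            }
        };
    }

    /** Hypothetical helper: a scannable whose scanner yields a fixed number of cells. */
    static CellScannable scannableOf(final int cells) {
        return () -> new CellScanner() {
            private int remaining = cells;
            private Cell cell = null;
            @Override public Cell current() { return cell; }
            @Override public boolean advance() {
                if (remaining <= 0) { cell = null; return false; }
                remaining--;
                cell = new Cell() {};
                return true;
            }
        };
    }

    /** Hypothetical helper: scans a run of empty scannables followed by one non-empty one. */
    static int countCells(int emptyScannables, int trailingCells) {
        List<CellScannable> list = new ArrayList<>();
        for (int i = 0; i < emptyScannables; i++) list.add(scannableOf(0));
        list.add(scannableOf(trailingCells));
        CellScanner scanner = createCellScanner(list);
        int count = 0;
        try {
            while (scanner.advance()) count++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // A long run of empty scannables is skipped without deep recursion.
        System.out.println(countCells(1_000_000, 3));
    }
}
```

The loop keeps the observable behaviour of the quoted code (return true on the next available cell, false once every scannable is exhausted) while bounding stack use regardless of how many empty scannables appear in a row.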
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109407#comment-14109407 ]

stack commented on HBASE-11813:
---

Review?
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109309#comment-14109309 ]

Hadoop QA commented on HBASE-11813:
---

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12664157/11813v3.master.txt
  against trunk revision .
  ATTACHMENT ID: 12664157

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 6 warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100

    {color:green}+1 site{color}. The mvn site goal succeeds with this patch.

    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10566//console

This message is automatically generated.
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109187#comment-14109187 ]

Hadoop QA commented on HBASE-11813:
---

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12664156/11813v2.master.txt
  against trunk revision .
  ATTACHMENT ID: 12664156

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

    {color:red}-1 Anti-pattern{color}. The patch appears to have anti-pattern where BYTES_COMPARATOR was omitted:
     +NavigableMap> m = new TreeMap>();.

    {color:red}-1 javac{color}. The patch appears to cause mvn compile goal to fail.

    Compilation errors resume:
    [ERROR] COMPILATION ERROR :
    [ERROR] /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-common/src/test/java/org/apache/hadoop/hbase/TestCellUtil.java:[79,10] error: TestCellUtil.TestCell is not abstract and does not override abstract method getTagsLength() in Cell
    [ERROR] /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-common/src/test/java/org/apache/hadoop/hbase/TestCellUtil.java:[186,17] error: getTagsLength() in TestCellUtil.TestCell cannot implement getTagsLength() in Cell
    [ERROR] return type short is not compatible with int
    [ERROR] /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-common/src/test/java/org/apache/hadoop/hbase/TestCellUtil.java:[191,4] error: method does not override or implement a method from a supertype
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:testCompile (default-testCompile) on project hbase-common: Compilation failure: Compilation failure:
    [ERROR] /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-common/src/test/java/org/apache/hadoop/hbase/TestCellUtil.java:[79,10] error: TestCellUtil.TestCell is not abstract and does not override abstract method getTagsLength() in Cell
    [ERROR] /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-common/src/test/java/org/apache/hadoop/hbase/TestCellUtil.java:[186,17] error: getTagsLength() in TestCellUtil.TestCell cannot implement getTagsLength() in Cell
    [ERROR] return type short is not compatible with int
    [ERROR] /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-common/src/test/java/org/apache/hadoop/hbase/TestCellUtil.java:[185,4] error: method does not override or implement a method from a supertype
    [ERROR] /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-common/src/test/java/org/apache/hadoop/hbase/TestCellUtil.java:[191,4] error: method does not override or implement a method from a supertype
    [ERROR] -> [Help 1]
    [ERROR]
    [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
    [ERROR] Re-run Maven using the -X switch to enable full debug logging.
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
    [ERROR]
    [ERROR] After correcting the problems, you can resume the build with the command
    [ERROR]   mvn -rf :hbase-common

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10565//console

This message is automatically generated.
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109055#comment-14109055 ]

Johannes Schaback commented on HBASE-11813:
---

Quick update from me. We have had Stack's patch running for a day now. The StackOverflowErrors did not occur again, all RSs are operational, and the cluster did not hang. We will keep our home-compiled HBase running until the patch makes it into an official release.

Our client code still queries very large batches at times. Random access of 100k records in one query is likely for us. Large batches were the original cause of this issue. With the recursion issue resolved, we have now observed two non-dramatic cases where the client timed out and a ChannelClosedException was thrown on the server side without killing the RS. Stack and I suspect that a large query is taking too long to process/transmit, but we haven't figured out the root cause yet (the region is consistent). We will adjust our logging a bit and keep an eye on it. Besides these two cases, nothing has happened so far.

Thank you all for the quick fix and the responsiveness!
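To make the link between batch size and stack depth concrete, the sketch below mirrors the control flow of the advance() quoted in the issue description and counts how deep the recursion goes. It assumes, as that code suggests, that every scanner in a row that yields no cells adds one stack frame; CountingScanner, RecursiveScanner and maxDepthFor are hypothetical names, not HBase classes, and the run lengths are kept small so the demo itself never overflows.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class RecursionDepthDemo {
    /** Hypothetical stand-in for a per-result scanner: yields a fixed number of cells. */
    static final class CountingScanner {
        private int remaining;
        CountingScanner(int cells) { this.remaining = cells; }
        boolean advance() { return remaining-- > 0; }
    }

    /** Same control flow as the quoted advance(), instrumented with a depth counter. */
    static final class RecursiveScanner {
        private final Iterator<CountingScanner> iterator;
        private CountingScanner current = null;
        private int depth = 0;
        int maxDepth = 0;

        RecursiveScanner(List<CountingScanner> scanners) {
            this.iterator = scanners.iterator();
        }

        boolean advance() {
            depth++;
            maxDepth = Math.max(maxDepth, depth);
            try {
                if (current == null) {
                    if (!iterator.hasNext()) return false;
                    current = iterator.next();
                }
                if (current.advance()) return true;
                current = null;
                return advance(); // one extra frame per exhausted scanner
            } finally {
                depth--;
            }
        }
    }

    /** Deepest recursion reached by one advance() call over a run of empty scanners. */
    static int maxDepthFor(int emptyScanners) {
        List<CountingScanner> scanners = new ArrayList<>();
        for (int i = 0; i < emptyScanners; i++) scanners.add(new CountingScanner(0));
        scanners.add(new CountingScanner(1));
        RecursiveScanner scanner = new RecursiveScanner(scanners);
        scanner.advance();
        return scanner.maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(maxDepthFor(10));   // prints 11
        System.out.println(maxDepthFor(1000)); // prints 1001
    }
}
```

Under this assumption the depth grows linearly with the number of consecutive empty scanners, which would be consistent with a batch of around 100k lookups being enough to exhaust a handler thread's stack.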
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108755#comment-14108755 ]

ramkrishna.s.vasudevan commented on HBASE-11813:
---

bq. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.TestCellUtil

It seems the test failure is related to the patch.
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108679#comment-14108679 ]

Hadoop QA commented on HBASE-11813:
---

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12664071/11813.master.txt
  against trunk revision .
  ATTACHMENT ID: 12664071

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 6 warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100

    {color:green}+1 site{color}. The mvn site goal succeeds with this patch.

    {color:red}-1 core tests{color}. The patch failed these unit tests:
        org.apache.hadoop.hbase.TestCellUtil

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10557//console

This message is automatically generated.
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108647#comment-14108647 ]

Hadoop QA commented on HBASE-11813:
---

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12664068/catch_all_exceptions.txt
against trunk revision .
ATTACHMENT ID: 12664068

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10556//console

This message is automatically generated.

> CellScanner#advance may infinitely recurse
> --
>
> Key: HBASE-11813
> URL: https://issues.apache.org/jira/browse/HBASE-11813
> Project: HBase
> Issue Type: Bug
> Reporter: Andrew Purtell
> Assignee: stack
> Priority: Blocker
> Fix For: 0.99.0, 2.0.0, 0.98.6
>
> Attachments: 11813.098.txt, 11813.098.txt, 11813.master.txt,
> catch_all_exceptions.txt
>
>
> On user@hbase, johannes.schab...@visual-meta.com reported:
> {quote}
> we face a serious issue with our HBase production cluster for two days now.
> Every couple minutes, a random RegionServer gets stuck and does not process
> any requests. In addition this causes the other RegionServers to freeze
> within a minute which brings down the entire cluster. Stopping the affected
> RegionServer unblocks the cluster and everything comes back to normal.
> {quote}
> Subsequent troubleshooting reveals that RPC is getting stuck because we are
> losing RPC handlers. In the .out files we have this:
> {noformat}
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> [...]
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
> java.lang.StackOverflowError
> {noformat}
> That is the anonymous CellScanner instance we create from
> CellUtil#createCellScanner:
> {code}
> return new CellScanner() {
>   private final Iterator<? extends CellScannable> iterator =
>       cellScannerables.iterator();
>   private CellScanner cellScanner = null;
>   @Override
>   public Cell current() {
>     return this.cellScanner != null? this.cellScanner.current(): null;
>   }
>   @Override
>   public boolean advance() throws IOException {
>     if (this.cellScanner == null) {
>       if (!this.iterator.hasNext()) return false;
>       this.cellScanner = this.iterator.next().cellScanner();
>     }
>     if (this.cellScanner.advance()) return true;
>     this.cellScanner = null;
> --->return advance();
>   }
> };
> {code}
> That final return statement is the immediate problem.
> We should also fix this so the RegionServer aborts if it loses a handler to
> an Error.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
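For readers following the thread, here is a minimal sketch of how the tail recursion in the quoted advance() could be written as a loop instead. It is not the attached patch: the Scanner interface, the over() and flatten() helpers, and the use of String in place of Cell are hypothetical stand-ins so the example runs without HBase on the classpath. The assumption being illustrated is that every exhausted or empty inner scanner costs the recursive version one stack frame, so a long enough run of them can end in StackOverflowError, while the loop uses constant stack.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class FlattenDemo {
  /** Hypothetical stand-in for CellScanner; cells are plain Strings here. */
  interface Scanner {
    String current();
    boolean advance();
  }

  /** Scanner over one list, standing in for the cells of a single CellScannable. */
  static Scanner over(final List<String> cells) {
    return new Scanner() {
      private final Iterator<String> it = cells.iterator();
      private String cur = null;
      @Override public String current() { return cur; }
      @Override public boolean advance() {
        if (!it.hasNext()) { cur = null; return false; }
        cur = it.next();
        return true;
      }
    };
  }

  /** Flattens many scanners; exhausted ones are skipped in a loop, not by recursing. */
  static Scanner flatten(final List<Scanner> scanners) {
    return new Scanner() {
      private final Iterator<Scanner> iterator = scanners.iterator();
      private Scanner scanner = null;
      @Override public String current() {
        return scanner != null ? scanner.current() : null;
      }
      @Override public boolean advance() {
        while (true) {
          if (scanner == null) {
            if (!iterator.hasNext()) return false;
            scanner = iterator.next();
          }
          if (scanner.advance()) return true;
          scanner = null; // exhausted: go round again for the next scanner
        }
      }
    };
  }

  /** Counts the cells seen after a run of emptyRuns empty scanners plus one real cell. */
  static int countAfterEmptyRun(int emptyRuns) {
    List<Scanner> scanners = new ArrayList<>();
    for (int i = 0; i < emptyRuns; i++) {
      scanners.add(over(Collections.<String>emptyList()));
    }
    scanners.add(over(Collections.singletonList("cell")));
    Scanner s = flatten(scanners);
    int n = 0;
    while (s.advance()) n++;
    return n;
  }

  public static void main(String[] args) {
    System.out.println(countAfterEmptyRun(200_000)); // prints 1
  }
}
```

The loop body is the original method body with the trailing `return advance();` replaced by another pass through `while (true)`, so the behaviour for callers is intended to be unchanged.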
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108443#comment-14108443 ]

Andrew Purtell commented on HBASE-11813:

Thanks for reporting back [~Schabby]. Please let us know if this is still looking good after a day or so.
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108429#comment-14108429 ]

Johannes Schaback commented on HBASE-11813:
---

The patch is live in our production cluster for about 2 hours now. So far no RS crash...
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108391#comment-14108391 ]

Johannes Schaback commented on HBASE-11813:
---

Ah, never mind. I believe I just have to apply the patch attached to this bug ticket.
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108387#comment-14108387 ]

Johannes Schaback commented on HBASE-11813:
---

Sorry for my asking, but where do I get the patch from exactly? git://git.apache.org/hbase.git has the last commits about 15 hours ago.

Thanks, Johannes
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108382#comment-14108382 ]

Qiang Tian commented on HBASE-11813:

Oops, it already points to line 210 (got a fever, brain is not so clear).
Thanks Stack
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108303#comment-14108303 ]

Hadoop QA commented on HBASE-11813:
---

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12663910/11813.master.txt
against trunk revision .
ATTACHMENT ID: 12663910

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:red}-1 javadoc{color}. The javadoc tool appears to have generated 6 warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100

{color:green}+1 site{color}. The mvn site goal succeeds with this patch.

{color:red}-1 core tests{color}. The patch failed these unit tests:

{color:red}-1 core zombie tests{color}. There are 1 zombie test(s):
at org.apache.hadoop.hbase.http.TestHttpServerLifecycle.testStartedServerWithRequestLog(TestHttpServerLifecycle.java:92)

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//console

This message is automatically generated.
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108300#comment-14108300 ]

stack commented on HBASE-11813:
---

[~Schabby] Suggest you enable DEBUG. The patch below should catch the overflow error, dump some detail on the particular invocation, and allow you to keep going:

diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
index 31484bb..da2afe0 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java
@@ -136,9 +136,9 @@ public class CallRunner {
           "this means that the server was processing a " +
           "request but the client went away. The error message was: " +
           cce.getMessage());
-    } catch (Exception e) {
+    } catch (Throwable e) {
       RpcServer.LOG.warn(Thread.currentThread().getName()
-          + ": caught: " + StringUtils.stringifyException(e));
+          + ": caught: " + StringUtils.stringifyException(e) + " call=" + getCall());
     }
   }

No guarantees! I tried it and it works when there are no problems.
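A small standalone sketch of why the change from catch (Exception) to catch (Throwable) in the CallRunner diff matters. StackOverflowError extends Error, not Exception, so a catch (Exception) block never sees it and the error keeps propagating up the handler thread. The class and method names below are made up for illustration and are not HBase code.

```java
public class CatchDemo {
  /** Recurses with no base case, so the JVM eventually throws StackOverflowError. */
  static int recurse(int depth) {
    return recurse(depth + 1) + 1;
  }

  /** Mirrors the pre-patch shape: the inner handler only catches Exception. */
  static String withCatchException() {
    try {
      try {
        recurse(0);
        return "no error";
      } catch (Exception e) {
        return "caught by catch (Exception)";
      }
    } catch (StackOverflowError err) {
      // Only reached because the inner catch did not match the Error.
      return "escaped catch (Exception)";
    }
  }

  /** Mirrors the patched shape: Throwable also covers Error subclasses. */
  static String withCatchThrowable() {
    try {
      recurse(0);
      return "no error";
    } catch (Throwable t) {
      return "caught " + t.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    System.out.println(withCatchException()); // prints "escaped catch (Exception)"
    System.out.println(withCatchThrowable()); // prints "caught StackOverflowError"
  }
}
```

Catching Throwable keeps the thread alive long enough to log the failing call, which is the diagnostic purpose described in the comment above. Whether a server should carry on after an Error, or abort as the issue description suggests, is a separate decision that this sketch does not address.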
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108251#comment-14108251 ]

Qiang Tian commented on HBASE-11813:

I'd suspect this one:
{code}
  /**
   * Flatten the map of cells out under the CellScanner
   * @param map Map of Cell Lists; for example, the map of families to Cells that is used
   * inside Put, etc., keeping Cells organized by family.
   * @return CellScanner interface over cellIterable
   */
  public static CellScanner createCellScanner(final NavigableMap<byte [], List<Cell>> map) {
    return new CellScanner() {
      private final Iterator<Entry<byte[], List<Cell>>> entries = map.entrySet().iterator();
      private Iterator<Cell> currentIterator = null;
      private Cell currentCell;

      @Override
      public Cell current() {
        return this.currentCell;
      }

      @Override
      public boolean advance() {
        if (this.currentIterator == null) {
          if (!this.entries.hasNext()) return false;
          this.currentIterator = this.entries.next().getValue().iterator();
        }
        if (this.currentIterator.hasNext()) {
          this.currentCell = this.currentIterator.next();
          return true;
        }
        this.currentCell = null;
        this.currentIterator = null;
        return advance();
      }
    };
  }
{code}
It looks like the one Andrew mentioned would not trigger the advance method on the server side, while the other one is widely used in server-side code paths (coprocessor or endpoint related).

> CellScanner#advance may infinitely recurse > -- > > Key: HBASE-11813 > URL: https://issues.apache.org/jira/browse/HBASE-11813 > Project: HBase > Issue Type: Bug >Reporter: Andrew Purtell >Priority: Blocker > Fix For: 0.99.0, 2.0.0, 0.98.6 > > > On user@hbase, johannes.schab...@visual-meta.com reported: > {quote} > we face a serious issue with our HBase production cluster for two days now. > Every couple minutes, a random RegionServer gets stuck and does not process > any requests. In addition this causes the other RegionServers to freeze > within a minute which brings down the entire cluster.
Stopping the affected > RegionServer unblocks the cluster and everything comes back to normal. > {quote} > Subsequent troubleshooting reveals that RPC is getting stuck because we are > losing RPC handlers. In the .out files we have this: > {noformat} > Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020" > java.lang.StackOverflowError > at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) > at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) > at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) > at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) > [...] > Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020" > java.lang.StackOverflowError > Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020" > java.lang.StackOverflowError > {noformat} > That is the anonymous CellScanner 
instance we create from > CellUtil#createCellScanner: > {code} > return new CellScanner() { > private final Iterator iterator = > cellScannerables.iterator(); > private CellScanner cellScanner = null; > @Override > public Cell current() { > return this.cellScanner != null? this.cellScanner.current(): null; > } > @Override > public boolean advance() throws IOException { > if (this.cellScanner == null) { > if (!this.iterator.hasNext()) return false; > this.cellScanner = this.iterator.next().cellScanner(); > } > if (this.cellScanner.advance()) return true
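For illustration only, here is a minimal sketch of how the map-flattening advance() above could be written as a loop, so that stack depth does not grow with the number of empty or exhausted lists. StringScanner, MapFlattener, and the use of String in place of Cell are stand-ins invented for this sketch so that it compiles without HBase; this is not the committed patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for CellScanner, with String playing the role of Cell.
interface StringScanner {
  String current();
  boolean advance();
}

final class MapFlattener {
  /** Loop-based variant of the recursive advance(): no extra frame per exhausted list. */
  static StringScanner flatten(final NavigableMap<String, List<String>> map) {
    return new StringScanner() {
      private final Iterator<Map.Entry<String, List<String>>> entries =
          map.entrySet().iterator();
      private Iterator<String> currentIterator = null;
      private String currentCell = null;

      @Override
      public String current() {
        return this.currentCell;
      }

      @Override
      public boolean advance() {
        while (true) {
          if (this.currentIterator == null) {
            if (!this.entries.hasNext()) return false;
            this.currentIterator = this.entries.next().getValue().iterator();
          }
          if (this.currentIterator.hasNext()) {
            this.currentCell = this.currentIterator.next();
            return true;
          }
          // Current list is empty or exhausted: drop it and loop to the next entry.
          this.currentCell = null;
          this.currentIterator = null;
        }
      }
    };
  }
}

final class MapFlattenerDemo {
  public static void main(String[] args) {
    NavigableMap<String, List<String>> families = new TreeMap<>();
    families.put("f1", Arrays.asList("a", "b"));
    families.put("f2", new ArrayList<String>()); // an empty family is skipped
    families.put("f3", Arrays.asList("c"));
    StringScanner scanner = MapFlattener.flatten(families);
    while (scanner.advance()) {
      System.out.println(scanner.current());
    }
  }
}
```

The loop keeps the same order of checks as the recursive version; the only change is that moving on to the next entry is a loop iteration rather than a nested call.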
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108197#comment-14108197 ]

Andrew Purtell commented on HBASE-11813:

Not that a scanner that never returns is better but can we do this without recursion?

> CellScanner#advance may infinitely recurse
> ------------------------------------------
>
>                 Key: HBASE-11813
>                 URL: https://issues.apache.org/jira/browse/HBASE-11813
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Priority: Blocker
>             Fix For: 0.99.0, 2.0.0, 0.98.6
>
>
> On user@hbase, johannes.schab...@visual-meta.com reported:
> {quote}
> we face a serious issue with our HBase production cluster for two days now.
> Every couple minutes, a random RegionServer gets stuck and does not process
> any requests. In addition this causes the other RegionServers to freeze
> within a minute which brings down the entire cluster. Stopping the affected
> RegionServer unblocks the cluster and everything comes back to normal.
> {quote}
> Subsequent troubleshooting reveals that RPC is getting stuck because we are
> losing RPC handlers. In the .out files we have this:
> {noformat}
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> [...]
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
> java.lang.StackOverflowError
> {noformat}
> That is the anonymous CellScanner instance we create from
> CellUtil#createCellScanner:
> {code}
>     return new CellScanner() {
>       private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
>       private CellScanner cellScanner = null;
>       @Override
>       public Cell current() {
>         return this.cellScanner != null? this.cellScanner.current(): null;
>       }
>       @Override
>       public boolean advance() throws IOException {
>         if (this.cellScanner == null) {
>           if (!this.iterator.hasNext()) return false;
>           this.cellScanner = this.iterator.next().cellScanner();
>         }
>         if (this.cellScanner.advance()) return true;
>         this.cellScanner = null;
> --->    return advance();
>       }
>     };
> {code}
> That final return statement is the immediate problem.
> We should also fix this so the RegionServer aborts if it loses a handler to
> an Error.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
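The description's closing suggestion, aborting when a handler thread is lost to an Error, might be sketched with plain JDK facilities as below. AbortHook and HandlerGuard are hypothetical names made up for this example; how the RegionServer actually wires its handlers and its abort path is not shown in this thread, so treat this as one possible shape rather than the fix that was applied.

```java
// Hypothetical callback; stands in for whatever the server uses to shut itself down.
interface AbortHook {
  void abort(String why, Throwable cause);
}

final class HandlerGuard {
  /** Starts a handler thread; if it dies from an Error, the abort hook is invoked. */
  static Thread startHandler(String name, Runnable work, final AbortHook hook) {
    Thread t = new Thread(work, name);
    t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
      @Override
      public void uncaughtException(Thread thread, Throwable e) {
        // Only Errors (StackOverflowError, OutOfMemoryError, ...) trigger an abort here;
        // what to do with other uncaught exceptions is left open in this sketch.
        if (e instanceof Error) {
          hook.abort("handler " + thread.getName() + " died", e);
        }
      }
    });
    t.start();
    return t;
  }
}

final class HandlerGuardDemo {
  public static void main(String[] args) throws InterruptedException {
    Thread t = HandlerGuard.startHandler(
        "handler-0",
        () -> { throw new StackOverflowError("simulated"); },
        (why, cause) -> System.err.println("ABORT: " + why + " (" + cause + ")"));
    t.join();
  }
}
```

The point of the design is that a lost handler becomes visible immediately instead of silently shrinking the handler pool until RPC stalls.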
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108190#comment-14108190 ]

stack commented on HBASE-11813:
-------------------------------

The code has been in HBase a good while now. The issue, I think, is the line this.cellScanner = this.iterator.next().cellScanner(); where the iterator never finishes. I cannot reproduce it locally. It is some particular combination of cell count and lists of CellScanners that triggers it.
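One property of the quoted advance() can be checked without a cluster: every scanner that turns out to be empty or exhausted costs one additional stack frame before a cell is returned. The sketch below copies the control flow from the issue description and adds a depth counter. CountingScanner and RecursiveConcat are hypothetical stand-ins written for this example, and nothing in the thread establishes that a long run of empty scanners is what happened on the reporter's cluster.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a CellScanner over a fixed number of cells.
final class CountingScanner {
  private int remaining;

  CountingScanner(int cells) {
    this.remaining = cells;
  }

  boolean advance() {
    if (remaining == 0) return false;
    remaining--;
    return true;
  }
}

// Same control flow as the advance() quoted in the issue, plus depth bookkeeping.
final class RecursiveConcat {
  private final Iterator<CountingScanner> iterator;
  private CountingScanner scanner = null;
  private int depth = 0;
  int maxDepth = 0;

  RecursiveConcat(List<CountingScanner> scanners) {
    this.iterator = scanners.iterator();
  }

  boolean advance() {
    depth++;
    maxDepth = Math.max(maxDepth, depth);
    try {
      if (scanner == null) {
        if (!iterator.hasNext()) return false;
        scanner = iterator.next();
      }
      if (scanner.advance()) return true;
      scanner = null;
      return advance(); // one more frame for every exhausted scanner
    } finally {
      depth--;
    }
  }
}

final class RecursiveConcatDemo {
  public static void main(String[] args) {
    List<CountingScanner> scanners = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      scanners.add(new CountingScanner(0)); // 1000 empty scanners in a row
    }
    scanners.add(new CountingScanner(1));   // then a single cell
    RecursiveConcat concat = new RecursiveConcat(scanners);
    concat.advance();
    System.out.println("max depth: " + concat.maxDepth); // prints "max depth: 1001"
  }
}
```

With a long enough run of empty scanners the same call would exhaust the thread stack, which matches the repeated CellUtil$1.advance frames in the reported trace.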
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108143#comment-14108143 ]

Johannes Schaback commented on HBASE-11813:
-------------------------------------------

Great, we are eagerly looking forward to the patch :)
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108112#comment-14108112 ]

Andrew Purtell commented on HBASE-11813:

Ping [~stack], this came in on HBASE-7899