[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108303#comment-14108303 ]
Hadoop QA commented on HBASE-11813:
-----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12663910/11813.master.txt
  against trunk revision .
  ATTACHMENT ID: 12663910

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 6 warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100

    {color:green}+1 site{color}. The mvn site goal succeeds with this patch.

    {color:red}-1 core tests{color}. The patch failed these unit tests:

    {color:red}-1 core zombie tests{color}. There are 1 zombie test(s):
        at org.apache.hadoop.hbase.http.TestHttpServerLifecycle.testStartedServerWithRequestLog(TestHttpServerLifecycle.java:92)

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10550//console

This message is automatically generated.

> CellScanner#advance may infinitely recurse
> ------------------------------------------
>
>                 Key: HBASE-11813
>                 URL: https://issues.apache.org/jira/browse/HBASE-11813
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.99.0, 2.0.0, 0.98.6
>
>         Attachments: 11813.098.txt, 11813.098.txt, 11813.master.txt
>
>
> On user@hbase, johannes.schab...@visual-meta.com reported:
> {quote}
> we face a serious issue with our HBase production cluster for two days now.
> Every couple minutes, a random RegionServer gets stuck and does not process
> any requests. In addition this causes the other RegionServers to freeze
> within a minute which brings down the entire cluster. Stopping the affected
> RegionServer unblocks the cluster and everything comes back to normal.
> {quote}
> Subsequent troubleshooting reveals that RPC is getting stuck because we are
> losing RPC handlers. In the .out files we have this:
> {noformat}
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>         at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> [...]
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
> java.lang.StackOverflowError
> {noformat}
> That is the anonymous CellScanner instance we create from
> CellUtil#createCellScanner:
> {code}
>     return new CellScanner() {
>       private final Iterator<? extends CellScannable> iterator =
> cellScannerables.iterator();
>       private CellScanner cellScanner = null;
>       @Override
>       public Cell current() {
>         return this.cellScanner != null? this.cellScanner.current(): null;
>       }
>       @Override
>       public boolean advance() throws IOException {
>         if (this.cellScanner == null) {
>           if (!this.iterator.hasNext()) return false;
>           this.cellScanner = this.iterator.next().cellScanner();
>         }
>         if (this.cellScanner.advance()) return true;
>         this.cellScanner = null;
> --->    return advance();
>       }
>     };
> {code}
> That final return statement is the immediate problem.
> We should also fix this so the RegionServer aborts if it loses a handler to
> an Error.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
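
For illustration only, here is a minimal sketch of how the recursive tail call in the quoted advance() could be replaced with a loop, so that a long run of empty scanners no longer deepens the call stack. It is not taken from the attached patches. SimpleScanner, SimpleScannable, of(), concat() and count() are hypothetical stand-ins invented for this sketch (a String takes the place of a Cell); they are not HBase APIs.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

final class ConcatScannerSketch {
  /** Hypothetical stand-in for CellScanner; a String plays the role of a Cell. */
  interface SimpleScanner {
    String current();
    boolean advance() throws IOException;
  }

  /** Hypothetical stand-in for CellScannable. */
  interface SimpleScannable {
    SimpleScanner scanner();
  }

  /** Builds a scannable over a fixed (possibly empty) list of cells. */
  static SimpleScannable of(String... cells) {
    final List<String> list = Arrays.asList(cells);
    return () -> new SimpleScanner() {
      private int index = -1;
      @Override
      public String current() {
        return index >= 0 && index < list.size() ? list.get(index) : null;
      }
      @Override
      public boolean advance() {
        if (index < list.size()) index++;
        return index < list.size();
      }
    };
  }

  /** Same shape as the quoted scanner, but advance() loops instead of recursing. */
  static SimpleScanner concat(final Iterable<? extends SimpleScannable> scannables) {
    return new SimpleScanner() {
      private final Iterator<? extends SimpleScannable> iterator = scannables.iterator();
      private SimpleScanner scanner = null;
      @Override
      public String current() {
        return this.scanner != null ? this.scanner.current() : null;
      }
      @Override
      public boolean advance() throws IOException {
        while (true) {
          if (this.scanner == null) {
            if (!this.iterator.hasNext()) return false;
            this.scanner = this.iterator.next().scanner();
          }
          if (this.scanner.advance()) return true;
          // Current scanner is exhausted: drop it and try the next one in the loop.
          this.scanner = null;
        }
      }
    };
  }

  /** Counts the cells a scanner yields, wrapping the checked exception. */
  static int count(SimpleScanner s) {
    try {
      int n = 0;
      while (s.advance()) n++;
      return n;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // One million empty scannables followed by one holding two cells.
    List<SimpleScannable> many = new ArrayList<>(Collections.nCopies(1_000_000, of()));
    many.add(of("a", "b"));
    System.out.println(count(concat(many)));
  }
}
```

With the recursive form, each exhausted or empty inner scanner adds one stack frame, so an input like the one in main would be expected to overflow a default-sized thread stack. The loop keeps the depth constant regardless of how many empty scanners appear in a row.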