[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158891#comment-14158891 ]

Hudson commented on HBASE-12166:

FAILURE: Integrated in HBase-TRUNK #5619 (See [https://builds.apache.org/job/HBase-TRUNK/5619/])
HBASE-12166 TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork (jxiang: rev ef35182f1fcb8e23566e66211f78eb5200168122)
* hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestRegionReplicaReplicationEndpoint.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/coordination/ZKSplitLogManagerCoordination.java

> TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
> ---------------------------------------------------------------
>
>                 Key: HBASE-12166
>                 URL: https://issues.apache.org/jira/browse/HBASE-12166
>             Project: HBase
>          Issue Type: Bug
>          Components: test, wal
>            Reporter: stack
>            Assignee: Jimmy Xiang
>             Fix For: 2.0.0, 0.99.1
>
>         Attachments: 12166.txt, hbase-12166.patch, hbase-12166_v2.patch, log.txt
>
>
> See https://builds.apache.org/job/PreCommit-HBASE-Build/11204//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testMasterStartsUpWithLogReplayWork/
> The namespace region gets stuck. It is never 'recovered' even though we have finished log splitting. Here is the main exception:
> {code}
> 4941 2014-10-03 02:00:36,862 DEBUG [B.defaultRpcServer.handler=1,queue=0,port=37113] ipc.CallRunner(111): B.defaultRpcServer.handler=1,queue=0,port=37113: callId: 211 service: ClientService methodName: Get size: 99 connection: 67.195.81.144:44526
> 4942 org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: hbase:namespace,,1412301462277.eba5d23de65f2718715eeb22edf7edc2. is recovering
> 4943   at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:6058)
> 4944   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2086)
> 4945   at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2072)
> 4946   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:5014)
> 4947   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4988)
> 4948   at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1690)
> 4949   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30418)
> 4950   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020)
> 4951   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
> 4952   at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
> 4953   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
> 4954   at java.lang.Thread.run(Thread.java:744)
> {code}
> See how we've finished log splitting long time previous:
> {code}
> 2014-10-03 01:57:48,129 INFO [M_LOG_REPLAY_OPS-asf900:37113-1] master.SplitLogManager(294): finished splitting (more than or equal to) 197337 bytes in 1 log files in [hdfs://localhost:49601/user/jenkins/hbase/WALs/asf900.gq1.ygridcore.net,40732,1412301461887-splitting] in 379ms
> {code}
> If I grep for the deleting of znodes on recovery, which is when we set the recovering flag to false, I see a bunch of regions but not my namespace one:
> 2014-10-03 01:57:47,330 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/1588230740 znode deleted. Region: 1588230740 completes recovery.
> 2014-10-03 01:57:48,119 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/adfdcf958dd958f0e2ce59072ce2209d znode deleted. Region: adfdcf958dd958f0e2ce59072ce2209d completes recovery.
> 2014-10-03 01:57:48,121 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/41d438848305831b61d708a406d5ecde znode deleted. Region: 41d438848305831b61d708a406d5ecde completes recovery.
> 2014-10-03 01:57:48,122 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/6a7cada80de2ae5d774fe8cd33bd4cda znode deleted. Region: 6a7cada80de2ae5d774fe8cd33bd4cda completes recovery.
> 2014-10-03 01:57:48,124 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/65451bd5b38bd16a31e25b62b3305533 znode deleted. Region: 65451bd5b38bd16a31e25b62b3305533 completes recovery.
> 2014-10-03 01:57:48,125 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/07afdc3748894cf2b56e0075272a95a0 znode deleted. Region: 07afdc3748894cf2b56e0075272a95a0 completes recovery.
> 2014-10-03 01:57:48,126 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/a4337ad2874ee7e599ca2344fce21583 znode deleted. Region: a4337ad2874ee7e599ca2344fce21583 completes recovery.
> 2014-10-03 01:57:48,128 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/9d91d6eafe260ce33e8d7d23ccd13192 znode deleted. Region: 9d91d6eafe260ce33e8d7d23ccd13192 completes recovery.
> This would seem to indicate that w
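The mechanism behind the stuck region is sketched minimally below (illustrative Java only, not the actual HRegion/RecoveringRegionWatcher source; class and method names here are made up): every region operation first checks a per-region 'recovering' flag, and the flag is only cleared when the region's znode under /hbase/recovering-regions is deleted. If that deletion never happens, as for hbase:namespace above, every Get keeps failing with RegionInRecoveryException until the test times out.

{code}
// Minimal sketch of the recovery gate (hypothetical names, not HBase source).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RegionInRecoveryException extends Exception {
  RegionInRecoveryException(String msg) { super(msg); }
}

class RecoveringRegions {
  // encoded region name -> still recovering?
  private final Map<String, Boolean> recovering = new ConcurrentHashMap<>();

  void markRecovering(String region) {
    recovering.put(region, Boolean.TRUE);
  }

  // Analogous to the check at the top of HRegion.startRegionOperation():
  // reject client operations while the region is marked recovering.
  void startRegionOperation(String region) throws RegionInRecoveryException {
    if (Boolean.TRUE.equals(recovering.get(region))) {
      throw new RegionInRecoveryException(region + " is recovering");
    }
  }

  // Analogous to RecoveringRegionWatcher: deletion of the
  // /hbase/recovering-regions/<region> znode clears the flag.
  void znodeDeleted(String region) {
    recovering.remove(region);
  }
}
{code}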
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158839#comment-14158839 ]

Hudson commented on HBASE-12166:

FAILURE: Integrated in HBase-1.0 #273 (See [https://builds.apache.org/job/HBase-1.0/273/])
HBASE-12166 TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork (jxiang: rev 12fd6d2a2495533726d84759fd6cc754b5f7eb97)
* hbase-server/src/main/java/org/apache/hadoop/hbase/coordination/ZKSplitLogManagerCoordination.java
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158762#comment-14158762 ]

stack commented on HBASE-12166:
-------------------------------

+1
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158760#comment-14158760 ]

Hadoop QA commented on HBASE-12166:
-----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672876/hbase-12166_v2.patch
against trunk revision .

ATTACHMENT ID: 12672876

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified tests.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100

{color:green}+1 site{color}. The mvn site goal succeeds with this patch.

{color:red}-1 core tests{color}. The patch failed these unit tests:

{color:red}-1 core zombie tests{color}. There are 2 zombie test(s):
  at org.apache.hadoop.hbase.io.crypto.TestCipherProvider.testCustomProvider(TestCipherProvider.java:132)
  at org.apache.karaf.main.MainStartTest.testStopWithTimeout(MainStartTest.java:94)

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11214//console

This message is automatically generated.
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158679#comment-14158679 ]

Jimmy Xiang commented on HBASE-12166:
--------------------------------------

[~stack], good catch! Unbelievable!
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158673#comment-14158673 ]

stack commented on HBASE-12166:
-------------------------------

There is another bomb in that same class [~jxiang]
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158667#comment-14158667 ]

Jeffrey Zhong commented on HBASE-12166:
----------------------------------------

[~jxiang] Good catch! Looks good to me (+1). Better to change the variable name listSize2 to tmpFailedServerSize, though.
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158663#comment-14158663 ]

stack commented on HBASE-12166:
-------------------------------

[~jxiang] Yeah, that's it. I just ran into it (didn't believe it...). The test passed for me when I made the change. +1, and +1 to upping the timeout (am checking other uses of 'listSize' -- smile).
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158649#comment-14158649 ]

Jimmy Xiang commented on HBASE-12166:
--------------------------------------

TestRegionReplicaReplicationEndpoint is ok locally. I can increase the timeout a little at checkin (from 1000 to 6000?).
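For context, a timeout bump like this usually means widening a polling wait in the test. A generic, hedged sketch of that kind of change (hypothetical names, not the actual TestRegionReplicaReplicationEndpoint code):

{code}
// Hypothetical illustration: a polling wait whose deadline moves
// from 1000 ms to 6000 ms to tolerate slow build machines.
public class WaitSketch {
  interface Condition { boolean met(); }

  static boolean waitFor(Condition c, long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (c.met()) return true;
      Thread.sleep(100); // poll interval
    }
    return c.met();
  }

  public static void main(String[] args) throws InterruptedException {
    long start = System.currentTimeMillis();
    // was: waitFor(cond, 1000) -- too tight on a loaded Jenkins box
    boolean ok = waitFor(() -> System.currentTimeMillis() - start > 500, 6000);
    System.out.println("condition met: " + ok);
  }
}
{code}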
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158645#comment-14158645 ]

Jimmy Xiang commented on HBASE-12166:
--------------------------------------

[~stack], [~jeffreyz], could you take a look at the patch? Thanks.
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158641#comment-14158641 ]

Jimmy Xiang commented on HBASE-12166:
--------------------------------------

TestMasterObserver should be fixed by the addendum of HBASE-12167.
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158621#comment-14158621 ]

Hadoop QA commented on HBASE-12166:
-----------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12672838/hbase-12166.patch
against trunk revision .

ATTACHMENT ID: 12672838

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100

{color:green}+1 site{color}. The mvn site goal succeeds with this patch.

{color:red}-1 core tests{color}. The patch failed these unit tests:
  org.apache.hadoop.hbase.coprocessor.TestMasterObserver
  org.apache.hadoop.hbase.replication.regionserver.TestRegionReplicaReplicationEndpoint

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/11211//console

This message is automatically generated.
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158515#comment-14158515 ] Jimmy Xiang commented on HBASE-12166:
I think I found the cause. In ZKSplitLogManagerCoordination#removeRecoveringRegions:
{noformat}
listSize = failedServers.size();
for (int j = 0; j < listSize; j++) {
{noformat}
listSize is reassigned here, but the same variable also bounds the enclosing loop. That's not a bug, it is a hidden bomb :)
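To see why the reassignment matters, here is a minimal, self-contained sketch of the pattern (the names and the outer-loop shape are assumptions for illustration; only the two inner lines appear in the snippet above):
{code}
import java.util.Arrays;
import java.util.List;

public class HiddenBombDemo {
  public static void main(String[] args) {
    // Stand-ins for the collections in removeRecoveringRegions.
    List<String> regions = Arrays.asList("meta", "user-region", "namespace");
    List<String> failedServers = Arrays.asList("localhost,61562,1412356895859");

    int processed = 0;
    int listSize = regions.size();       // outer loop bound: 3
    for (int i = 0; i < listSize; i++) {
      processed++;
      listSize = failedServers.size();   // clobbered to 1: outer loop exits after i == 0
      for (int j = 0; j < listSize; j++) {
        // per-failed-server work (e.g. cleaning up recovering-region state)
      }
    }
    // Prints "processed 1 of 3": regions after the first are silently
    // skipped, so their recovering flags are never cleared.
    System.out.println("processed " + processed + " of " + regions.size());
  }
}
{code}
Giving the inner loop its own bound variable, or inlining failedServers.size() into the loop condition, defuses the bomb.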
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158455#comment-14158455 ] stack commented on HBASE-12166:
The problem is that the meta region has no edits in it, so when we list the filesystem to find crashed servers, we see this:
{noformat}
476 2014-10-03 10:21:42,470 INFO [ActiveMasterManager] master.ServerManager(918): Finished waiting for region servers count to settle; checked in 6, slept for 1012 ms, expecting minimum of 5, maximum of 6, master is running
477 2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61565,1412356895892 belongs to an existing region server
478 2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(249): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61572,1412356895952 doesn't belong to a known region server, splitting
479 2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61576,1412356896091 belongs to an existing region server
480 2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61579,1412356896131 belongs to an existing region server
481 2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61582,1412356896169 belongs to an existing region server
482 2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61586,1412356896205 belongs to an existing region server
483 2014-10-03 10:21:42,471 INFO [ActiveMasterManager] master.MasterFileSystem(253): Log folder hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61591,1412356896245 belongs to an existing region server
{noformat}
i.e. all servers but the dead master, which in this case is localhost,61562,1412356895859. When we go to online hbase:meta, it complains that there is no WAL file:
{noformat}
501 2014-10-03 10:21:42,478 INFO [ActiveMasterManager] master.MasterFileSystem(325): Log dir for server localhost,61562,1412356895859 does not exist
502 2014-10-03 10:21:42,479 DEBUG [ActiveMasterManager] master.MasterFileSystem(323): Renamed region directory: hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61572,1412356895952-splitting
503 2014-10-03 10:21:42,479 INFO [ActiveMasterManager] master.SplitLogManager(536): dead splitlog workers [localhost,61562,1412356895859, localhost,61572,1412356895952]
504 2014-10-03 10:21:42,480 INFO [ActiveMasterManager] master.SplitLogManager(172): hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61572,1412356895952-splitting is empty dir, no logs to split
505 2014-10-03 10:21:42,480 DEBUG [ActiveMasterManager] master.SplitLogManager(235): Scheduling batch of logs to split
506 2014-10-03 10:21:42,480 INFO [ActiveMasterManager] master.SplitLogManager(237): started splitting 0 logs in [hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61572,1412356895952-splitting]
{noformat}
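As a rough illustration of the decision those MasterFileSystem lines are logging, here is a minimal sketch; the method and variable names here are hypothetical, not the HBase API, and the real code also handles details such as the -splitting suffix:
{code}
import java.util.Set;

// Hypothetical sketch of the "belongs to a known region server" check
// behind the MasterFileSystem(249/253) log lines above.
public class SplitDecisionSketch {
  static boolean needsSplitting(String logFolder, Set<String> liveServers) {
    // WAL directories are named after the server that wrote them, e.g.
    // hdfs://.../WALs/localhost,61572,1412356895952
    String dirServerName = logFolder.substring(logFolder.lastIndexOf('/') + 1);
    // A dir matching no checked-in server belonged to a crashed server:
    // its WALs must be split before its regions can serve again.
    return !liveServers.contains(dirServerName);
  }

  public static void main(String[] args) {
    Set<String> live = Set.of("localhost,61565,1412356895892");
    System.out.println(needsSplitting(
        "hdfs://localhost:58772/user/stack/hbase/WALs/localhost,61572,1412356895952",
        live)); // true: this dir would be renamed to ...-splitting and scheduled
  }
}
{code}
The catch in this failure, per the log above, is that the dead master's own dir does not exist at all (meta carried no edits), so its regions are marked recovering yet there is nothing for them to split.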
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158375#comment-14158375 ] Jimmy Xiang commented on HBASE-12166:
You are right. Not hosting the namespace table on the master would solve the issue. Your fix is fine with me. I'd still like to look into it further to find the root cause. Thanks.
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158340#comment-14158340 ] stack commented on HBASE-12166:
[~jxiang] A simple 'fix' would be to not host hbase:namespace on master? I don't mind finishing this one (I can make it fail reliably). Was just looking for input on how you'd like it solved.
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158319#comment-14158319 ] Jimmy Xiang commented on HBASE-12166:
Good investigation. The namespace table is handled just the same as any other user table. Let me take a look.
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158275#comment-14158275 ] stack commented on HBASE-12166:
[~jxiang] The problem here is that the master is carrying hbase:namespace, but when recovering we act as though it is only hosting hbase:meta. We mark hbase:meta as recovering and do its log splitting. hbase:namespace, found along with the other regions, is marked as recovering only because it was on the master -- and the master no longer has associated WALs because of the meta processing above -- so it just stays stuck in recovering mode. If you have a suggestion I'm all ears, else I'll hack something in.
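To make that failure mode concrete, here is a toy simulation (the method and map names are hypothetical, not the HBase API) of the sequence described above: the recovering flag is set for every region the dead master carried, but the clearing step only runs for regions whose WALs actually get split:
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model of the stuck state: a region flagged recovering whose server
// never gets a completed split task stays recovering forever.
public class StuckRecoverySimulation {
  // region -> recovering flag, standing in for /hbase/recovering-regions znodes
  static final Map<String, Boolean> recovering = new HashMap<>();

  static void markRegionsRecovering(Set<String> regions) {
    for (String r : regions) recovering.put(r, true);
  }

  static void splitLogsAndClear(Set<String> regionsWithSplitWals) {
    // Only regions whose server's WALs were actually split are cleared.
    for (String r : regionsWithSplitWals) recovering.put(r, false);
  }

  public static void main(String[] args) {
    // Both regions were on the dead master, so both get flagged.
    markRegionsRecovering(Set.of("hbase:meta", "hbase:namespace"));
    // Meta recovery consumes the master's WALs; only meta is cleared.
    splitLogsAndClear(Set.of("hbase:meta"));
    // Prints "true": every Get on hbase:namespace would now fail with
    // RegionInRecoveryException, exactly the exception in this report.
    System.out.println("namespace recovering? " + recovering.get("hbase:namespace"));
  }
}
{code}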