[jira] [Commented] (HBASE-5100) Rollback of split could cause closed region to be opened again
[ https://issues.apache.org/jira/browse/HBASE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177585#comment-13177585 ]

Hadoop QA commented on HBASE-5100:
--

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12508916/5100-v2.txt
against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
-1 javadoc. The javadoc tool appears to have generated -151 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
    org.apache.hadoop.hbase.mapred.TestTableMapReduce
    org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/639//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/639//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/639//console

This message is automatically generated.

Rollback of split could cause closed region to be opened again
--
Key: HBASE-5100
URL: https://issues.apache.org/jira/browse/HBASE-5100
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
Fix For: 0.92.0, 0.94.0
Attachments: 5100-v2.txt, hbase-5100.patch

If the master sends a close-region request to the RS while that region's split transaction is running, the closed region may be opened again.
See the detailed code in SplitTransaction#createDaughters:
{code}
List<StoreFile> hstoreFilesToSplit = null;
try {
  hstoreFilesToSplit = this.parent.close(false);
  if (hstoreFilesToSplit == null) {
    // The region was closed by a concurrent thread. We can't continue
    // with the split, instead we must just abandon the split. If we
    // reopen or split this could cause problems because the region has
    // probably already been moved to a different server, or is in the
    // process of moving to a different server.
    throw new IOException("Failed to close region: already closed by " +
      "another thread");
  }
} finally {
  this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
}
{code}
When rolling back, the JournalEntry.CLOSED_PARENT_REGION entry causes this.parent.initialize() to run. Although the region is not brought online in the regionserver, this may cause problems. For example, in our environment, the closed parent region was rolled back successfully, and then compaction and split started again.

The parent region is f892dd6107b6b4130199582abc78e9c1

master log
{code}
2011-12-26 00:24:42,693 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., src=dw87.kgb.sqa.cm4,60020,1324827866085, dest=dw80.kgb.sqa.cm4,60020,1324827865780
2011-12-26 00:24:42,693 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.
(offlining) 2011-12-26 00:24:42,694 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=dw87.kgb.sqa.cm4,60020,1324827866085, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. 2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase-tbfs/unassigned/f892dd6107b6b4130199582abc78e9c1 (region=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., server=dw87.kgb.sqa.cm4,60020,1324827866085, state=RS_ZK_REGION_CLOSING) 2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSING,
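
To make the rollback problem in the HBASE-5100 description concrete, here is a minimal sketch, assuming simplified stand-in classes. SplitJournalSketch and its nested Region are hypothetical and are not HBase's SplitTransaction or HRegion, and this is not necessarily what the attached patches do. It journals the close step only after close() has actually returned store files, so a rollback has nothing to undo when another thread closed the region first.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration; not the real HBase classes.
class SplitJournalSketch {
    enum JournalEntry { CLOSED_PARENT_REGION }

    /** Minimal region model: close() returns null if it was already closed. */
    static class Region {
        private boolean closed;
        boolean reopened;

        Region(boolean alreadyClosed) { this.closed = alreadyClosed; }

        List<String> close() {
            if (closed) {
                return null;
            }
            closed = true;
            return Collections.singletonList("storefile-1");
        }

        void initialize() {
            closed = false;
            reopened = true;
        }
    }

    final List<JournalEntry> journal = new ArrayList<>();
    final Region parent;

    SplitJournalSketch(Region parent) { this.parent = parent; }

    /** Journals the close only when this transaction really closed the region. */
    List<String> closeParent() throws IOException {
        List<String> files = parent.close();
        if (files == null) {
            throw new IOException("Failed to close region: already closed by another thread");
        }
        journal.add(JournalEntry.CLOSED_PARENT_REGION);
        return files;
    }

    /** Undoes only the steps recorded in the journal, newest first. */
    void rollback() {
        for (int i = journal.size() - 1; i >= 0; i--) {
            if (journal.get(i) == JournalEntry.CLOSED_PARENT_REGION) {
                parent.initialize();
            }
        }
    }

    /** Runs the close step, fails, rolls back, and reports whether the region was reopened. */
    static boolean reopenedAfterFailedSplit(boolean alreadyClosed) {
        SplitJournalSketch tx = new SplitJournalSketch(new Region(alreadyClosed));
        try {
            tx.closeParent();
            throw new IOException("simulated failure in a later split step");
        } catch (IOException e) {
            tx.rollback();
        }
        return tx.parent.reopened;
    }
}
```

In this model a region that was already closed by someone else stays closed after rollback, while a region the transaction closed itself is reopened, which is the behaviour the issue asks for.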
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Status: Patch Available (was: Open) use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v19.patch, 5064.v2.patch, 5064.v20.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5064: --- Attachment: 5064.v20.patch
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177603#comment-13177603 ]

Hudson commented on HBASE-5099:
---

Integrated in HBase-TRUNK #2593 (See [https://builds.apache.org/job/HBase-TRUNK/2593/])
HBASE-5099 ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on (Jimmy)

tedyu :
Files :
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Key: HBASE-5099
URL: https://issues.apache.org/jira/browse/HBASE-5099
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch

A RS died. The ServerShutdownHandler kicked in and started the log splitting. SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS had been hosting the root region, the master needed to wait for the log splitting to complete. This wait holds the zookeeper event thread, so the async create of the split tasks is never retried: there is only one event thread, and it is waiting for the root region to be assigned.
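
The blocking pattern described for HBASE-5099 can be reproduced in miniature. The sketch below is an illustration only: a single-threaded java.util.concurrent executor stands in for the ZooKeeper event thread, and SingleEventThreadSketch is a hypothetical name, not HBase or ZooKeeper code. A callback running on the only thread waits for work that is queued behind it on that same thread, so the wait can only end by timing out.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustration only: a single-threaded executor models the one event thread.
class SingleEventThreadSketch {

    /**
     * Runs a callback on the single thread that waits (with a timeout) for a
     * second callback queued on the same thread. Returns true if the wait
     * timed out, i.e. the queued work never got a chance to run.
     */
    static boolean waitTimesOut(long timeoutMillis) {
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> outer = eventThread.submit(() -> {
                // Queued behind the currently running callback, so it cannot start yet.
                Future<?> queued = eventThread.submit(() -> { });
                try {
                    queued.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    return true; // the only thread is busy waiting
                }
            });
            return outer.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            eventThread.shutdownNow();
        }
    }
}
```

Without the timeout the outer callback would wait forever, which mirrors the hang in the report: the thread that must process the queued request is the one doing the waiting.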
[jira] [Updated] (HBASE-5100) Rollback of split could cause closed region to be opened again
[ https://issues.apache.org/jira/browse/HBASE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5100:
--
Attachment: 5100-double-exeception.txt

Patch that covers runtime exception coming out of parent.close(false)

Rollback of split could cause closed region to be opened again
--
Key: HBASE-5100
URL: https://issues.apache.org/jira/browse/HBASE-5100
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
Fix For: 0.92.0, 0.94.0
Attachments: 5100-double-exeception.txt, 5100-v2.txt, hbase-5100.patch

If the master sends a close-region request to the RS while that region's split transaction is running, the closed region may be opened again.
See the detailed code in SplitTransaction#createDaughters:
{code}
List<StoreFile> hstoreFilesToSplit = null;
try {
  hstoreFilesToSplit = this.parent.close(false);
  if (hstoreFilesToSplit == null) {
    // The region was closed by a concurrent thread. We can't continue
    // with the split, instead we must just abandon the split. If we
    // reopen or split this could cause problems because the region has
    // probably already been moved to a different server, or is in the
    // process of moving to a different server.
    throw new IOException("Failed to close region: already closed by " +
      "another thread");
  }
} finally {
  this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
}
{code}
When rolling back, the JournalEntry.CLOSED_PARENT_REGION entry causes this.parent.initialize() to run. Although the region is not brought online in the regionserver, this may cause problems. For example, in our environment, the closed parent region was rolled back successfully, and then compaction and split started again.
The parent region is f892dd6107b6b4130199582abc78e9c1 master log {code} 2011-12-26 00:24:42,693 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., src=dw87.kgb.sqa.cm4,60020,1324827866085, dest=dw80.kgb.sqa.cm4,60020,1324827865780 2011-12-26 00:24:42,693 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. (offlining) 2011-12-26 00:24:42,694 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=dw87.kgb.sqa.cm4,60020,1324827866085, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. 
2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase-tbfs/unassigned/f892dd6107b6b4130199582abc78e9c1 (region=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., server=dw87.kgb.sqa.cm4,60020,1324827866085, state=RS_ZK_REGION_CLOSING) 2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSING, server=dw87.kgb.sqa.cm4,60020,1324827866085, region=f892dd6107b6b4130199582abc78e9c1 2011-12-26 00:24:45,348 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=dw87.kgb.sqa.cm4,60020,1324827866085, region=f892dd6107b6b4130199582abc78e9c1 2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for f892dd6107b6b4130199582abc78e9c1 2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. state=CLOSED, ts=1324830285347 2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x13447f283f40e73 Creating (or updating) unassigned node for f892dd6107b6b4130199582abc78e9c1 with OFFLINE state 2011-12-26 00:24:45,354 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=dw75.kgb.sqa.cm4:6, region=f892dd6107b6b4130199582abc78e9c1 2011-12-26 00:24:45,354 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for
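
The attachment note above mentions covering a runtime exception coming out of parent.close(false). One common way to tell, inside a finally block, whether the try body completed normally is a success flag. The sketch below shows that idiom under assumptions: FinallyFlagSketch and attemptClose are hypothetical names, the journal is modelled as a list of strings, and none of it is taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of the success-flag idiom; not code from the patch.
class FinallyFlagSketch {

    /** Attempts the close step and returns the resulting journal. */
    static List<String> attemptClose(Supplier<List<String>> close) {
        List<String> journal = new ArrayList<>();
        boolean closedByUs = false;
        try {
            List<String> files = close.get();
            // Reached only if close() returned normally; null means already closed.
            closedByUs = files != null;
        } catch (RuntimeException e) {
            // Swallowed here only so the sketch can return the journal for inspection.
        } finally {
            // The flag tells the finally block whether the close really happened.
            if (closedByUs) {
                journal.add("CLOSED_PARENT_REGION");
            }
        }
        return journal;
    }
}
```

With this shape the journal entry is written for a normal close, and skipped both when close() returns null and when it throws.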
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177613#comment-13177613 ] Hadoop QA commented on HBASE-5064: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508924/5064.v20.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 16 new or modified tests. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.master.TestDistributedLogSplitting Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/640//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/640//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/640//console This message is automatically generated. 
[jira] [Commented] (HBASE-5100) Rollback of split could cause closed region to be opened again
[ https://issues.apache.org/jira/browse/HBASE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177622#comment-13177622 ] Hadoop QA commented on HBASE-5100: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12508926/5100-double-exeception.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated -151 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 76 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/641//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/641//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/641//console This message is automatically generated. Rollback of split could cause closed region to be opened again -- Key: HBASE-5100 URL: https://issues.apache.org/jira/browse/HBASE-5100 Project: HBase Issue Type: Bug Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.92.0, 0.94.0 Attachments: 5100-double-exeception.txt, 5100-v2.txt, hbase-5100.patch If master sending close region to rs and region's split transaction concurrently happen, it may cause closed region to opened. 
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177637#comment-13177637 ]

nkeywal commented on HBASE-5064:

v20 is OK for commit, IMHO. There are two processes by default, and 4 on hadoop-qa. It is possible to change the number of processes used by specifying -Dsurefire.secondPartThreadCount=WhatYouWant on the mvn command line. Using -Dsurefire.secondPartThreadCount=1 means no parallelization.
[jira] [Commented] (HBASE-4955) Use the official versions of surefire junit
[ https://issues.apache.org/jira/browse/HBASE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177644#comment-13177644 ]

nkeywal commented on HBASE-4955:

We're now using 2.12-TRUNK-HBASE-2. It's a private version, built on the 2.12 trunk (i.e., it does not contain everything that will be in 2.12 final).

Surefire: Could be for Surefire 2.12. Issues to monitor are:
329 (category support): fixed, we use the official implementation from the trunk
773 (forked processes not killed after timeout): not fixed in trunk, not fixed in our version
786 (@Category with forkMode=always): fixed, we use the official implementation from the trunk
791 (incorrect elapsed time on test failure): fixed, we use the official implementation from the trunk
793 (incorrect time in the XML report): not fixed (reopened) in trunk, partially fixed in our version
760 (does not take into account the test method): fixed, we use the official implementation from the trunk
798 (print immediately the test class name): not fixed in trunk, not fixed in our version
799 (allow test parallelization when forkMode=always): fixed in trunk, fixed in our version with some minimal differences
800 (redirectTestOutputToFile not taken into account): not yet fixed on trunk, fixed in our version
806 (ignore selection criteria when -Dtest= is specified): not fixed in trunk, not fixed in our version
813 (randomly wrong tests count and empty summary files): fixed in trunk, fixed in our version

800 and 793 are the most important to monitor; they are the only ones that are fixed in our version but not on trunk.

Use the official versions of surefire junit
-
Key: HBASE-4955
URL: https://issues.apache.org/jira/browse/HBASE-4955
Project: HBase
Issue Type: Improvement
Components: test
Affects Versions: 0.94.0
Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

We currently use private versions of Surefire and JUnit since HBASE-4763. This JIRA tracks what we need in order to move to the official versions. Surefire 2.11 is just out but, after some tests, it does not contain everything we need.

JUnit: Could be for JUnit 4.11. Issue to monitor:
https://github.com/KentBeck/junit/issues/359: fixed in our version, no feedback for an integration on trunk

Surefire: Could be for Surefire 2.12. Issues to monitor are:
329 (category support): fixed, we use the official implementation from the trunk
786 (@Category with forkMode=always): fixed, we use the official implementation from the trunk
791 (incorrect elapsed time on test failure): fixed, we use the official implementation from the trunk
793 (incorrect time in the XML report): not fixed (reopened) on trunk, fixed in our version
760 (does not take into account the test method): fixed in trunk, not fixed in our version
798 (print immediately the test class name): not fixed in trunk, not fixed in our version
799 (allow test parallelization when forkMode=always): not fixed in trunk, not fixed in our version
800 (redirectTestOutputToFile not taken into account): not yet fixed on trunk, fixed in our version

800 and 793 are the most important to monitor; they are the only ones that are fixed in our version but not on trunk.
[jira] [Updated] (HBASE-5102) Change the default value of the property hbase.connection.per.config to false in hbase-default.xml
[ https://issues.apache.org/jira/browse/HBASE-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5102:
--
Attachment: 5102.addendum

Addendum that removes stale connection in HBaseAdmin ctor

Change the default value of the property hbase.connection.per.config to false in hbase-default.xml
-
Key: HBASE-5102
URL: https://issues.apache.org/jira/browse/HBASE-5102
Project: HBase
Issue Type: Improvement
Reporter: ramkrishna.s.vasudevan
Priority: Minor
Fix For: 0.90.6
Attachments: 5102.addendum, HBASE-5102.patch

The property hbase.connection.per.config has a default value of true in hbase-default.xml. In HConnectionManager we try to assign false as the default value if no value is specified. It is better to make the two uniform. As per Ted's suggestion, the value in hbase-default.xml is being changed to false.
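
The mismatch described for HBASE-5102 can be illustrated with plain java.util.Properties standing in for the HBase configuration. This is a sketch under assumptions: DefaultMismatchSketch and its effective method are hypothetical, and the real lookup in HConnectionManager is not reproduced here. It only shows that when the defaults file says true and the code's fallback says false, the effective value depends on whether the defaults were loaded.

```java
import java.util.Properties;

// Illustration only: Properties stands in for the HBase configuration object.
class DefaultMismatchSketch {
    static final String KEY = "hbase.connection.per.config";

    /** Returns the effective value, given whether the defaults file was loaded. */
    static boolean effective(boolean defaultsFileLoaded) {
        Properties conf = new Properties();
        if (defaultsFileLoaded) {
            // Value the issue reports for hbase-default.xml before the change.
            conf.setProperty(KEY, "true");
        }
        // Fallback used by the code when no value is present.
        return Boolean.parseBoolean(conf.getProperty(KEY, "false"));
    }
}
```

Making both defaults agree, as the issue proposes, removes the dependence on how the configuration was constructed.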
[jira] [Commented] (HBASE-5100) Rollback of split could cause closed region to be opened again
[ https://issues.apache.org/jira/browse/HBASE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177683#comment-13177683 ]

Zhihong Yu commented on HBASE-5100:
---

See discussion 'detecting presence of exception inside finally block' on sea...@yahoogroups.com where I polled Java developers on my proposed formation.
[jira] [Commented] (HBASE-5109) Fix TestAvroServer so that it waits properly for the modifyTable operation to complete
[ https://issues.apache.org/jira/browse/HBASE-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177695#comment-13177695 ]

Zhihong Yu commented on HBASE-5109:
---

@Ming: A loop was introduced by the following checkin:

r1186531 | stack | 2011-10-19 15:05:37 -0700 (Wed, 19 Oct 2011) | 1 line
HBASE-4621 TestAvroServer fails quite often intermittently

Is your patch still needed? TestAvroServer hasn't failed for quite a while.

Fix TestAvroServer so that it waits properly for the modifyTable operation to complete
--
Key: HBASE-5109
URL: https://issues.apache.org/jira/browse/HBASE-5109
Project: HBase
Issue Type: Bug
Components: test
Reporter: Ming Ma
Assignee: Ming Ma
Attachments: HBASE-5109-0.92.patch

TestAvroServer has the following issue:

    impl.modifyTable(tableAname, tableA);
    // It can take a while for the change to take effect. Wait here a while.
    while(impl.describeTable(tableAname) == null ) {
      Threads.sleep(100);
    }
    assertTrue(impl.describeTable(tableAname).maxFileSize == 123456L);

impl.describeTable(tableAname) returns the default max file size of 256M right away because modifyTable is asynchronous, so the loop exits immediately and the assertion can run before the change has taken effect. Until HBASE-4328 is fixed, the test code can wait up to, say, 5 seconds for impl.describeTable(tableAname).maxFileSize to be updated to 123456L.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
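The bounded wait proposed in the description could look roughly like the sketch below. It is not taken from HBASE-5109-0.92.patch; `WaitUtil` and `waitFor` are hypothetical names, and the helper only shows the general pattern of polling a condition until a deadline.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of HBase or TestAvroServer. */
final class WaitUtil {
  private WaitUtil() {}

  /**
   * Polls the condition every intervalMs until it holds or timeoutMs elapses.
   * Returns true if the condition was observed to hold, false on timeout or
   * if the calling thread was interrupted while sleeping.
   */
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // deadline passed and the condition still does not hold
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for callers
        return false;
      }
    }
    return true;
  }
}
```

In the test this might be used as `assertTrue(WaitUtil.waitFor(5000, 100, () -> impl.describeTable(tableAname).maxFileSize == 123456L));`, assuming describeTable can be called from a lambda; any checked exception it declares would have to be handled inside the lambda.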
[jira] [Updated] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5099:
---
Resolution: Fixed
Status: Resolved (was: Patch Available)

ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Key: HBASE-5099
URL: https://issues.apache.org/jira/browse/HBASE-5099
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch

An RS died. The ServerShutdownHandler kicked in and started the log splitting. SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued.

At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS had the old root region, the master needs to wait for the log splitting to complete. This waiting holds the zookeeper event thread. So the async create split task is never retried, since there is only one event thread and it is waiting for the root region to be assigned.
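The hang described above comes down to a single callback thread blocking on work that can only be delivered by that same thread. The toy model below is not HBase or ZooKeeper code, and every name in it is made up for illustration; it uses a single-threaded executor to stand in for the one event thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Toy model only; none of these names are HBase or ZooKeeper APIs. */
final class SingleEventThreadDemo {
  private SingleEventThreadDemo() {}

  /**
   * Runs a "wait for log splitting" task and a "retry split task creation"
   * callback. Returns true if the waiter saw the splitting finish in time.
   *
   * @param waitOnEventThread if true, the wait runs on the single event
   *        thread, so the retry queued behind it cannot run until the wait ends
   */
  static boolean run(boolean waitOnEventThread, long waitMs) {
    ExecutorService eventThread = Executors.newSingleThreadExecutor();
    ExecutorService otherThread = Executors.newSingleThreadExecutor();
    CountDownLatch splitDone = new CountDownLatch(1);
    CountDownLatch waiterStarted = new CountDownLatch(1);
    try {
      ExecutorService waiterPool = waitOnEventThread ? eventThread : otherThread;
      Future<Boolean> waiter = waiterPool.submit(() -> {
        waiterStarted.countDown();
        return splitDone.await(waitMs, TimeUnit.MILLISECONDS);
      });
      waiterStarted.await();
      // The retry callback is always delivered on the single event thread.
      eventThread.execute(splitDone::countDown);
      return waiter.get();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      eventThread.shutdownNow();
      otherThread.shutdownNow();
    }
  }
}
```

With `run(true, 200)` the wait occupies the event thread, the retry never gets a turn, and the wait times out. With `run(false, 2000)` the wait happens elsewhere and the retry goes through. The real wait in the report has no such timeout, which is why it hangs.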
[jira] [Created] (HBASE-5110) code enhancement - remove unnecessary if-checks in every loop in HLog class
code enhancement - remove unnecessary if-checks in every loop in HLog class
---
Key: HBASE-5110
URL: https://issues.apache.org/jira/browse/HBASE-5110
Project: HBase
Issue Type: Improvement
Components: wal
Affects Versions: 0.90.4, 0.90.2, 0.90.1, 0.92.0
Reporter: Mikael Sitruk
Priority: Minor

The HLog class (method findMemstoresWithEditsEqualOrOlderThan) has an unnecessary if check in a loop.

    static byte [][] findMemstoresWithEditsEqualOrOlderThan(final long oldestWALseqid,
        final Map<byte [], Long> regionsToSeqids) {
      // This method is static so it can be unit tested the easier.
      List<byte []> regions = null;
      for (Map.Entry<byte [], Long> e: regionsToSeqids.entrySet()) {
        if (e.getValue().longValue() <= oldestWALseqid) {
          if (regions == null) regions = new ArrayList<byte []>();
          regions.add(e.getKey());
        }
      }
      return regions == null?
        null: regions.toArray(new byte [][] {HConstants.EMPTY_BYTE_ARRAY});
    }

The following change is suggested:

    static byte [][] findMemstoresWithEditsEqualOrOlderThan(final long oldestWALseqid,
        final Map<byte [], Long> regionsToSeqids) {
      // This method is static so it can be unit tested the easier.
      List<byte []> regions = new ArrayList<byte []>();
      for (Map.Entry<byte [], Long> e: regionsToSeqids.entrySet()) {
        if (e.getValue().longValue() <= oldestWALseqid) {
          regions.add(e.getKey());
        }
      }
      return regions.size() == 0?
        null: regions.toArray(new byte [][] {HConstants.EMPTY_BYTE_ARRAY});
    }
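For reference, here is a standalone, compilable version of the suggested variant. It is a sketch rather than the HLog source: `MemstoreFinder` is a made-up holder class, and the local `EMPTY_BYTE_ARRAY` constant stands in for `HConstants.EMPTY_BYTE_ARRAY` so the example has no HBase dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Standalone sketch; MemstoreFinder is a hypothetical holder class. */
final class MemstoreFinder {
  // Stand-in for HConstants.EMPTY_BYTE_ARRAY.
  private static final byte[] EMPTY_BYTE_ARRAY = new byte[0];

  private MemstoreFinder() {}

  /**
   * Returns the keys (region names) whose sequence id is less than or equal
   * to oldestWALseqid, or null when no region qualifies.
   */
  static byte[][] findMemstoresWithEditsEqualOrOlderThan(
      final long oldestWALseqid, final Map<byte[], Long> regionsToSeqids) {
    // Allocated up front, so the loop body needs no null check.
    List<byte[]> regions = new ArrayList<byte[]>();
    for (Map.Entry<byte[], Long> e : regionsToSeqids.entrySet()) {
      if (e.getValue().longValue() <= oldestWALseqid) {
        regions.add(e.getKey());
      }
    }
    return regions.isEmpty()
        ? null
        : regions.toArray(new byte[][] {EMPTY_BYTE_ARRAY});
  }
}
```

The trade-off is that this variant allocates an ArrayList on every call, including calls where no region qualifies, in exchange for dropping one null check per matching entry.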
[jira] [Commented] (HBASE-4397) -ROOT-, .META. table stay offline for too long in the case of all RSs are shutdown at the same time
[ https://issues.apache.org/jira/browse/HBASE-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177736#comment-13177736 ]

Zhihong Yu commented on HBASE-4397:
---

+1 on patch.

-ROOT-, .META. table stay offline for too long in the case of all RSs are shutdown at the same time
---
Key: HBASE-4397
URL: https://issues.apache.org/jira/browse/HBASE-4397
Project: HBase
Issue Type: Bug
Reporter: Ming Ma
Assignee: Ming Ma
Attachments: HBASE-4397-0.92.patch

1. Shut down all RSs.
2. Bring all RSs back online.

The -ROOT- and .META. tables stay offline until the timeout monitor forces assignment 30 minutes later. That is because HMaster can't find an RS to assign the tables to in the assign operation.

2011-09-13 13:25:52,743 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of -ROOT-,,0.70236052 to sea-lab-4,60020,1315870341387, trying to assign elsewhere instead; retry=0
java.net.ConnectException: Connection refused
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
  at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
  at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373)
  at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:345)
  at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1002)
  at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:854)
  at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:148)
  at $Proxy9.openRegion(Unknown Source)
  at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:407)
  at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1408)
  at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1153)
  at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1128)
  at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1123)
  at org.apache.hadoop.hbase.master.AssignmentManager.assignRoot(AssignmentManager.java:1788)
  at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.verifyAndAssignRoot(ServerShutdownHandler.java:100)
  at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.verifyAndAssignRootWithRetries(ServerShutdownHandler.java:118)
  at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:181)
  at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:167)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
2011-09-13 13:25:52,743 WARN org.apache.hadoop.hbase.master.AssignmentManager: Unable to find a viable location to assign region -ROOT-,,0.70236052

Possible fixes:
1. Have serverManager handle the server online event, similar to how RegionServerTracker.java calls servermanager.expireServer when a server goes down.
2. Make timeoutMonitor handle the situation better. This is a special situation in the cluster; the 30-minute timeout can be skipped.
[jira] [Commented] (HBASE-5110) code enhancement - remove unnecessary if-checks in every loop in HLog class
[ https://issues.apache.org/jira/browse/HBASE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177750#comment-13177750 ]

Todd Lipcon commented on HBASE-5110:

why? This isn't a hot code path...

code enhancement - remove unnecessary if-checks in every loop in HLog class
---
Key: HBASE-5110
URL: https://issues.apache.org/jira/browse/HBASE-5110
Project: HBase
Issue Type: Improvement
Components: wal
Affects Versions: 0.90.1, 0.90.2, 0.90.4, 0.92.0
Reporter: Mikael Sitruk
Priority: Minor
[jira] [Commented] (HBASE-4397) -ROOT-, .META. table stay offline for too long in the case of all RSs are shutdown at the same time
[ https://issues.apache.org/jira/browse/HBASE-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177756#comment-13177756 ]

Lars Hofhansl commented on HBASE-4397:
--

Nice find and patch... +1

(As a sidenote... Do we have to rethink this entire ROOT and META huh hah? There isn't a week going by without some new bug about races between splitting and assignment, or the master being stuck assigning ROOT/META, or similar cases. There are too many players that need to be kept in sync: the FS, ROOT/META, ZooKeeper.)

-ROOT-, .META. table stay offline for too long in the case of all RSs are shutdown at the same time
---
Key: HBASE-4397
URL: https://issues.apache.org/jira/browse/HBASE-4397
Project: HBase
Issue Type: Bug
Reporter: Ming Ma
Assignee: Ming Ma
Attachments: HBASE-4397-0.92.patch
[jira] [Updated] (HBASE-4397) -ROOT-, .META. table stay offline for too long in the case of all RSs are shutdown at the same time
[ https://issues.apache.org/jira/browse/HBASE-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-4397:
--
Status: Patch Available (was: Open)

-ROOT-, .META. table stay offline for too long in the case of all RSs are shutdown at the same time
---
Key: HBASE-4397
URL: https://issues.apache.org/jira/browse/HBASE-4397
Project: HBase
Issue Type: Bug
Reporter: Ming Ma
Assignee: Ming Ma
Attachments: HBASE-4397-0.92.patch
[jira] [Commented] (HBASE-5100) Rollback of split could cause closed region to be opened again
[ https://issues.apache.org/jira/browse/HBASE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177785#comment-13177785 ]

stack commented on HBASE-5100:
--

What's happening now in this issue? There is a v2. Is that now the candidate fix?

Rollback of split could cause closed region to be opened again
--
Key: HBASE-5100
URL: https://issues.apache.org/jira/browse/HBASE-5100
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
Fix For: 0.92.0, 0.94.0
Attachments: 5100-double-exeception.txt, 5100-v2.txt, hbase-5100.patch

If the master sends a close-region request to the RS while that region's split transaction is running, the closed region may be opened again. See the detailed code in SplitTransaction#createDaughters:
{code}
List<StoreFile> hstoreFilesToSplit = null;
try{
  hstoreFilesToSplit = this.parent.close(false);
  if (hstoreFilesToSplit == null) {
    // The region was closed by a concurrent thread. We can't continue
    // with the split, instead we must just abandon the split. If we
    // reopen or split this could cause problems because the region has
    // probably already been moved to a different server, or is in the
    // process of moving to a different server.
    throw new IOException("Failed to close region: already closed by " +
      "another thread");
  }
} finally {
  this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
}
{code}
When rolling back, the JournalEntry.CLOSED_PARENT_REGION entry causes this.parent.initialize() to be called. Although the region is not onlined in the regionserver, this may cause problems. For example, in our environment, the closed parent region was rolled back successfully, and then compaction and split started again.
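The try/finally pattern quoted from SplitTransaction#createDaughters can be reduced to a toy model. The sketch below is not HBase code, and `closeGuarded` is not the content of 5100-v2.txt or hbase-5100.patch; it is only one assumed direction (record the journal entry only when this thread performed the close), shown to make the difference in rollback behavior concrete. All class and method names are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Toy model of the journal pattern; not the HBase SplitTransaction class. */
final class JournalSketch {
  enum JournalEntry { CLOSED_PARENT_REGION }

  final List<JournalEntry> journal = new ArrayList<JournalEntry>();
  boolean parentReopened = false;

  /** Stand-in for parent.close(false): null means another thread closed it first. */
  private List<String> closeParent(boolean alreadyClosedElsewhere) {
    return alreadyClosedElsewhere ? null : new ArrayList<String>();
  }

  /** Pattern from the report: the journal entry is added in finally, even on failure. */
  void closeAsReported(boolean alreadyClosedElsewhere) throws IOException {
    try {
      if (closeParent(alreadyClosedElsewhere) == null) {
        throw new IOException("Failed to close region: already closed by another thread");
      }
    } finally {
      journal.add(JournalEntry.CLOSED_PARENT_REGION);
    }
  }

  /** Assumed alternative: journal only when this thread actually did the close. */
  void closeGuarded(boolean alreadyClosedElsewhere) throws IOException {
    if (closeParent(alreadyClosedElsewhere) == null) {
      throw new IOException("Failed to close region: already closed by another thread");
    }
    journal.add(JournalEntry.CLOSED_PARENT_REGION);
  }

  /** Rollback walks the journal backwards; CLOSED_PARENT_REGION reopens the parent. */
  void rollback() {
    for (int i = journal.size() - 1; i >= 0; i--) {
      if (journal.get(i) == JournalEntry.CLOSED_PARENT_REGION) {
        parentReopened = true; // stands in for this.parent.initialize()
      }
    }
  }
}
```

In this model, `closeAsReported(true)` followed by `rollback()` reopens a region that another thread had closed, while `closeGuarded(true)` followed by `rollback()` leaves it closed. A real fix would still have to cover a close that fails partway through, which the toy ignores.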
[jira] [Commented] (HBASE-5110) code enhancement - remove unnecessary if-checks in every loop in HLog class
[ https://issues.apache.org/jira/browse/HBASE-5110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177841#comment-13177841 ]

Todd Lipcon commented on HBASE-5110:

Ah, I missed that thread... I just wanted to clarify if this is for readability or performance... do you see this function getting called a lot in a write workload? Your comments on the mailing list thread indicate that it's performance sensitive, but I don't see how that would be the case.

code enhancement - remove unnecessary if-checks in every loop in HLog class
---
Key: HBASE-5110
URL: https://issues.apache.org/jira/browse/HBASE-5110
Project: HBase
Issue Type: Improvement
Components: wal
Affects Versions: 0.90.1, 0.90.2, 0.90.4, 0.92.0
Reporter: Mikael Sitruk
Priority: Minor
[jira] [Reopened] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu reopened HBASE-5099:
---

0.92 Jenkins builds have failed 4 times in a row. TestReplication#queueFailover failed in builds 217 and 218. It failed consistently on MacBook as well. Rolling back the patches.

ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on

Key: HBASE-5099
URL: https://issues.apache.org/jira/browse/HBASE-5099
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
[jira] [Commented] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177845#comment-13177845 ]

jirapos...@reviews.apache.org commented on HBASE-4608:
--

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2740/
---

(Updated 2011-12-31 00:20:40.770066)

Review request for hbase, Eli Collins and Todd Lipcon.

Changes
---
WritableContext makes things cleaner. Some space optimizations to make compression even more efficient.

Summary
---
Here's what I have so far. Things are written, and should work. I need to rework the test cases to test this, and put something in the config file to enable/disable. Obviously this isn't ready for commit at the moment, but I can get those two things done pretty quickly. Obviously the dictionary is incredibly simple at the moment; I'll come up with something cooler soon. Let me know how this looks.

This addresses bug HBase-4608. https://issues.apache.org/jira/browse/HBase-4608

Diffs (updated)
-
src/main/java/org/apache/hadoop/hbase/HConstants.java 5120a3c
src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressedKeyValue.java PRE-CREATION
src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressionContext.java PRE-CREATION
src/main/java/org/apache/hadoop/hbase/regionserver/wal/Compressor.java PRE-CREATION
src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 24407af
src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogKey.java f067221
src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogReader.java d9cd6de
src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogWriter.java cbef70f
src/main/java/org/apache/hadoop/hbase/regionserver/wal/SimpleDictionary.java PRE-CREATION
src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALDictionary.java PRE-CREATION
src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALEdit.java e1117ef
src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestSimpleDictionary.java PRE-CREATION
src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplay.java 59910bf
src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplayCompressed.java PRE-CREATION

Diff: https://reviews.apache.org/r/2740/diff

Testing
---

Thanks,
Li

HLog Compression

Key: HBASE-4608
URL: https://issues.apache.org/jira/browse/HBASE-4608
Project: HBase
Issue Type: New Feature
Reporter: Li Pi
Assignee: Li Pi
Attachments: 4608v1.txt

The current bottleneck to HBase write speed is replicating the WAL appends across different datanodes. We can speed up this process by compressing the HLog. The current plan involves using a dictionary to compress the table name, region id, cf name, and possibly other bits of repeated data. Also, the HLog format may be changed in other ways to produce a smaller HLog.
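The dictionary idea in the issue description can be illustrated with a very small sketch. This is not the SimpleDictionary or WALDictionary from the review request, whose interfaces are not shown here; `ToyWalDictionary` and its methods are hypothetical names, and the sketch only shows how a repeated byte string (a table name, region id, or cf name) could be replaced by a short index after its first appearance.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not the SimpleDictionary from the review request. */
final class ToyWalDictionary {
  static final short NOT_IN_DICTIONARY = -1;

  // ByteBuffer keys give content-based equals/hashCode for byte arrays.
  private final Map<ByteBuffer, Short> indexByValue = new HashMap<ByteBuffer, Short>();
  private final List<byte[]> valueByIndex = new ArrayList<byte[]>();

  /**
   * Returns the index of a value seen before. For a new value, adds it (while
   * there is room) and returns NOT_IN_DICTIONARY, signalling that the writer
   * must emit the full bytes this time.
   */
  short findOrAdd(byte[] value) {
    ByteBuffer key = ByteBuffer.wrap(value.clone());
    Short idx = indexByValue.get(key);
    if (idx != null) {
      return idx;
    }
    if (valueByIndex.size() < Short.MAX_VALUE) {
      indexByValue.put(key, (short) valueByIndex.size());
      valueByIndex.add(value.clone());
    }
    return NOT_IN_DICTIONARY;
  }

  /** Looks up the bytes for an index previously returned by findOrAdd. */
  byte[] get(short index) {
    return valueByIndex.get(index).clone();
  }
}
```

A writer built on this would emit the raw bytes when `findOrAdd` returns `NOT_IN_DICTIONARY` and a two-byte index otherwise. A reader would need to add values to its own dictionary in the same order so that the indices stay in sync.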
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177846#comment-13177846 ] Zhihong Yu commented on HBASE-5099: --- I reverted the 0.92 patch. Now TestReplication passes on Mac. Let's find out whether the patch is related to the replication test failure. Keeping the TRUNK patch in TRUNK for now since trunk build 2594 passed. ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch A RS died. The ServerShutdownHandler kicked in and started the log splitting. SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the ZooKeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new ZooKeeper connection was created, and this master became the active master again. It tried to assign root and meta. Because the dead RS had the old root region, the master needs to wait for the log splitting to complete. This wait holds the ZooKeeper event thread. So the async create of the split task is never retried, since there is only one event thread and it is waiting for the root region to be assigned.
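The hang described in this issue can be reproduced in miniature without ZooKeeper. The sketch below is only an analogy under stated assumptions: a single-threaded executor stands in for the ZooKeeper event thread, a latch stands in for "log splitting finished", and the class and method names are hypothetical. When the wait runs on the event thread, the callback that would release it is queued behind it and cannot run until the wait gives up.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; no HBase or ZooKeeper classes are involved. */
final class SingleEventThreadSketch {

    /**
     * Returns true if the wait was released before the timeout.
     * waitOnEventThread = true mimics the reported hang: the wait occupies the
     * only event thread, so the releasing callback stays queued behind it.
     */
    static boolean waitIsReleased(boolean waitOnEventThread, long timeoutMillis) {
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        ExecutorService otherThread = Executors.newSingleThreadExecutor();
        CountDownLatch logSplitDone = new CountDownLatch(1);
        try {
            // Stand-in for "wait until log splitting completes".
            Callable<Boolean> waitForLogSplit =
                () -> logSplitDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
            ExecutorService waiterThread = waitOnEventThread ? eventThread : otherThread;
            Future<Boolean> waiter = waiterThread.submit(waitForLogSplit);

            // Stand-in for the async callback that lets log splitting progress.
            // It is always delivered on the single event thread.
            eventThread.execute(logSplitDone::countDown);

            return waiter.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            eventThread.shutdownNow();
            otherThread.shutdownNow();
        }
    }
}
```

Calling waitIsReleased(true, 200) times out, while waitIsReleased(false, 5000) is released almost immediately because the wait no longer occupies the event thread. Moving the blocking wait off the event thread is one general way out of this pattern; whether the attached patches take that route is not stated in this message.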
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177847#comment-13177847 ] Jimmy Xiang commented on HBASE-5099: TestReplication is flaky, but it works on my Ubuntu box. Let me take a look. ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error
[ https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177849#comment-13177849 ] Shrijeet Paliwal commented on HBASE-5041: - I will update this JIRA with a new patch after the holidays. Major compaction on non existing table does not throw error Key: HBASE-5041 URL: https://issues.apache.org/jira/browse/HBASE-5041 Project: HBase Issue Type: Bug Components: regionserver, shell Affects Versions: 0.90.3 Reporter: Shrijeet Paliwal Assignee: Shrijeet Paliwal Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 0001-HBASE-5041-Throw-error-if-table-does-not-exist.patch
The following will not complain even if fubar does not exist:
{code}
echo "major_compact 'fubar'" | $HBASE_HOME/bin/hbase shell
{code}
The downside of this defect is that a major compaction may be silently skipped because of a typo by Ops.
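The behaviour asked for in this issue, failing loudly instead of silently accepting an unknown table, can be sketched as follows. This is a hypothetical stand-in and not the attached patch: CompactionRequestSketch, its in-memory table set and the use of IllegalArgumentException are assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not the HBase admin API. */
final class CompactionRequestSketch {
    private final Set<String> existingTables;
    private final List<String> queued = new ArrayList<>();

    CompactionRequestSketch(Collection<String> existingTables) {
        this.existingTables = new HashSet<>(existingTables);
    }

    /** Queues a major compaction, or throws if the table is unknown. */
    void majorCompact(String tableName) {
        if (!existingTables.contains(tableName)) {
            // Surface the typo to the operator instead of returning quietly.
            throw new IllegalArgumentException("Table does not exist: " + tableName);
        }
        queued.add(tableName);
    }

    /** Returns a copy of the requests accepted so far. */
    List<String> queuedRequests() {
        return new ArrayList<>(queued);
    }
}
```

With this check in place, a request for 'fubar' fails at submission time, so a mistyped table name no longer passes for a successful compaction.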
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177850#comment-13177850 ] Zhihong Yu commented on HBASE-5099: --- Please read through the test output of 0.92 builds 217 and 218. With patch 5099.92, the test failure is reproducible on a MacBook. Another validation is to deploy patch 5099.92 to real clusters and see if replication works. ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
[jira] [Commented] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177852#comment-13177852 ] Zhihong Yu commented on HBASE-4608: --- @Li: Do you want to submit the latest patch to Hadoop QA? Thanks HLog Compression Key: HBASE-4608 URL: https://issues.apache.org/jira/browse/HBASE-4608 Project: HBase Issue Type: New Feature Reporter: Li Pi Assignee: Li Pi Attachments: 4608v1.txt
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177860#comment-13177860 ] Jimmy Xiang commented on HBASE-5099: I tried to debug this test case, but it doesn't stop at the changes I made. ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177863#comment-13177863 ] Zhihong Yu commented on HBASE-5099: --- Test scripts from HBASE-4480 would be useful in reproducing the test failure. You can run TestReplication#queueFailover in a loop (on different OSes). ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
[jira] [Commented] (HBASE-5100) Rollback of split could cause closed region to be opened again
[ https://issues.apache.org/jira/browse/HBASE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177868#comment-13177868 ] chunhui shen commented on HBASE-5100: - @Zhihong I think both are OK now. I agree to commit 5100-double-exeception.txt since it is more understandable. Rollback of split could cause closed region to be opened again -- Key: HBASE-5100 URL: https://issues.apache.org/jira/browse/HBASE-5100 Project: HBase Issue Type: Bug Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.92.0, 0.94.0 Attachments: 5100-double-exeception.txt, 5100-v2.txt, hbase-5100.patch
If the master sends a close-region request to the RS while the region's split transaction is running, the closed region may be opened again. See the detailed code in SplitTransaction#createDaughters:
{code}
List<StoreFile> hstoreFilesToSplit = null;
try{
  hstoreFilesToSplit = this.parent.close(false);
  if (hstoreFilesToSplit == null) {
    // The region was closed by a concurrent thread. We can't continue
    // with the split, instead we must just abandon the split. If we
    // reopen or split this could cause problems because the region has
    // probably already been moved to a different server, or is in the
    // process of moving to a different server.
    throw new IOException("Failed to close region: already closed by " +
      "another thread");
  }
} finally {
  this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
}
{code}
When rolling back, the JournalEntry.CLOSED_PARENT_REGION entry causes this.parent.initialize() to be called. Although this region is not online in the region server, it may cause problems. For example, in our environment, the closed parent region was rolled back successfully, and then compaction and split started again.
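The journal problem described above can be shown with a small stand-alone sketch. It is not the committed patch and uses no HBase classes: SplitJournalSketch, its nested Region stub and the unchecked exception are assumptions for illustration. The idea shown is that the CLOSED_PARENT_REGION entry is recorded only when this transaction itself closed the region, so a rollback never reopens a region that another thread closed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not SplitTransaction from HBase. */
final class SplitJournalSketch {
    enum JournalEntry { CLOSED_PARENT_REGION }

    /** Stub region: close() returns null when another thread already closed it. */
    static final class Region {
        private boolean open = true;

        synchronized List<String> close() {
            if (!open) {
                return null;
            }
            open = false;
            return new ArrayList<>(); // store files to split (empty in this stub)
        }

        synchronized void initialize() {
            open = true;
        }

        synchronized boolean isOpen() {
            return open;
        }
    }

    private final List<JournalEntry> journal = new ArrayList<>();
    private final Region parent;

    SplitJournalSketch(Region parent) {
        this.parent = parent;
    }

    /** Closes the parent and journals the step only if this call did the closing. */
    void closeParent() {
        List<String> storeFiles = parent.close();
        if (storeFiles == null) {
            // Closed by someone else: abandon the split WITHOUT a journal entry,
            // so rollback() has nothing to undo for this step.
            throw new IllegalStateException(
                "Failed to close region: already closed by another thread");
        }
        journal.add(JournalEntry.CLOSED_PARENT_REGION);
    }

    /** Undoes journaled steps in reverse order. */
    void rollback() {
        for (int i = journal.size() - 1; i >= 0; i--) {
            if (journal.get(i) == JournalEntry.CLOSED_PARENT_REGION) {
                parent.initialize(); // reopen only what this transaction closed
            }
        }
        journal.clear();
    }
}
```

In the code quoted in the description, the finally block adds the entry even when the IOException is thrown, which is what lets rollback reopen a region that the master had already closed.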
The parent region is f892dd6107b6b4130199582abc78e9c1
master log
{code}
2011-12-26 00:24:42,693 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., src=dw87.kgb.sqa.cm4,60020,1324827866085, dest=dw80.kgb.sqa.cm4,60020,1324827865780
2011-12-26 00:24:42,693 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. (offlining)
2011-12-26 00:24:42,694 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=dw87.kgb.sqa.cm4,60020,1324827866085, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.
2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase-tbfs/unassigned/f892dd6107b6b4130199582abc78e9c1 (region=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., server=dw87.kgb.sqa.cm4,60020,1324827866085, state=RS_ZK_REGION_CLOSING)
2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSING, server=dw87.kgb.sqa.cm4,60020,1324827866085, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,348 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=dw87.kgb.sqa.cm4,60020,1324827866085, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. state=CLOSED, ts=1324830285347
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x13447f283f40e73 Creating (or updating) unassigned node for f892dd6107b6b4130199582abc78e9c1 with OFFLINE state
2011-12-26 00:24:45,354 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=dw75.kgb.sqa.cm4:6, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,354 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for
[jira] [Commented] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177873#comment-13177873 ] jirapos...@reviews.apache.org commented on HBASE-4608: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/2740/ --- (Updated 2011-12-31 02:06:00.510532) Review request for hbase, Eli Collins and Todd Lipcon.
Changes --- Fixed a failing test.
Summary --- Here's what I have so far. Things are written, and should work. I need to rework the test cases to test this, and put something in the config file to enable/disable it. Obviously this isn't ready for commit at the moment, but I can get those two things done pretty quickly. Obviously the dictionary is incredibly simple at the moment; I'll come up with something cooler soon. Let me know how this looks.
This addresses bug HBase-4608. https://issues.apache.org/jira/browse/HBase-4608
Diffs (updated) -
src/main/java/org/apache/hadoop/hbase/HConstants.java 5120a3c
src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressedKeyValue.java PRE-CREATION
src/main/java/org/apache/hadoop/hbase/regionserver/wal/CompressionContext.java PRE-CREATION
src/main/java/org/apache/hadoop/hbase/regionserver/wal/Compressor.java PRE-CREATION
src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java 24407af
src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLogKey.java f067221
src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogReader.java d9cd6de
src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogWriter.java cbef70f
src/main/java/org/apache/hadoop/hbase/regionserver/wal/SimpleDictionary.java PRE-CREATION
src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALDictionary.java PRE-CREATION
src/main/java/org/apache/hadoop/hbase/regionserver/wal/WALEdit.java e1117ef
src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestSimpleDictionary.java PRE-CREATION
src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplay.java 59910bf
src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestWALReplayCompressed.java PRE-CREATION
Diff: https://reviews.apache.org/r/2740/diff
Testing ---
Thanks, Li
HLog Compression Key: HBASE-4608 URL: https://issues.apache.org/jira/browse/HBASE-4608 Project: HBase Issue Type: New Feature Reporter: Li Pi Assignee: Li Pi Attachments: 4608v1.txt, 4608v5.txt
[jira] [Updated] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Pi updated HBASE-4608: - Attachment: 4608v5.txt HLog Compression Key: HBASE-4608 URL: https://issues.apache.org/jira/browse/HBASE-4608 Project: HBase Issue Type: New Feature Reporter: Li Pi Assignee: Li Pi Attachments: 4608v1.txt, 4608v5.txt
[jira] [Updated] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Pi updated HBASE-4608: - Release Note: Patch for WAL Compression. Status: Patch Available (was: Open) HLog Compression Key: HBASE-4608 URL: https://issues.apache.org/jira/browse/HBASE-4608 Project: HBase Issue Type: New Feature Reporter: Li Pi Assignee: Li Pi Attachments: 4608v1.txt, 4608v5.txt
[jira] [Commented] (HBASE-4608) HLog Compression
[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177874#comment-13177874 ] Li Pi commented on HBASE-4608: -- Yup. Good time to do it. On Fri, Dec 30, 2011 at 4:35 PM, Zhihong Yu (Commented) (JIRA) HLog Compression Key: HBASE-4608 URL: https://issues.apache.org/jira/browse/HBASE-4608 Project: HBase Issue Type: New Feature Reporter: Li Pi Assignee: Li Pi Attachments: 4608v1.txt, 4608v5.txt
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177877#comment-13177877 ] Hudson commented on HBASE-5099: --- Integrated in HBase-0.92 #219 (See [https://builds.apache.org/job/HBase-0.92/219/]) HBASE-5099 revert due to continuous 0.92 build failures
tedyu : Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch
[jira] [Commented] (HBASE-5100) Rollback of split could cause closed region to be opened again
[ https://issues.apache.org/jira/browse/HBASE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177879#comment-13177879 ] Zhihong Yu commented on HBASE-5100: --- Thanks for the feedback, Chunhui. I integrated the double-exception patch into 0.92 and TRUNK. Thanks for the initial patch, Chunhui. Thanks for the review, Stack. Rollback of split could cause closed region to be opened again -- Key: HBASE-5100 URL: https://issues.apache.org/jira/browse/HBASE-5100 Project: HBase Issue Type: Bug Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.92.0, 0.94.0 Attachments: 5100-double-exeception.txt, 5100-v2.txt, hbase-5100.patch
[jira] [Created] (HBASE-5111) Upgrade zookeeper to 3.4.2 release
Upgrade zookeeper to 3.4.2 release -- Key: HBASE-5111 URL: https://issues.apache.org/jira/browse/HBASE-5111 Project: HBase Issue Type: Task Reporter: Zhihong Yu Zookeeper 3.4.2 has just been released. We should upgrade to this release.
[jira] [Updated] (HBASE-5111) Upgrade zookeeper to 3.4.2 release
[ https://issues.apache.org/jira/browse/HBASE-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5111: -- Fix Version/s: 0.94.0 0.92.0 Upgrade zookeeper to 3.4.2 release -- Key: HBASE-5111 URL: https://issues.apache.org/jira/browse/HBASE-5111 Project: HBase Issue Type: Task Reporter: Zhihong Yu Fix For: 0.92.0, 0.94.0
[jira] [Commented] (HBASE-5064) use surefire tests parallelization
[ https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177884#comment-13177884 ] Zhihong Yu commented on HBASE-5064: --- I got the following running test suite on Linux:
{code}
Failed tests:
  testLogRollOnDatanodeDeath(org.apache.hadoop.hbase.regionserver.wal.TestLogRolling): LowReplication Roller should've been disabled
  testMultipleResubmits(org.apache.hadoop.hbase.master.TestSplitLogManager): expected:<2> but was:<3>
Tests run: 781, Failures: 2, Errors: 0, Skipped: 9
{code}
where:
{code}
open files (-n) 32768
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
{code}
I think we can give v20 a chance on Jenkins. At the moment test suite reliability is more important than speed, IMHO.
use surefire tests parallelization -- Key: HBASE-5064 URL: https://issues.apache.org/jira/browse/HBASE-5064 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v19.patch, 5064.v2.patch, 5064.v20.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch To be tried multiple times on hadoop-qa before committing.
[jira] [Updated] (HBASE-4397) -ROOT-, .META. tables stay offline for too long in recovery phase after all RSs are shutdown at the same time
[ https://issues.apache.org/jira/browse/HBASE-4397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-4397:
    Fix Version/s: 0.94.0, 0.92.0
    Summary: -ROOT-, .META. tables stay offline for too long in recovery phase after all RSs are shutdown at the same time (was: -ROOT-, .META. table stay offline for too long in the case of all RSs are shutdown at the same time)

-ROOT-, .META. tables stay offline for too long in recovery phase after all RSs are shutdown at the same time
Key: HBASE-4397
URL: https://issues.apache.org/jira/browse/HBASE-4397
Project: HBase
Issue Type: Bug
Reporter: Ming Ma
Assignee: Ming Ma
Fix For: 0.92.0, 0.94.0
Attachments: HBASE-4397-0.92.patch

1. Shut down all RSs.
2. Bring all RSs back online.

The -ROOT- and .META. tables stay offline until the timeout monitor forces assignment 30 minutes later. That is because HMaster can't find an RS to assign the tables to during the assign operation.

2011-09-13 13:25:52,743 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of -ROOT-,,0.70236052 to sea-lab-4,60020,1315870341387, trying to assign elsewhere instead; retry=0
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:345)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1002)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:854)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:148)
    at $Proxy9.openRegion(Unknown Source)
    at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:407)
    at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1408)
    at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1153)
    at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1128)
    at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1123)
    at org.apache.hadoop.hbase.master.AssignmentManager.assignRoot(AssignmentManager.java:1788)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.verifyAndAssignRoot(ServerShutdownHandler.java:100)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.verifyAndAssignRootWithRetries(ServerShutdownHandler.java:118)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:181)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:167)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
2011-09-13 13:25:52,743 WARN org.apache.hadoop.hbase.master.AssignmentManager: Unable to find a viable location to assign region -ROOT-,,0.70236052

Possible fixes:
1. Have serverManager handle the server online event, similar to how RegionServerTracker.java calls servermanager.expireServer when a server goes down.
2. Make timeoutMonitor handle the situation better. This is a special situation in the cluster, so the 30-minute timeout can be skipped.
[jira] [Created] (HBASE-5112) TestReplication#queueFailover flaky due to code error
TestReplication#queueFailover flaky due to code error
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
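The failure mode described above can be illustrated without HBase. The sketch below is a toy model, not the TestReplication code: `ToyScan`, `countRows`, `secondPassReused` and `secondPassFresh` are hypothetical names, and it assumes (this is not stated in the report) that the scan object carries a start position that advances as rows are read.

```java
public class ScanReuseSketch {
    /** Hypothetical stand-in for a scan descriptor that remembers where it stopped. */
    static final class ToyScan {
        int startRow = 0;
    }

    /** Counts rows from scan.startRow to the end of the table, advancing the scan itself. */
    static int countRows(int[] table, ToyScan scan) {
        int seen = 0;
        while (scan.startRow < table.length) {
            seen++;
            scan.startRow++;
        }
        return seen;
    }

    /** Reuses one scan object for two passes; the second pass starts where the first ended. */
    static int secondPassReused(int[] table) {
        ToyScan scan = new ToyScan();
        countRows(table, scan);
        return countRows(table, scan);
    }

    /** Creates a fresh scan object for every pass, so each pass covers the whole table. */
    static int secondPassFresh(int[] table) {
        countRows(table, new ToyScan());
        return countRows(table, new ToyScan());
    }

    public static void main(String[] args) {
        int[] table = {10, 20, 30, 40, 50};
        System.out.println("reused scan, second pass: " + secondPassReused(table));
        System.out.println("fresh scan, second pass:  " + secondPassFresh(table));
    }
}
```

In this model the remedy is simply to build a new scan object for each pass; whether the attached patch does exactly that is not shown in this message.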
[jira] [Updated] (HBASE-5112) TestReplication#queueFailover flaky due to code error
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5112:
    Attachment: hbase-5112.patch

TestReplication#queueFailover flaky due to code error
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Attachments: hbase-5112.patch

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177886#comment-13177886 ]

Jimmy Xiang commented on HBASE-5099:

TestReplication#queueFailover has a bug, which is why it is flaky: https://issues.apache.org/jira/browse/HBASE-5112

ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
Key: HBASE-5099
URL: https://issues.apache.org/jira/browse/HBASE-5099
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch

An RS died. The ServerShutdownHandler kicked in and started the log splitting. SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS held the old root region, the master needs to wait for the log splitting to complete. This wait holds the zookeeper event thread, so the async create of the split task is never retried: there is only one event thread, and it is waiting for the root region to be assigned.
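The hang described above is an instance of a general pattern: a task running on a single-threaded executor waits for work that is queued behind it on the same thread. The sketch below shows that pattern with plain `java.util.concurrent`; it is not HBase or ZooKeeper code, and the class and method names are hypothetical. Moving the blocking wait to a separate worker thread is shown only as one common remedy, not as a description of the attached patches.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SingleEventThreadSketch {
    /**
     * Runs a blocking wait on 'waiter' for a task queued on 'eventThread'.
     * Returns true if the wait timed out before the queued task ran.
     */
    static boolean waitTimesOut(ExecutorService waiter, ExecutorService eventThread, long timeoutMillis) {
        Future<Boolean> outcome = waiter.submit(() -> {
            // Stand-in for the queued async request (e.g. creating a task node).
            Future<String> queued = eventThread.submit(() -> "split task created");
            try {
                queued.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true;
            }
        });
        try {
            return outcome.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** The waiter is the event thread itself, so the queued task can never start. */
    static boolean blockedOnEventThread(long timeoutMillis) {
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        try {
            return waitTimesOut(eventThread, eventThread, timeoutMillis);
        } finally {
            eventThread.shutdownNow();
        }
    }

    /** The waiter is a separate worker, so the event thread stays free to run the task. */
    static boolean blockedOnWorker(long timeoutMillis) {
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            return waitTimesOut(worker, eventThread, timeoutMillis);
        } finally {
            worker.shutdownNow();
            eventThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("wait on event thread timed out: " + blockedOnEventThread(200));
        System.out.println("wait on worker thread timed out: " + blockedOnWorker(5000));
    }
}
```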
[jira] [Updated] (HBASE-5112) TestReplication#queueFailover flaky due to code error
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-5112:
    Status: Patch Available (was: Open)

TestReplication#queueFailover flaky due to code error
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Attachments: hbase-5112.patch

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
[jira] [Commented] (HBASE-5055) Build against hadoop 0.22 broken
[ https://issues.apache.org/jira/browse/HBASE-5055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177888#comment-13177888 ]

Hudson commented on HBASE-5055:

Integrated in HBase-0.92-security #54 (See [https://builds.apache.org/job/HBase-0.92-security/54/])
    HBASE-5055 Build against hadoop 0.22 broken - remove import of DFSClient.DFSInputStream (Ming Ma)

tedyu :
Files :
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogReader.java

Build against hadoop 0.22 broken
Key: HBASE-5055
URL: https://issues.apache.org/jira/browse/HBASE-5055
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Zhihong Yu
Assignee: stack
Priority: Blocker
Fix For: 0.92.0, 0.94.0
Attachments: 5055.txt, HBASE-5055-0.92.patch

I got the following when compiling TRUNK against hadoop 0.22:
{code}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project hbase: Compilation failure: Compilation failure:
[ERROR] /Users/zhihyu/trunk-hbase/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogReader.java:[37,39] cannot find symbol
[ERROR] symbol : class DFSInputStream
[ERROR] location: class org.apache.hadoop.hdfs.DFSClient
[ERROR]
[ERROR] /Users/zhihyu/trunk-hbase/src/main/java/org/apache/hadoop/hbase/regionserver/wal/SequenceFileLogReader.java:[109,37] cannot find symbol
[ERROR] symbol : class DFSInputStream
[ERROR] location: class org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.WALReader.WALReaderFSDataInputStream
{code}
[jira] [Commented] (HBASE-5103) Fix improper master znode deserialization
[ https://issues.apache.org/jira/browse/HBASE-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177889#comment-13177889 ]

Hudson commented on HBASE-5103:

Integrated in HBase-0.92-security #54 (See [https://builds.apache.org/job/HBase-0.92-security/54/])
    HBASE-5103 Fix improper master znode deserialization (Jonathan Hsieh)

tedyu :
Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java

Fix improper master znode deserialization
Key: HBASE-5103
URL: https://issues.apache.org/jira/browse/HBASE-5103
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jonathan Hsieh
Assignee: Jonathan Hsieh
Priority: Minor
Fix For: 0.92.0, 0.94.0
Attachments: hbase-5103.patch

In ActiveMasterManager#blockUntilBecomingActiveMaster the master znode is created as a versioned serialized version of ServerName:
{code}
if (ZKUtil.createEphemeralNodeAndWatch(this.watcher, this.watcher.masterAddressZNode, sn.getVersionedBytes())) {
{code}
There are a few user visible places where it is used but not deserialized properly.
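To make the deserialization problem above concrete, here is a toy model of a versioned byte encoding. It is a sketch under assumptions, not the HBase implementation: the layout (a 2-byte version tag followed by the UTF-8 name), the `VERSION` value, and all method names are hypothetical. It shows how a reader that treats the whole payload as a string ends up with extra leading characters, while a reader that strips the version tag recovers the name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class VersionedBytesSketch {
    // Hypothetical version tag; the real value and width are not given in this message.
    static final short VERSION = 0;

    /** Encodes a name as [2-byte version][UTF-8 bytes of the name]. */
    static byte[] toVersionedBytes(String serverName) {
        byte[] payload = serverName.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(2 + payload.length).putShort(VERSION).put(payload).array();
    }

    /** Improper read: interprets the version tag as part of the name. */
    static String naiveRead(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    /** Proper read: checks and strips the version tag before decoding the name. */
    static String versionedRead(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        short version = buf.getShort();
        if (version != VERSION) {
            throw new IllegalArgumentException("unexpected version " + version);
        }
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] znodeData = toVersionedBytes("host.example,60000,1324827866085");
        System.out.println("naive length:     " + naiveRead(znodeData).length());
        System.out.println("versioned result: " + versionedRead(znodeData));
    }
}
```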
[jira] [Commented] (HBASE-5112) TestReplication#queueFailover flaky due to code error
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177887#comment-13177887 ]

Jimmy Xiang commented on HBASE-5112:

@Ted, could you please give this patch a try on your MacBook? I could not reproduce the failure on my box. I looked into the code carefully, and this fix should stop this test case from being flaky.

TestReplication#queueFailover flaky due to code error
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Attachments: hbase-5112.patch

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177890#comment-13177890 ]

Hudson commented on HBASE-5099:

Integrated in HBase-0.92-security #54 (See [https://builds.apache.org/job/HBase-0.92-security/54/])
    HBASE-5099 revert due to continuous 0.92 build failures
    HBASE-5099 ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on (Jimmy)

tedyu :
Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

tedyu :
Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
Key: HBASE-5099
URL: https://issues.apache.org/jira/browse/HBASE-5099
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch

An RS died. The ServerShutdownHandler kicked in and started the log splitting. SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS held the old root region, the master needs to wait for the log splitting to complete. This wait holds the zookeeper event thread, so the async create of the split task is never retried: there is only one event thread, and it is waiting for the root region to be assigned.
[jira] [Updated] (HBASE-5112) TestReplication#queueFailover flaky due to code error
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5112:
    Fix Version/s: 0.94.0, 0.92.0
    Issue Type: Test (was: Bug)

TestReplication#queueFailover flaky due to code error
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Test
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: hbase-5112.patch

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
[jira] [Updated] (HBASE-5112) TestReplication#queueFailover flaky due to potentially uninitialized Scan
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5112:
    Hadoop Flags: Reviewed
    Summary: TestReplication#queueFailover flaky due to potentially uninitialized Scan (was: TestReplication#queueFailover flaky due to code error)

TestReplication#queueFailover flaky due to potentially uninitialized Scan
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Test
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: hbase-5112.patch

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
[jira] [Commented] (HBASE-5112) TestReplication#queueFailover flaky due to code error
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177891#comment-13177891 ]

Lars Hofhansl commented on HBASE-5112:

Nice find. +1 on patch.

TestReplication#queueFailover flaky due to code error
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Test
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: hbase-5112.patch

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
[jira] [Updated] (HBASE-5112) TestReplication#queueFailover flaky due to potentially uninitialized Scan
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5112:
    Attachment: 5112-v2.txt

I propose this patch, based on Jimmy's, in which the Thread is set as a daemon.

TestReplication#queueFailover flaky due to potentially uninitialized Scan
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Test
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: 5112-v2.txt, hbase-5112.patch

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
[jira] [Commented] (HBASE-5112) TestReplication#queueFailover flaky due to potentially uninitialized Scan
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177893#comment-13177893 ]

Zhihong Yu commented on HBASE-5112:

I looped TestReplication#queueFailover 5 times using both 5112-v2.txt and 5099.92, with no errors. I am looping TestReplication itself 5 more times. Will integrate both 5112 and 5099 if there is no error. Thanks for the New Year present, Jimmy.

TestReplication#queueFailover flaky due to potentially uninitialized Scan
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Test
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: 5112-v2.txt, hbase-5112.patch

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
[jira] [Commented] (HBASE-5112) TestReplication#queueFailover flaky due to potentially uninitialized Scan
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177895#comment-13177895 ]

Lars Hofhansl commented on HBASE-5112:

+1 on v2

TestReplication#queueFailover flaky due to potentially uninitialized Scan
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Test
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: 5112-v2.txt, hbase-5112.patch

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
[jira] [Commented] (HBASE-5112) TestReplication#queueFailover flaky due to potentially uninitialized Scan
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177897#comment-13177897 ]

Zhihong Yu commented on HBASE-5112:

Integrated to 0.92 and TRUNK. Thanks for the patch, Jimmy. Thanks for the review, Lars.

TestReplication#queueFailover flaky due to potentially uninitialized Scan
Key: HBASE-5112
URL: https://issues.apache.org/jira/browse/HBASE-5112
Project: HBase
Issue Type: Test
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: 5112-v2.txt, hbase-5112.patch

In TestReplication#queueFailover, the second scan is not reset for each new scan, so a subsequent scan may not cover the whole table. It then cannot get all the data and the test fails.
[jira] [Commented] (HBASE-5108) ICV puts memstore before writing WAL first -- by default; make the default be 'correct' and let better perf be optional
[ https://issues.apache.org/jira/browse/HBASE-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177899#comment-13177899 ]

Lars Hofhansl commented on HBASE-5108:

To be very precise, what happens (ICV, increment, append) is that the WAL is written with the lock held, but the sync request is issued after the lock is released. So other clients can see the updated value in the memstore (in fact they do see it right away - see HBASE-4583). Now, if the region server dies before the sync is executed, the clients might have based their logic upon uncommitted state.

We cannot roll back the memstore state for ICVs because the operation is not idempotent (and for various other reasons also explained in HBASE-4583, all client scanners see the updates immediately).

I am somewhat torn on this one. This failure scenario is pretty rare, and the performance implication of being 100% correct would be significant. Maybe for ICVs there should be three different options: (1) write WAL synchronously, (2) don't write WAL, and a new option (3) do a best-effort WAL write.

ICV puts memstore before writing WAL first -- by default; make the default be 'correct' and let better perf be optional
Key: HBASE-5108
URL: https://issues.apache.org/jira/browse/HBASE-5108
Project: HBase
Issue Type: Bug
Reporter: stack
Priority: Critical

See this thread up on the list and Lars' note at the end: http://search-hadoop.com/m/Y6xTRp6sxq1/%2522Help+regarding+RowLock%2522subj=Help+regarding+RowLock

I thought it was just ICV that did the memstore put first. This issue is about making it so the described behavior is optional and that the default out of the box goes for correctness -- i.e. write WAL first and then memstore.
[jira] [Commented] (HBASE-5108) ICV puts memstore before writing WAL first -- by default; make the default be 'correct' and let better perf be optional
[ https://issues.apache.org/jira/browse/HBASE-5108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177901#comment-13177901 ]

Lars Hofhansl commented on HBASE-5108:

Where #3 would be the current behavior.

ICV puts memstore before writing WAL first -- by default; make the default be 'correct' and let better perf be optional
Key: HBASE-5108
URL: https://issues.apache.org/jira/browse/HBASE-5108
Project: HBase
Issue Type: Bug
Reporter: stack
Priority: Critical

See this thread up on the list and Lars' note at the end: http://search-hadoop.com/m/Y6xTRp6sxq1/%2522Help+regarding+RowLock%2522subj=Help+regarding+RowLock

I thought it was just ICV that did the memstore put first. This issue is about making it so the described behavior is optional and that the default out of the box goes for correctness -- i.e. write WAL first and then memstore.
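The three options discussed in this thread can be compared with a small single-threaded model. This is only a sketch under assumptions: `WalMode`, `IncrementDurabilitySketch` and its methods are hypothetical names, the WAL is a plain list, and "sync" just records how many entries would survive a crash. It is not HBase code. The deferred `sync()` call stands in for the sync request that is issued after the lock is released.

```java
import java.util.ArrayList;
import java.util.List;

public class IncrementDurabilitySketch {
    /** Hypothetical names for the three options discussed in the thread. */
    enum WalMode { SYNC_WAL, SKIP_WAL, BEST_EFFORT_WAL }

    private final Object rowLock = new Object();
    private final List<Long> wal = new ArrayList<>(); // appended edits, newest last
    private int syncedEntries = 0;                    // how many WAL entries are durable
    private long memstoreValue = 0;                   // what readers see

    /** Applies an increment; only SYNC_WAL makes the edit durable before it becomes visible. */
    long increment(long delta, WalMode mode) {
        synchronized (rowLock) {
            long next = memstoreValue + delta;
            if (mode != WalMode.SKIP_WAL) {
                wal.add(next);          // WAL append happens with the lock held
            }
            if (mode == WalMode.SYNC_WAL) {
                sync();                 // durable before readers can see the value
            }
            memstoreValue = next;       // visible to readers from here on
            return next;
        }
        // For BEST_EFFORT_WAL the caller issues sync() later, outside the lock.
    }

    /** Marks every appended WAL entry as durable. */
    void sync() {
        syncedEntries = wal.size();
    }

    /** Value a reader would observe right now. */
    long visibleValue() {
        return memstoreValue;
    }

    /** Value that would be recovered from the WAL if the server crashed right now. */
    long durableValue() {
        return syncedEntries == 0 ? 0 : wal.get(syncedEntries - 1);
    }

    public static void main(String[] args) {
        IncrementDurabilitySketch counter = new IncrementDurabilitySketch();
        counter.increment(5, WalMode.BEST_EFFORT_WAL);
        System.out.println("before sync: visible=" + counter.visibleValue() + " durable=" + counter.durableValue());
        counter.sync();
        System.out.println("after sync:  visible=" + counter.visibleValue() + " durable=" + counter.durableValue());
    }
}
```

In the best-effort mode there is a window in which the visible value is ahead of the durable value, which is the scenario the comment describes as rare but possible.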
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177900#comment-13177900 ]

Zhihong Yu commented on HBASE-5099:

Integrated 5099.92 to 0.92 branch again.

ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
Key: HBASE-5099
URL: https://issues.apache.org/jira/browse/HBASE-5099
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.92.0, 0.94.0
Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch

An RS died. The ServerShutdownHandler kicked in and started the log splitting. SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS held the old root region, the master needs to wait for the log splitting to complete. This wait holds the zookeeper event thread, so the async create of the split task is never retried: there is only one event thread, and it is waiting for the root region to be assigned.
[jira] [Commented] (HBASE-5100) Rollback of split could cause closed region to be opened again
[ https://issues.apache.org/jira/browse/HBASE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177908#comment-13177908 ]

Hudson commented on HBASE-5100:

Integrated in HBase-0.92 #220 (See [https://builds.apache.org/job/HBase-0.92/220/])
    HBASE-5100 Rollback of split could cause closed region to be opened again (Chunhui)

tedyu :
Files :
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java

Rollback of split could cause closed region to be opened again
Key: HBASE-5100
URL: https://issues.apache.org/jira/browse/HBASE-5100
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
Fix For: 0.92.0, 0.94.0
Attachments: 5100-double-exeception.txt, 5100-v2.txt, hbase-5100.patch

If the master sends a close-region request to an RS while that region's split transaction is running, the closed region may be opened again. See the detailed code in SplitTransaction#createDaughters:
{code}
List<StoreFile> hstoreFilesToSplit = null;
try {
  hstoreFilesToSplit = this.parent.close(false);
  if (hstoreFilesToSplit == null) {
    // The region was closed by a concurrent thread. We can't continue
    // with the split, instead we must just abandon the split. If we
    // reopen or split this could cause problems because the region has
    // probably already been moved to a different server, or is in the
    // process of moving to a different server.
    throw new IOException("Failed to close region: already closed by " +
      "another thread");
  }
} finally {
  this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
}
{code}
When rolling back, the JournalEntry.CLOSED_PARENT_REGION entry causes this.parent.initialize() to be called. Although this region is not onlined in the regionserver, it may cause problems. For example, in our environment, the closed parent region was rolled back successfully, and then compaction and split started again.
The parent region is f892dd6107b6b4130199582abc78e9c1.

master log
{code}
2011-12-26 00:24:42,693 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., src=dw87.kgb.sqa.cm4,60020,1324827866085, dest=dw80.kgb.sqa.cm4,60020,1324827865780
2011-12-26 00:24:42,693 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. (offlining)
2011-12-26 00:24:42,694 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=dw87.kgb.sqa.cm4,60020,1324827866085, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.
2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase-tbfs/unassigned/f892dd6107b6b4130199582abc78e9c1 (region=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., server=dw87.kgb.sqa.cm4,60020,1324827866085, state=RS_ZK_REGION_CLOSING)
2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSING, server=dw87.kgb.sqa.cm4,60020,1324827866085, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,348 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=dw87.kgb.sqa.cm4,60020,1324827866085, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. state=CLOSED, ts=1324830285347
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x13447f283f40e73 Creating (or updating) unassigned node for f892dd6107b6b4130199582abc78e9c1 with OFFLINE state
2011-12-26 00:24:45,354 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE,