[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269161#comment-14269161 ]

Cosmin Lehene commented on HBASE-5155:
--------------------------------------

[~ramkrishna.s.vasude...@gmail.com] [~apurtell] still valid?

> ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5155
>                 URL: https://issues.apache.org/jira/browse/HBASE-5155
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.90.6
>
>         Attachments: HBASE-5155_1.patch, HBASE-5155_2.patch, HBASE-5155_3.patch, HBASE-5155_latest.patch, hbase-5155_6.patch
>
>
> ServerShutdownHandler and the disable/delete table handlers race. This is not an issue due to TM.
> -> A regionserver goes down. In our cluster the regionserver holds a lot of regions.
> -> A region R1 has two daughters, D1 and D2.
> -> The ServerShutdownHandler gets called, scans META and gets all the user regions.
> -> In parallel, a table is disabled. (No problem in this step.)
> -> Delete table is done.
> -> The table and its regions are deleted, including R1, D1 and D2. (So META is cleaned.)
> -> Now ServerShutdownHandler starts to run processDeadRegion:
> {code}
> if (hri.isOffline() && hri.isSplit()) {
>   LOG.debug("Offlined and split region " + hri.getRegionNameAsString() +
>     "; checking daughter presence");
>   fixupDaughters(result, assignmentManager, catalogTracker);
> {code}
> As part of fixupDaughters, because the daughters D1 and D2 are missing for R1,
> {code}
> if (isDaughterMissing(catalogTracker, daughter)) {
>   LOG.info("Fixup; missing daughter " + daughter.getRegionNameAsString());
>   MetaEditor.addDaughter(catalogTracker, daughter, null);
>   // TODO: Log WARN if the regiondir does not exist in the fs.  If its not
>   // there then something wonky about the split -- things will keep going
>   // but could be missing references to parent region.
>   // And assign it.
>   assignmentManager.assign(daughter, true);
> {code}
> we call assign on the daughters.
> After this we continue with the code below.
> {code}
> if (processDeadRegion(e.getKey(), e.getValue(),
>     this.services.getAssignmentManager(),
>     this.server.getCatalogTracker())) {
>   this.services.getAssignmentManager().assign(e.getKey(), true);
> {code}
> When the SSH scanned META, it had R1, D1 and D2.
> So as part of the above code, D1 and D2, which were already assigned by fixupDaughters, are assigned again by
> {code}
> this.services.getAssignmentManager().assign(e.getKey(), true);
> {code}
> This leads to a ZooKeeper bad-version error that kills the master.
> The important part here is that the regions that were deleted are recreated, which I think is more critical.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
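The sequence in the issue description can be replayed with a small model. What follows is only a sketch under stated assumptions, not the committed patch: the class `StaleSnapshotRaceSketch`, its methods, and the `recheckLiveMeta` guard are all hypothetical names, and META is reduced to a plain set of region names. It shows how a snapshot scanned before the delete leads to D1 and D2 being recreated and assigned twice, and how re-checking live state before acting would avoid that in this toy setting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the race; none of these names are HBase APIs.
final class StaleSnapshotRaceSketch {

    // META as the shutdown handler scanned it: region -> daughters (empty if not split).
    static Map<String, List<String>> snapshot() {
        Map<String, List<String>> scanned = new LinkedHashMap<>();
        scanned.put("R1", Arrays.asList("D1", "D2"));
        scanned.put("D1", Collections.<String>emptyList());
        scanned.put("D2", Collections.<String>emptyList());
        return scanned;
    }

    // Walks the stale snapshot and returns every assign call it would issue, in order.
    // liveMeta is the current META content and is mutated when daughters are re-added.
    static List<String> processDeadServer(Map<String, List<String>> scanned,
                                          Set<String> liveMeta,
                                          boolean recheckLiveMeta) {
        List<String> assignCalls = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : scanned.entrySet()) {
            if (recheckLiveMeta && !liveMeta.contains(e.getKey())) {
                continue; // hypothetical guard: the region was deleted after the scan
            }
            if (!e.getValue().isEmpty()) {
                // Split parent: fix up daughters that are missing from live META.
                for (String daughter : e.getValue()) {
                    if (!liveMeta.contains(daughter)) {
                        liveMeta.add(daughter);    // a deleted region is recreated here
                        assignCalls.add(daughter); // first assign
                    }
                }
                continue; // the parent itself is not assigned
            }
            assignCalls.add(e.getKey()); // second assign for D1 and D2
        }
        return assignCalls;
    }

    public static void main(String[] args) {
        Set<String> metaAfterDelete = new TreeSet<>(); // delete table emptied META
        System.out.println(processDeadServer(snapshot(), metaAfterDelete, false));
        System.out.println(metaAfterDelete);
        System.out.println(processDeadServer(snapshot(), new TreeSet<String>(), true));
    }
}
```

Without the guard, the model issues assigns for D1, D2, D1, D2 and leaves D1 and D2 back in the live set; with the guard, it issues none. The real handlers coordinate through ZooKeeper and the AssignmentManager, so an actual fix has to work at that level rather than with an in-memory set.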
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266099#comment-13266099 ]

stack commented on HBASE-5155:
------------------------------

Should we revert and roll a 0.90.7?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265633#comment-13265633 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
-----------------------------------------------

@David
Updated the release notes. Thanks for your review.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265114#comment-13265114 ]

David S. Wang commented on HBASE-5155:
--------------------------------------

Ram,

> If the HBase client does not have the changes for HBASE-5155 and the server
> has the changes for HBASE-5155, then if we try to Enable a table then the
> client will hang.

Actually, I noticed that the hang happens in the opposite case: when the client has the changes for HBASE-5155, and the server does not.

Otherwise the release note looks OK to me.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231861#comment-13231861 ]

Hudson commented on HBASE-5155:
-------------------------------

Integrated in HBase-TRUNK-security #140 (See [https://builds.apache.org/job/HBase-TRUNK-security/140/])
    HBASE-5206 Port HBASE-5155 to trunk (Ashutosh Jindal) (Revision 1301709)

     Result = SUCCESS
tedyu :
Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/DeleteTableHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKTable.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/TestDrainingServer.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterRestartAfterDisablingTable.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZKTable.java
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231637#comment-13231637 ]

Hudson commented on HBASE-5155:
-------------------------------

Integrated in HBase-0.94 #36 (See [https://builds.apache.org/job/HBase-0.94/36/])
    HBASE-5206 port HBASE-5155 to 0.94 (Ashutosh Jindal) (Revision 1301737)

     Result = SUCCESS
tedyu :
Files :
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/DeleteTableHandler.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKTable.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/TestDrainingServer.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterRestartAfterDisablingTable.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZKTable.java
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231579#comment-13231579 ]

Hudson commented on HBASE-5155:
-------------------------------

Integrated in HBase-TRUNK #2685 (See [https://builds.apache.org/job/HBase-TRUNK/2685/])
    HBASE-5206 Port HBASE-5155 to trunk (Ashutosh Jindal) (Revision 1301709)

     Result = SUCCESS
tedyu :
Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/DeleteTableHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKTable.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/TestDrainingServer.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterRestartAfterDisablingTable.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZKTable.java
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229926#comment-13229926 ]

Hudson commented on HBASE-5155:
-------------------------------

Integrated in HBase-TRUNK-security #138 (See [https://builds.apache.org/job/HBase-TRUNK-security/138/])
    HBASE-5206 port HBASE-5155 to TRUNK (Ashutosh Jindal) (Revision 1300711)

     Result = SUCCESS
tedyu :
Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/DeleteTableHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKTable.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterRestartAfterDisablingTable.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZKTable.java
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229638#comment-13229638 ]

Hudson commented on HBASE-5155:
-------------------------------

Integrated in HBase-TRUNK #2680 (See [https://builds.apache.org/job/HBase-TRUNK/2680/])
    HBASE-5206 port HBASE-5155 to TRUNK (Ashutosh Jindal) (Revision 1300711)

     Result = FAILURE
tedyu :
Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/DeleteTableHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKTable.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMaster.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterRestartAfterDisablingTable.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZKTable.java

> ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
> ---
>
>                 Key: HBASE-5155
>                 URL: https://issues.apache.org/jira/browse/HBASE-5155
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.90.6
>
>         Attachments: HBASE-5155_1.patch, HBASE-5155_2.patch, HBASE-5155_3.patch, HBASE-5155_latest.patch, hbase-5155_6.patch
>
>
> ServerShutDownHandler and disable/delete table handler races.  This is not an issue due to TM.
> -> A regionserver goes down.  In our cluster the regionserver holds lot of regions.
> -> A region R1 has two daughters D1 and D2.
> -> The ServerShutdownHandler gets called and scans the META and gets all the user regions
> -> Parallely a table is disabled. (No problem in this step).
> -> Delete table is done.
> -> The tables and its regions are deleted including R1, D1 and D2.. (So META is cleaned)
> -> Now ServerShutdownhandler starts to processTheDeadRegion
> {code}
>  if (hri.isOffline() && hri.isSplit()) {
>       LOG.debug("Offlined and split region " + hri.getRegionNameAsString() +
>         "; checking daughter presence");
>       fixupDaughters(result, assignmentManager, catalogTracker);
> {code}
> As part of fixUpDaughters as the daughers D1 and D2 is missing for R1
> {code}
>     if (isDaughterMissing(catalogTracker, daughter)) {
>       LOG.info("Fixup; missing daughter " + daughter.getRegionNameAsString());
>       MetaEditor.addDaughter(catalogTracker, daughter, null);
>       // TODO: Log WARN if the regiondir does not exist in the fs.  If its not
>       // there then something wonky about the split -- things will keep going
>       // but could be missing references to parent region.
>       // And assign it.
>       assignmentManager.assign(daughter, true);
> {code}
> we call assign of the daughers.
> Now after this we again start with the below code.
> {code}
>         if (processDeadRegion(e.getKey(), e.getValue(),
>             this.services.getAssignmentManager(),
>             this.server.getCatalogTracker())) {
>           this.services.getAssignmentManager().assign(e.getKey(), true);
> {code}
> Now when the SSH scanned the META it had R1, D1 and D2.
> So as part of the above code D1 and D2 which where assigned by fixUpDaughters
> is again assigned by
> {code}
> this.services.getAssignmentManager().assign(e.getKey(), true);
> {code}
> Thus leading to a zookeeper issue due to bad version and killing the master.
> The important part here is the regions that were deleted are recreated which i think is more critical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
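
Editorial sketch of the race described in the issue text: a shutdown handler works from a stale scan of META while a table delete cleans META in between, so the daughter fixup re-adds regions that were just deleted. This is a toy model only. The class `StaleSnapshotRace`, its fields, and the `guard` flag are hypothetical names made up for illustration; they are not HBase APIs and this is not the committed patch. The guard stands in for the general idea of re-checking table state before recreating anything.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the race; none of these names are HBase APIs. */
final class StaleSnapshotRace {
    /** Stand-in for the META catalog: region names currently recorded. */
    final Set<String> meta = new HashSet<>();
    /** Stand-in for table state: true once the table is disabled or deleted. */
    boolean tableDisabledOrDeleted = false;

    /** What the shutdown handler does first: copy the regions it will process. */
    List<String> snapshot() {
        return new ArrayList<>(meta);
    }

    /** Delete table: record the state change, then clean META. */
    void deleteTable() {
        tableDisabledOrDeleted = true;
        meta.clear();
    }

    /**
     * Re-adds a daughter that is missing from META. With guard == false this
     * reproduces the reported behaviour; with guard == true the table state
     * is re-checked first. Returns true if the daughter was (re)created.
     */
    boolean fixupDaughter(String daughter, boolean guard) {
        if (guard && tableDisabledOrDeleted) {
            return false; // table is gone: do not resurrect its regions
        }
        if (!meta.contains(daughter)) {
            meta.add(daughter);
            return true;
        }
        return false;
    }

    /** Runs the interleaving from the report and returns META afterwards. */
    static Set<String> run(boolean guard) {
        StaleSnapshotRace cluster = new StaleSnapshotRace();
        cluster.meta.add("R1");
        cluster.meta.add("D1");
        cluster.meta.add("D2");
        List<String> stale = cluster.snapshot(); // handler scans META
        cluster.deleteTable();                   // delete runs in between
        if (stale.contains("R1")) {              // handler resumes on stale data
            cluster.fixupDaughter("D1", guard);
            cluster.fixupDaughter("D2", guard);
        }
        return cluster.meta;
    }
}
```

Calling `StaleSnapshotRace.run(false)` leaves D1 and D2 back in the toy META, while `StaleSnapshotRace.run(true)` leaves it empty. The real handlers are concurrent and coordinate through ZooKeeper, which this single-threaded model deliberately leaves out.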
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186299#comment-13186299 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
-----------------------------------------------

Committed to 0.90.
Thanks Stack, Ted and Chunhui for the review.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186243#comment-13186243 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
-----------------------------------------------

I am planning to commit this today.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186169#comment-13186169 ]

Zhihong Yu commented on HBASE-5155:
-----------------------------------

HBASE-5155_2.patch looks good to me.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186089#comment-13186089 ]

Zhihong Yu commented on HBASE-5155:
-----------------------------------

bq. will tell the user like to ZK is used in checking it.

I think you wanted to say 'will tell the user like no ZK is used in checking it.'

I don't have other comments, except the one @ 13/Jan/12 21:34

If you have time, please port this to 0.92
Otherwise we can open another JIRA.

Good job, Ramkrishna.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186084#comment-13186084 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
-----------------------------------------------

bq. This looks like a method used internally by AM only. Does it need to be public?
{code}
+  public void setEnabledTable(String tableName) {
{code}
I did not have this as public in the beginning.  But later in HMaster.rebuildUserRegions() I had to set the enabled table.  So I thought of exposing this from AM so that I can use it there instead of repeating the same code in HMaster.

bq. Have you tried it? In rolling restart we'll upgrade the master first usually. Won't it know how to deal w/ new zk node for ENABLED state?
Even if the master is restarted first, the above changes will be necessary, because when the master builds the table state he will not find the ENABLED state in zk.  So the above changes in Master will help him to build that state.
Yes, rolling restart was tested.

bq. FYI, don't do these kinda changes in future:
It happened when applying a formatter.  Sure Stack, I will take care of those changes.

{code}
public boolean isEnabledTable(String tableName) {
-    synchronized (this.cache) {
-      // No entry in cache means enabled table.
-      return !this.cache.containsKey(tableName);
-    }
+    return isTableState(tableName, TableState.ENABLED);
{code}
The isTableState will anyway have a synchronized(this.cache) so it should be ok?

{code}
+   * Check if the table is in DISABLED state in cache
{code}
My idea of adding 'in cache' was like the state is checked only in memory and it is not going to zk to check the state.  So i thought like the 'in cache' word will tell the user like to ZK is used in checking it.

{code}
+    // Enable the ROOT table if on process fail over the RS containing ROOT
+    // was active.
{code}
This scenario comes when the master is restarted but the RS is still alive.  Now the master should enable ROOT and META also, because when he comes up he should create the enabled node in zk.
If we don't do this step then for ROOT and META we will not have a node in zk in the above scenario.  But if the master explicitly assigns ROOT and META then there will be a zk node.  So to unify this I had to do the zkTable.setEnabledTable().

@Stack
Is it fine, Stack?  I can prepare a new patch based on your feedback and then upload a final one?
@Ted
Do you have any more comments or feedback that I can incorporate in the next patch?
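
Editorial sketch of the `isEnabledTable` change discussed in this comment. The quoted diff replaces "no entry in cache means enabled" with an explicit ENABLED state checked through `isTableState`, which holds the cache lock itself. The class below is a simplified, hypothetical stand-in written for illustration (it assumes a plain in-memory map and omits all ZooKeeper interaction); it is not the real `ZKTable` source, and `removeTable` and `setDisabledTable` are assumed helper names.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for a table-state cache; not the real ZKTable. */
final class TableStateCache {
    enum TableState { ENABLED, DISABLING, DISABLED }

    private final Map<String, TableState> cache = new HashMap<>();

    void setEnabledTable(String tableName) {
        setState(tableName, TableState.ENABLED);
    }

    void setDisabledTable(String tableName) {
        setState(tableName, TableState.DISABLED);
    }

    /** Hypothetical helper: forget a table entirely, as a delete would. */
    void removeTable(String tableName) {
        synchronized (cache) {
            cache.remove(tableName);
        }
    }

    /** Enabled only if an explicit ENABLED entry exists. */
    boolean isEnabledTable(String tableName) {
        return isTableState(tableName, TableState.ENABLED);
    }

    boolean isDisabledTable(String tableName) {
        return isTableState(tableName, TableState.DISABLED);
    }

    private void setState(String tableName, TableState state) {
        synchronized (cache) {
            cache.put(tableName, state);
        }
    }

    /** The lock is taken here, so callers need no synchronized block of their own. */
    private boolean isTableState(String tableName, TableState expected) {
        synchronized (cache) {
            return cache.get(tableName) == expected;
        }
    }
}
```

With explicit states, a table that has been removed from the cache is reported as neither enabled nor disabled, whereas the old "missing entry means enabled" rule would have reported a deleted table as enabled.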
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185878#comment-13185878 ]

Zhihong Yu commented on HBASE-5155:
-----------------------------------

Minor comment:
{code}
+    boolean istableEnabled = this.zkTable.isEnabledTable(tableName);
{code}
istableEnabled should be named isTableEnabled.

@Stack:
w.r.t the following comment:
{code}
+    // Enable the ROOT table if on process fail over the RS containing ROOT
+    // was active.
{code}
AssignmentManager delegates to this.zkTable.setEnabledTable(). This is to set the meta tables enabled in ZkTable cache.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185858#comment-13185858 ]

stack commented on HBASE-5155:
------------------------------

bq. So if we have a rolling restart scenario then this will be a problem right ? Previously the table node will not be present for the Enabled state but now we will create it.

Have you tried it?  In rolling restart we'll upgrade the master first usually.  Won't it know how to deal w/ new zk node for ENABLED state?

FYI, don't do these kinda changes in future:

{code}
-      for (HRegionInfo region: regions) {
+      for (HRegionInfo region : regions) {
{code}

What was there previously was fine... It adds bulk to your patch.

This looks like a method used internally by AM only.  Does it need to be public?

{code}
+  public void setEnabledTable(String tableName) {
{code}

In processDeadRegion, should we check parent exists before doing daughter fixups?  (It could have been deleted?)

I don't understand this comment:

{code}
+    // Enable the ROOT table if on process fail over the RS containing ROOT
+    // was active.
{code}

Same for the one on .meta.  Why do we have to enable the meta and root tables?  Aren't they always on?

Is this right:

{code}
+   * Check if the table is in DISABLED state in cache
{code}

Is it just checking cache?  This class gets updated when the zk changes right?  So it's not just a 'cache'?  I think you should drop 'from cache' in your public javadoc.  Same for isDisabling, etc.

Is this right below:

{code}
   public boolean isEnabledTable(String tableName) {
-    synchronized (this.cache) {
-      // No entry in cache means enabled table.
-      return !this.cache.containsKey(tableName);
-    }
+    return isTableState(tableName, TableState.ENABLED);
{code}

Else patch looks good to me.  Was afraid it was too much for 0.90.6 but it's looking ok.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185748#comment-13185748 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
-----------------------------------------------

Yes Ted.  Some of that part was refactored as part of some defect fix by Ming.  Also, since HBASE-4083 is in trunk, we have disablingTables and also enablingTables there.
So for trunk we may have to apply the changes considering the code there.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185745#comment-13185745 ]

Zhihong Yu commented on HBASE-5155:
-----------------------------------

I tried to port the patch to trunk.
It turns out that AssignmentManager.java is quite different between 0.90 and trunk.
e.g. the following code in rebuildUserRegions() of 0.90:
{code}
    Set disablingTables = new HashSet(1);
{code}
But in trunk, disablingTables is a field in AssignmentManager
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185723#comment-13185723 ]

Zhihong Yu commented on HBASE-5155:
---

+1.
Minor comments:
{code}
+if (true == checkIfRegionBelongsToDisabled(regionInfo)) {
+  disabled = true;
+}
{code}
Can the above be written as:
{code}
disabled = checkIfRegionBelongsToDisabled(regionInfo);
{code}
{code}
+// need to enable the table if not disable or disabling
{code}
Should read 'not disabled ...'

> ServerShutDownHandler And Disable/Delete should not happen parallely leading
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.4
> Reporter: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.90.6
>
> Attachments: HBASE-5155_latest.patch, hbase-5155_6.patch
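The simplification suggested in the review comment above can be checked with a small stand-alone sketch. `checkIfRegionBelongsToDisabled` here is a hypothetical stand-in that takes a string, not the method from the patch, and the `disabled` starting value is passed in explicitly because the surrounding patch code is not shown in this thread. The two forms agree when `disabled` starts out `false`; they differ if it could already be `true`.

```java
// Stand-alone comparison of the two forms; not HBase code.
class DisabledFlagSketch {
    // Hypothetical stand-in: a region "belongs to a disabled table" if its
    // name starts with the prefix below.
    static boolean checkIfRegionBelongsToDisabled(String regionName) {
        return regionName.startsWith("disabledTable,");
    }

    // Form used in the patch under review: only ever sets the flag to true.
    static boolean verbose(boolean disabled, String regionName) {
        if (true == checkIfRegionBelongsToDisabled(regionName)) {
            disabled = true;
        }
        return disabled;
    }

    // Form suggested in the review: overwrites the flag with the check result.
    static boolean direct(boolean disabled, String regionName) {
        disabled = checkIfRegionBelongsToDisabled(regionName);
        return disabled;
    }

    public static void main(String[] args) {
        String enabledRegion = "t1,r1";
        String disabledRegion = "disabledTable,r1";
        System.out.println(verbose(false, disabledRegion) + " " + direct(false, disabledRegion));
        System.out.println(verbose(false, enabledRegion) + " " + direct(false, enabledRegion));
        // Differs only when the flag was already true before the check.
        System.out.println(verbose(true, enabledRegion) + " " + direct(true, enabledRegion));
    }
}
```

If the flag is a fresh local initialised to `false` right before the check, the direct assignment is an equivalent and shorter form. If it can carry a `true` from earlier code, `disabled |= checkIfRegionBelongsToDisabled(...)` would keep the original behaviour.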
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185634#comment-13185634 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
---

Pls provide your comments.

> ServerShutDownHandler And Disable/Delete should not happen parallely leading
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.4
> Reporter: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.90.6
>
> Attachments: HBASE-5155_latest.patch, hbase-5155_6.patch
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185201#comment-13185201 ]

Zhihong Yu commented on HBASE-5155:
---

Test suite passed based on Ram's patch.

> ServerShutDownHandler And Disable/Delete should not happen parallely leading
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.4
> Reporter: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.90.6
>
> Attachments: HBASE-5155_latest.patch
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185114#comment-13185114 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
---

@Ted
Thanks for your review. I will address all the comments.
I will do some more testing tomorrow and then submit an updated patch.
Currently I can verify it in 0.90. I will do trunk sometime later next week.

> ServerShutDownHandler And Disable/Delete should not happen parallely leading
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.4
> Reporter: ramkrishna.s.vasudevan
> Priority: Blocker
> Attachments: HBASE-5155_latest.patch
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185095#comment-13185095 ]

Zhihong Yu commented on HBASE-5155:
---

In AssignmentManager.java, setEnabledTable():
{code}
+ LOG.error("Unable to ensure that the table will be" +
+   " enabled because of a ZooKeeper issue");
{code}
Please include tableName in the log.

In bulkAssignUserRegions():
{code}
+List regionsList = java.util.Arrays.asList(regions);
+for (HRegionInfo regionInfo : regionsList) {
{code}
Can we directly iterate over regions array ?

In ZKTable.java:
{code}
- if (!isEnabledOrDisablingTable(tableName)) {
+ if (isEnabledOrDisablingTable(tableName)) {
  LOG.warn("Moving table " + tableName + " state to disabling but was " +
    "not first in enabled state: " + this.cache.get(tableName));
{code}
Why was the above change necessary ? Now the warning doesn't match the check.

I see some long line:
{code}
 TEST_UTIL.createTable(TABLENAME, FAMILYNAME);
+ assertTrue(m.assignmentManager.getZKTable().isEnabledTable(Bytes.toString(TABLENAME)));
{code}

Overall, this patch looks very good.
Thanks for plugging a hole w.r.t. cache in ZkTable.

> ServerShutDownHandler And Disable/Delete should not happen parallely leading
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.4
> Reporter: ramkrishna.s.vasudevan
> Priority: Blocker
> Attachments: HBASE-5155_latest.patch
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185030#comment-13185030 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
---

TestRollingRestart is passing. I have tried to handle the different scenarios.
TestMasterFailOver-related scenarios are also handled.

> ServerShutDownHandler And Disable/Delete should not happen parallely leading
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.4
> Reporter: ramkrishna.s.vasudevan
> Priority: Blocker
> Attachments: HBASE-5155_latest.patch
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184844#comment-13184844 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
---

@Stack
I am afraid this may cause compatibility issues, because we now try to create the enabled node and the enabled state.
If the master fails over, we still rely on node presence in zk to build the zkTable.cache. So in a rolling restart scenario this will be a problem, right?
Previously the table node was not present for the Enabled state, but now we will create it. :(

> ServerShutDownHandler And Disable/Delete should not happen parallely leading
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.4
> Reporter: ramkrishna.s.vasudevan
> Priority: Blocker
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184164#comment-13184164 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
---

I could not upload the patch today because some test case is still failing. I will upload it tomorrow.

> ServerShutDownHandler And Disable/Delete should not happen parallely leading
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.4
> Reporter: ramkrishna.s.vasudevan
> Priority: Blocker
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183898#comment-13183898 ]

stack commented on HBASE-5155:
--

Ugh, I wrote a comment and lost it.

So, when you say above that '-> The tables and its regions are deleted including R1, D1 and D2.. (So META is cleaned)', you are saying that this happens AFTER SSH has scanned .META. and that we're doing the region processing AFTER the deletes. (I was going to say it's odd that we fix up a daughter when the parent is missing, but checking .META. and the filesystem before each daughter fixup would still leave a hole during which a delete could come in.)

On keeping a TableState.ENABLED up in zk, that could work. (I can't remember why we didn't do it that way originally -- my only thought is that we were trying to save on the state kept up in zk, which is a pretty pathetic reason.) You'll need to add an AM.isEnabledTable method to match the isDisabledTable, etc., stuff that is already there.

Good stuff Ram.

> ServerShutDownHandler And Disable/Delete should not happen parallely leading
> to recreation of regions that were deleted
> ---
>
> Key: HBASE-5155
> URL: https://issues.apache.org/jira/browse/HBASE-5155
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.4
> Reporter: ramkrishna.s.vasudevan
> Priority: Blocker
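The idea in the comment above, recording an explicit ENABLED state and adding an `isEnabledTable` query next to `isDisabledTable`, can be sketched with a toy in-memory registry. This is an assumption-laden illustration, not the AssignmentManager or ZKTable API: the class, the `shouldAssign` guard and the `legacyLooksEnabled` method are hypothetical, and the "absence means enabled" convention is modelled only from what the thread says about the previous behaviour.

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory model; names are illustrative, not HBase classes.
class ExplicitEnabledStateSketch {
    enum TableState { ENABLED, DISABLING, DISABLED }

    private final Map<String, TableState> states = new HashMap<>();

    synchronized void setState(String table, TableState state) {
        states.put(table, state);
    }

    // A deleted table has no entry at all.
    synchronized void delete(String table) {
        states.remove(table);
    }

    // Assumed older convention: no entry is read as "enabled", so a deleted
    // table is indistinguishable from an enabled one.
    synchronized boolean legacyLooksEnabled(String table) {
        return !states.containsKey(table);
    }

    // With an explicit ENABLED entry, "enabled" is a positive fact.
    synchronized boolean isEnabledTable(String table) {
        return states.get(table) == TableState.ENABLED;
    }

    synchronized boolean isDisabledTable(String table) {
        return states.get(table) == TableState.DISABLED;
    }

    // Hypothetical guard a shutdown handler could apply before re-adding or
    // assigning a region of the given table.
    boolean shouldAssign(String table) {
        return isEnabledTable(table);
    }

    public static void main(String[] args) {
        ExplicitEnabledStateSketch registry = new ExplicitEnabledStateSketch();
        registry.setState("t1", TableState.ENABLED);
        System.out.println("enabled table, assign? " + registry.shouldAssign("t1"));
        registry.setState("t1", TableState.DISABLED);
        registry.delete("t1");
        System.out.println("deleted table, assign? " + registry.shouldAssign("t1"));
        System.out.println("deleted table, legacy view enabled? " + registry.legacyLooksEnabled("t1"));
    }
}
```

The guard narrows the problem but does not close it: as noted in the comment, a delete can still arrive between the check and the assignment unless both happen under the same lock or are otherwise made atomic.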
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183786#comment-13183786 ]

chunhui shen commented on HBASE-5155:
-

I think using TableState.ENABLED is helpful, and HMaster.TableDescriptors has a similar function.
But in this issue, should we consider the situation where the deleted table is created again? Maybe SSH could differentiate that situation.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183592#comment-13183592 ]

Zhihong Yu commented on HBASE-5155:
---

+1 on utilizing TableState.ENABLED

Nice finding, Ram.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183205#comment-13183205 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
---

@Stack
After analysing the code I found one thing. Avoiding running SSH in parallel with DisableTableHandler and DeleteTableHandler may be a bigger discussion, but the above problem can be solved.
In SSH
{code}
public static boolean processDeadRegion(HRegionInfo hri, Result result,
    AssignmentManager assignmentManager,
    CatalogTracker catalogTracker) throws IOException {
  // If table is not disabled but the region is offlined,
  boolean disabled = assignmentManager.getZKTable().isDisabledTable(
      hri.getTableDesc().getNameAsString());
{code}
we check if the table is disabled. But if you look at the above logs, it is the DeleteTableHandler that has already deleted the region and also removed the table's entry from the ZKTable cache.
{code}
am.getZKTable().setEnabledTable(Bytes.toString(tableName));
{code}
Currently setEnabledTable means removing the entry from the map. So we cannot differentiate between an enabled table and a deleted table, because in both places we remove the entry from the cache map.
So can we use the currently unused TableState.ENABLED in the enable table handler, so that only the delete table handler removes the entry?
This will ensure that in SSH.processDeadRegion() we can first check whether the table is present in the map and only then proceed. If it is not present, we can be sure that the table is already deleted.
Please give your opinion.
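The TableState.ENABLED proposal in the comment above could look roughly like the following. This is a minimal sketch under stated assumptions, not the real ZKTable class: `TableStateCacheSketch` and every method on it except the names borrowed from the thread (`setEnabledTable`, `isDisabledTable`, `isEnabledTable`) are hypothetical, and it assumes every existing table gets an entry when the cache is first populated.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the table-state cache discussed in the thread.
// Enabled tables keep an explicit ENABLED entry; only delete removes the entry.
final class TableStateCacheSketch {
    enum TableState { ENABLED, DISABLED }

    private final Map<String, TableState> cache = new ConcurrentHashMap<>();

    void setEnabledTable(String table)  { cache.put(table, TableState.ENABLED); }
    void setDisabledTable(String table) { cache.put(table, TableState.DISABLED); }

    /** Assumed to be called only by the delete table handler. */
    void setDeletedTable(String table)  { cache.remove(table); }

    boolean isEnabledTable(String table)  { return cache.get(table) == TableState.ENABLED; }
    boolean isDisabledTable(String table) { return cache.get(table) == TableState.DISABLED; }
    boolean isTablePresent(String table)  { return cache.containsKey(table); }

    /**
     * Guard a shutdown handler could apply before fixup or assign:
     * a missing entry now means "deleted", so the region is skipped.
     */
    boolean shouldReassign(String table) {
        return isTablePresent(table) && !isDisabledTable(table);
    }
}
```

With this shape, an absent entry is unambiguous: the table was deleted, so `shouldReassign` returns false for it just as it does for a disabled table. As noted elsewhere in the thread, a check like this narrows the window but does not by itself synchronize the handlers.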
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183122#comment-13183122 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
---

{code}
2012-01-10 11:43:34,303 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: j9t6,,1326109762514.adcbae41a5024c60c72f5752c6e1c8d4.: Daughters; j9t6,,1326176002507.49c3665a4bc656f3f6473659b64798f7., j9t6,23443]5767435g,1326176002507.0b96b5ed4c0426d3b3f13e586179c9bc. from linux-129,60020,1326175677339
2012-01-10 12:05:19,122 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for linux-129,60020,1326175677339
2012-01-10 12:06:07,153 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [linux-129,60020,1326175677339]
2012-01-10 12:09:57,865 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 7 region(s) that linux-129,60020,1326175677339 was carrying (skipping 0 regions(s) that are already in transition)
2012-01-10 12:11:30,988 INFO org.apache.hadoop.hbase.master.handler.DisableTableHandler: Attemping to disable table j9t6
2012-01-10 12:12:21,513 INFO org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled table is done=true
2012-01-10 12:13:41,624 INFO org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table operation C_M_DELETE_TABLE on table j9t6
2012-01-10 12:14:00,811 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Deleting region j9t6,,1326109762514.adcbae41a5024c60c72f5752c6e1c8d4. from META and FS
2012-01-10 12:14:02,230 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Deleted region j9t6,,1326109762514.adcbae41a5024c60c72f5752c6e1c8d4. from META
2012-01-10 12:14:07,330 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Deleting region j9t6,,1326176002507.49c3665a4bc656f3f6473659b64798f7. from META and FS
2012-01-10 12:14:07,521 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Deleted region j9t6,,1326176002507.49c3665a4bc656f3f6473659b64798f7. from META
2012-01-10 12:14:09,860 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Deleting region j9t6,23443]5767435g,1326176002507.0b96b5ed4c0426d3b3f13e586179c9bc. from META and FS
2012-01-10 12:14:10,096 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Deleted region j9t6,23443]5767435g,1326176002507.0b96b5ed4c0426d3b3f13e586179c9bc. from META
2012-01-10 12:18:11,081 DEBUG org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Offlined and split region j9t6,,1326109762514.adcbae41a5024c60c72f5752c6e1c8d4.; checking daughter presence
2012-01-10 12:18:46,450 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Fixup; missing daughter j9t6,,1326176002507.49c3665a4bc656f3f6473659b64798f7.
2012-01-10 12:18:46,775 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Added daughter j9t6,,1326176002507.49c3665a4bc656f3f6473659b64798f7. in region .META.,,1, serverInfo=null
2012-01-10 12:18:47,135 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x134c5dbd0a6 Creating (or updating) unassigned node for 49c3665a4bc656f3f6473659b64798f7 with OFFLINE state
2012-01-10 12:18:47,142 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for j9t6,,1326176002507.49c3665a4bc656f3f6473659b64798f7. so generated a random one; hri=j9t6,,1326176002507.49c3665a4bc656f3f6473659b64798f7., src=, dest=linux146,60020,1326169560093; 1 (online=1, exclude=null) available servers
2012-01-10 12:18:47,143 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region j9t6,,1326176002507.49c3665a4bc656f3f6473659b64798f7. to linux146,60020,1326169560093
2012-01-10 12:18:47,155 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handle region called from node nodeDataChanged
2012-01-10 12:18:47,155 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=linux146,60020,1326169560093, region=49c3665a4bc656f3f6473659b64798f7
2012-01-10 12:18:47,202 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handle region called from node nodeDataChanged
2012-01-10 12:18:47,202 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=linux146,60020,1326169560093, region=49c3665a4bc656f3f6473659b64798f7
2012-01-10 12:18:47,221 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handle region called from node nodeDataChanged
2012-01-10 12:18:47,221 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=linux146,60020,1326169560093, region=49c3665a4bc656f3f6473659b64798f7
2012-01-10 12:18:47,222 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event fo
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182932#comment-13182932 ]

stack commented on HBASE-5155:
--

@Ram Nice one. Do you have a snippet of log that shows this?

So, ServerShutdownHandler should be checking if the table is disabled before it does either fixup or assign? (That's what the check of (hri.isOffline()...) is supposed to be doing, only the enable/disable semantic changed so that when a table is disabled, we now set a flag for the table in zk rather than do it individually on each region; i.e. offline it.)

Or, are you saying the table was completely deleted when ServerShutdownHandler started to run? If so, then the create of the region should fail; we should make sure that if the parent table directory is not present, then we are not able to create region subdirs. We'd need a mkdir that did not do a recursive create (we need newer hadoop/hdfs for this?)

On the question of synchronization between DeleteTableHandler and ServerShutdownHandler, yes, we need to have all threads in master coordinate around state changes, whether the balancer thread, the ServerShutdownHandler executor thread, incoming splits, etc. I'd like to put up a harness in which we can repro all these race conditions... HBase-3154 helps with this (the test included shows how to mock a balance and a server shutdown handler -- we would need to make them interleave to reproduce this issue -- the log would help with reproducing the event sequence).
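The "mkdir that did not do a recursive create" idea in the comment above can be illustrated on a local filesystem. This is only an analogy using `java.nio.file`, not the Hadoop/HDFS API; the class `RegionDirSketch` and its method are hypothetical. `Files.createDirectory` creates a single directory and fails if the parent is missing, whereas `Files.createDirectories` would silently recreate the parent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Local-filesystem analogy only; hypothetical names, not HBase or HDFS code.
final class RegionDirSketch {

    /**
     * Tries to create the region directory under an existing table directory.
     * Returns false instead of recreating the table directory when it is gone.
     */
    static boolean createRegionDir(Path tableDir, String encodedRegionName) {
        try {
            // Non-recursive: throws NoSuchFileException if tableDir is missing.
            Files.createDirectory(tableDir.resolve(encodedRegionName));
            return true;
        } catch (NoSuchFileException e) {
            // Parent table directory is absent, e.g. the table was deleted.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Under this sketch, a fixup that raced with a table delete would get `false` back and could skip the daughter instead of resurrecting the table directory.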
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182715#comment-13182715 ]

Zhihong Yu commented on HBASE-5155:
---

I think Ram's question @ 09/Jan/12 17:23 hints at introducing synchronization between DeleteTableHandler and ServerShutdownhandler.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182653#comment-13182653 ]

Zhihong Yu commented on HBASE-5155:
---

Then we need to detect whether the table being deleted/disabled has a region on the underlying server.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182648#comment-13182648 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
---

Can we prevent disable and delete table from happening if ServerShutDownHandler is in progress?