[ https://issues.apache.org/jira/browse/HBASE-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HBASE-7799:
------------------------------------
Attachment: org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted-output.txt.gz

Attaching log. Note that it has a buggy experimental feature which currently makes HCM retry longer, but this should have no bearing on the problem...

> reassigning region stuck in open still may not work correctly due to leftover
> ZK node
> -------------------------------------------------------------------------------------
>
> Key: HBASE-7799
> URL: https://issues.apache.org/jira/browse/HBASE-7799
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Attachments:
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted-output.txt.gz
>
>
> (logs grepped by region name, and abridged.)
> META server was dead so OpenRegionHandler for the region took a while, and
> was interrupted:
> {code}
> 2013-02-08 14:35:01,555 DEBUG
> [RS_OPEN_REGION-10.11.2.92,64485,1360362800564-2]
> handler.OpenRegionHandler(255): Interrupting thread
> Thread[PostOpenDeployTasks:871d1c3bdf98a2c93b527cb6cc61327d,5,main]
> {code}
> Then the master tried to force the region offline and reassign it:
> {code}
> 2013-02-08 14:35:06,500 INFO
> [MASTER_SERVER_OPERATIONS-10.11.2.92,64483,1360362800340-1]
> master.RegionStates(347): Found opening region
> {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=OPENING, ts=1360362901596, server=10.11.2.92,64485,1360362800564} to
> be reassigned by SSH for 10.11.2.92,64485,1360362800564
> 2013-02-08 14:35:06,500 INFO
> [MASTER_SERVER_OPERATIONS-10.11.2.92,64483,1360362800340-1]
> master.RegionStates(242): Region {NAME =>
> 'IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.',
> STARTKEY => '7333332c', ENDKEY => '7ffffff8', ENCODED =>
> 871d1c3bdf98a2c93b527cb6cc61327d,} transitioned from
> {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=OPENING, ts=1360362901596, server=10.11.2.92,64485,1360362800564} to
> {IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=CLOSED, ts=1360362906500, server=null}
> 2013-02-08 14:35:06,505 DEBUG
> [10.11.2.92,64483,1360362800340-GeneralBulkAssigner-1]
> master.AssignmentManager(1530): Forcing OFFLINE;
> was={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=CLOSED, ts=1360362906500, server=null}
> 2013-02-08 14:35:06,506 DEBUG
> [10.11.2.92,64483,1360362800340-GeneralBulkAssigner-1]
> zookeeper.ZKAssign(176): master:64483-0x13cbbf1025d0000 Async create of
> unassigned node for 871d1c3bdf98a2c93b527cb6cc61327d with OFFLINE state
> {code}
> But it didn't delete the original ZK node?
> {code}
> 2013-02-08 14:35:06,509 WARN [main-EventThread] master.OfflineCallback(59):
> Node for /hbase/region-in-transition/871d1c3bdf98a2c93b527cb6cc61327d already
> exists
> 2013-02-08 14:35:06,509 DEBUG [main-EventThread] master.OfflineCallback(69):
> rs={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=OFFLINE, ts=1360362906506, server=null},
> server=10.11.2.92,64488,1360362800651
> 2013-02-08 14:35:06,512 DEBUG [main-EventThread]
> master.OfflineCallback$ExistCallback(106):
> rs={IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> state=OFFLINE, ts=1360362906506, server=null},
> server=10.11.2.92,64488,1360362800651
> {code}
> So it went into an infinite cycle of failing to assign due to this:
> {code}
> 2013-02-08 14:35:06,517 INFO [PRI IPC Server handler 7 on 64488]
> regionserver.HRegionServer(3435): Received request to open region:
> IntegrationTestRebalanceAndKillServersTargeted,7333332c,1360362805563.871d1c3bdf98a2c93b527cb6cc61327d.
> on 10.11.2.92,64488,1360362800651
> 2013-02-08 14:35:06,521 WARN
> [RS_OPEN_REGION-10.11.2.92,64488,1360362800651-0] zookeeper.ZKAssign(762):
> regionserver:64488-0x13cbbf1025d0004 Attempt to transition the unassigned
> node for 871d1c3bdf98a2c93b527cb6cc61327d from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was in the state
> RS_ZK_REGION_OPENING set by the server [wrong server name redacted, see
> HBASE-7798]
> {code}
> Transitioning failed-to-open similarly fails.
> It seems like the master needs to nuke the ZK node unconditionally when
> forcing the region offline?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
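
The failure sequence reported in the issue (a leftover node in RS_ZK_REGION_OPENING, an OFFLINE create that finds the node already present, and a new server that then cannot transition it) can be sketched with a small in-memory model. This is only an illustration of the reporter's hypothesis, not HBase code: the class `StaleNodeSketch`, its methods, and the server names `rs-old`, `rs-new`, and `master` are all hypothetical, and a plain map stands in for ZooKeeper. The delete-then-create in `forceOffline` is not atomic in a real ZooKeeper ensemble, so races are ignored here.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the region-in-transition znodes; not HBase code. */
public class StaleNodeSketch {
    /** The two node states that appear in the logs of the report. */
    public enum State { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING }

    /** Node payload: the state plus the name of the server that wrote it. */
    private static final class Node {
        final State state;
        final String owner;
        Node(State state, String owner) { this.state = state; this.owner = owner; }
    }

    private final Map<String, Node> nodes = new HashMap<>();

    /** Create-if-absent: returns false and keeps the old node when one already exists. */
    public boolean createOffline(String region, String master) {
        return nodes.putIfAbsent(region, new Node(State.M_ZK_REGION_OFFLINE, master)) == null;
    }

    /** Unconditional variant: drop whatever is there, then create the OFFLINE node. */
    public void forceOffline(String region, String master) {
        nodes.remove(region);
        nodes.put(region, new Node(State.M_ZK_REGION_OFFLINE, master));
    }

    /** Region server side: OFFLINE -> OPENING succeeds only if the node is currently OFFLINE. */
    public boolean transitionToOpening(String region, String server) {
        Node n = nodes.get(region);
        if (n == null || n.state != State.M_ZK_REGION_OFFLINE) {
            return false;
        }
        nodes.put(region, new Node(State.RS_ZK_REGION_OPENING, server));
        return true;
    }

    public static void main(String[] args) {
        StaleNodeSketch zk = new StaleNodeSketch();
        String region = "871d1c3bdf98a2c93b527cb6cc61327d";

        // First assignment: rs-old moves the node to OPENING and then dies.
        zk.createOffline(region, "master");
        zk.transitionToOpening(region, "rs-old");

        // Reassignment with create-if-absent: the stale OPENING node survives.
        System.out.println("create succeeded: " + zk.createOffline(region, "master"));
        System.out.println("rs-new can open:  " + zk.transitionToOpening(region, "rs-new"));

        // Reassignment with an unconditional reset of the node.
        zk.forceOffline(region, "master");
        System.out.println("rs-new can open after force: " + zk.transitionToOpening(region, "rs-new"));
    }
}
```

In this model the create-if-absent path leaves the new server permanently unable to open the region, which matches the retry loop described above, while the unconditional reset lets the transition go through.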