[ https://issues.apache.org/jira/browse/HBASE-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Busbey updated HBASE-13605:
--------------------------------
    Fix Version/s:     (was: 1.1.4)
                       (was: 1.2.0)
           Status: Open  (was: Patch Available)

Moving out of patch available status, based on Enis' comment about things needing a redesign. For the same reason, unbooking from 1.1.z and 1.2.z. If folks expect such a redesign might happen in those versions, let's have the discussion on dev@.

> RegionStates should not keep its list of dead servers
> -----------------------------------------------------
>
>                 Key: HBASE-13605
>                 URL: https://issues.apache.org/jira/browse/HBASE-13605
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0
>
>         Attachments: hbase-13605_v1.patch, hbase-13605_v3-branch-1.1.patch,
> hbase-13605_v4-branch-1.1.patch, hbase-13605_v4-master.patch
>
>
> As mentioned in
> https://issues.apache.org/jira/browse/HBASE-9514?focusedCommentId=13769761&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13769761
> and HBASE-12844 we should have only 1 source of cluster membership.
> The list of dead servers and RegionStates doing its own liveness check
> (ServerManager.isServerReachable()) has caused an assignment problem again in
> a test cluster where the region states "thinks" that the server is dead and
> SSH will handle the region assignment. However the RS is not dead at all,
> living happily, and never gets zk expiry or YouAreDeadException or anything.
> This leaves the list of regions unassigned in OFFLINE state.
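To make the failure mode described above concrete, here is a minimal Java sketch. It is not HBase code and does not reflect any of the attached patches: `Membership`, `DivergentRegionStates` and `DelegatingRegionStates` are hypothetical names chosen for illustration, with `Membership` loosely standing in for the single source of cluster membership and the other two contrasting a component that keeps its own dead list with one that does not.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; all class names here are hypothetical, not HBase APIs.
final class MembershipSketch {

    /** The single source of truth for which servers are alive. */
    static final class Membership {
        private final Set<String> online = new HashSet<>();

        void register(String server) { online.add(server); }

        void expire(String server) { online.remove(server); }

        boolean isOnline(String server) { return online.contains(server); }
    }

    /** Problematic shape: a private dead list fed by its own reachability probe. */
    static final class DivergentRegionStates {
        private final Set<String> deadServers = new HashSet<>();

        // One failed probe marks the server dead locally; nothing tells Membership.
        void probeFailed(String server) { deadServers.add(server); }

        boolean canAssignTo(String server) { return !deadServers.contains(server); }
    }

    /** Alternative shape: no local dead list, every decision defers to Membership. */
    static final class DelegatingRegionStates {
        private final Membership membership;

        DelegatingRegionStates(Membership membership) { this.membership = membership; }

        boolean canAssignTo(String server) { return membership.isOnline(server); }
    }

    public static void main(String[] args) {
        String rs = "os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268";

        Membership membership = new Membership();
        membership.register(rs);

        // A transient probe failure: the server never actually expires.
        DivergentRegionStates divergent = new DivergentRegionStates();
        divergent.probeFailed(rs);

        DelegatingRegionStates delegating = new DelegatingRegionStates(membership);

        System.out.println("membership says online: " + membership.isOnline(rs));
        System.out.println("divergent can assign:   " + divergent.canAssignTo(rs));
        System.out.println("delegating can assign:  " + delegating.canAssignTo(rs));
    }
}
```

In the divergent variant the two views disagree after a single failed probe, and since the server never expires, nothing ever clears the local dead entry. That is the shape of the stuck OFFLINE regions reported here, under the simplifying assumptions of this sketch.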
> master assigning the region:
> {code}
> 15-04-20 09:02:25,780 DEBUG [AM.ZK.Worker-pool3-t330] master.RegionStates: Onlined 77dddcd50c22e56bfff133c0e1f9165b on os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 {ENCODED => 77dddcd50c
> {code}
> Master then disabled the table, and unassigned the region:
> {code}
> 2015-04-20 09:02:27,158 WARN [ProcedureExecutorThread-1] zookeeper.ZKTableStateManager: Moving table loadtest_d1 state from DISABLING to DISABLING
> Starting unassign of loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b. (offlining), current state: {77dddcd50c22e56bfff133c0e1f9165b state=OPEN, ts=1429520545780, server=os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268}
> bleProcedure$BulkDisabler-0] master.AssignmentManager: Sent CLOSE to os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 for region loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b.
> 2015-04-20 09:02:27,414 INFO [AM.ZK.Worker-pool3-t316] master.RegionStates: Offlined 77dddcd50c22e56bfff133c0e1f9165b from os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> {code}
> On table re-enable, AM does not assign the region:
> {code}
> 2015-04-20 09:02:30,415 INFO [ProcedureExecutorThread-3] balancer.BaseLoadBalancer: Reassigned 25 regions. 25 retained the pre-restart assignment.
> 2015-04-20 09:02:30,415 INFO [ProcedureExecutorThread-3] procedure.EnableTableProcedure: Bulk assigning 25 region(s) across 5 server(s), retainAssignment=true
> l,16000,1429515659726-GeneralBulkAssigner-4] master.RegionStates: Couldn't reach online server os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> l,16000,1429515659726-GeneralBulkAssigner-4] master.AssignmentManager: Updating the state to OFFLINE to allow to be reassigned by SSH
> nmentManager: Skip assigning loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b., it is on a dead but not processed yet server: os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)