[ https://issues.apache.org/jira/browse/HBASE-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj Das resolved HBASE-9777.
--------------------------------
    Resolution: Duplicate

HBASE-9773 should fix this. Resolving.

> Two consecutive RS crashes could lead to their SSH stepping on each other's toes and cause master abort
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-9777
>                 URL: https://issues.apache.org/jira/browse/HBASE-9777
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>
> Here is the sequence of events (with a version of 0.96 very close to RC5 version created on 10/11):
> 1. Master assigns regions to some server RS1. One particular region is 300d71b112325d43b99b6148ec7bc5b3
> 2. RS1 crashes
> 3. Master tries to bulk-reassign (this has retries as well) the regions to other RSs. Let's say one of them is RS2.
> {noformat}
> 2013-10-14 21:16:22,218 INFO [hor13n02.gq1.ygridcore.net,60000,1381784464025-GeneralBulkAssigner-0] master.RegionStates: Transitioned {300d71b112325d43b99b6148ec7bc5b3 state=OFFLINE, ts=1381785382125, server=null} to {300d71b112325d43b99b6148ec7bc5b3 state=PENDING_OPEN, ts=1381785382218, server=hor13n04.gq1.ygridcore.net,60020,1381784772417}
> {noformat}
> 4. RS2 crashes
> 5. The ServerShutdownHandler for RS2 is executed, and it tries to reassign the regions.
> {noformat}
> 2013-10-14 21:16:32,185 INFO [MASTER_SERVER_OPERATIONS-hor13n02:60000-3] master.RegionStates: Found opening region {300d71b112325d43b99b6148ec7bc5b3 state=PENDING_OPEN, ts=1381785382218, server=hor13n04.gq1.ygridcore.net,60020,1381784772417} to be reassigned by SSH for hor13n04.gq1.ygridcore.net,60020,1381784772417
> {noformat}
> 6. (5) succeeds. The region states are made OPEN.
> 7. The retry from (3) kicks in
> {noformat}
> 2013-10-14 21:16:22,222 INFO [MASTER_SERVER_OPERATIONS-hor13n02:60000-1] master.GeneralBulkAssigner: Failed assigning 52 regions to server hor13n04.gq1.ygridcore.net,60020,1381784772417, reassigning them
> {noformat}
> 8. The retry finds some region state as OPEN, and the master aborts with the stack trace:
> {noformat}
> 2013-10-14 21:16:34,342 FATAL AM.-pool1-t46 master.HMaster: Unexpected state : {300d71b112325d43b99b6148ec7bc5b3 state=OPEN, ts=1381785392864, server=hor13n08.gq1.ygridcore.net,60020,1381785385596} .. Cannot transit it to OFFLINE.
> java.lang.IllegalStateException: Unexpected state : {300d71b112325d43b99b6148ec7bc5b3 state=OPEN, ts=1381785392864, server=hor13n08.gq1.ygridcore.net,60020,1381785385596} .. Cannot transit it to OFFLINE.
>         at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:2074)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1855)
>         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1449)
>         at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:45)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.1#6144)
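
The eight steps in the quoted report can be replayed with a small toy model, which may help when reasoning about the interleaving. This is only a sketch: `RegionStates`, `force_offline`, `replay_report`, and the server label `RS3` are hypothetical stand-ins, not HBase code, and the sketch does not represent the fix in HBASE-9773. It assumes only what the report states, namely that the retried assign refuses to move a region that is already OPEN back to OFFLINE.

```python
from enum import Enum


class State(Enum):
    OFFLINE = "OFFLINE"
    PENDING_OPEN = "PENDING_OPEN"
    OPEN = "OPEN"


class RegionStates:
    """Toy stand-in for the master's in-memory region state table."""

    def __init__(self):
        self._states = {}

    def update(self, region, state, server=None):
        self._states[region] = (state, server)

    def get(self, region):
        return self._states[region]


def force_offline(states, region):
    """Hypothetical model of the check seen in step 8 of the report:
    a region that is already OPEN must not be forced back to OFFLINE."""
    state, server = states.get(region)
    if state is State.OPEN:
        raise RuntimeError(
            "Unexpected state : %s state=%s, server=%s .. Cannot transit it to OFFLINE."
            % (region, state.value, server)
        )
    states.update(region, State.OFFLINE)


def replay_report():
    """Replays steps 1-8 in order and returns (state, server, error)."""
    region = "300d71b112325d43b99b6148ec7bc5b3"
    states = RegionStates()

    # Steps 1-2: region was on RS1, which crashed; the region is now offline.
    states.update(region, State.OFFLINE)
    # Step 3: the bulk assigner hands the region to RS2.
    states.update(region, State.PENDING_OPEN, "RS2")
    # Steps 4-6: RS2 crashes; its shutdown handler reassigns the region,
    # which ends up OPEN on another server (labelled RS3 here).
    states.update(region, State.OPEN, "RS3")
    # Steps 7-8: the stale retry from step 3 now tries to offline the region.
    error = None
    try:
        force_offline(states, region)
    except RuntimeError as exc:
        error = str(exc)  # in the report, this is where the master aborts

    state, server = states.get(region)
    return state, server, error


if __name__ == "__main__":
    print(replay_report())
```

In this model the region stays OPEN on the third server and the stale retry is the only thing that fails, which matches the report: the region itself is healthy, and the abort comes from the retry acting on an outdated view of the assignment.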