[ https://issues.apache.org/jira/browse/HBASE-20864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang updated HBASE-20864:
------------------------------
    Issue Type: Sub-task  (was: Bug)
        Parent: HBASE-20828

> RS was killed due to master thought the region should be on a already dead server
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-20864
>                 URL: https://issues.apache.org/jira/browse/HBASE-20864
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0
>            Reporter: Allan Yang
>            Priority: Major
>         Attachments: log.zip
>
>
> When I was running ITBLL with our internal 2.0.0 version (with 2.0.1
> backported and with two other issues: HBASE-20706, HBASE-20752), I found two
> of my RSes killed by the master because the master had a different region
> state from those RSes. Strangely, the master thought these regions should be
> on an already dead server. There might be a serious bug, but I haven't found
> it yet. Here is the process:
> 1. e010125048153.bja,60020,1531137365840 crashed, and
> 4423e4182457c5b573729be4682cc3a3 was clearly assigned to
> e010125049164.bja,60020,1531136465378 during ServerCrashProcedure
> {code:java}
> 2018-07-09 20:03:32,443 INFO [PEWorker-10] procedure.ServerCrashProcedure: Start pid=2303, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=e010125048153.bja,60020,1531137365840, splitWal=true, meta=false
> 2018-07-09 20:03:39,220 DEBUG [RpcServer.default.FPBQ.Fifo.handler=294,queue=24,port=60000] assignment.RegionTransitionProcedure: Received report OPENED seqId=16021, pid=2305, ppid=2303, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList, region=4423e4182457c5b573729be4682cc3a3; rit=OPENING, location=e010125049164.bja,60020,1531136465378
> 2018-07-09 20:03:39,220 INFO [PEWorker-13] assignment.RegionStateStore: pid=2305 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3, regionState=OPEN, openSeqNum=16021, regionLocation=e010125049164.bja,60020,1531136465378
> 2018-07-09 20:03:43,190 INFO [PEWorker-12] procedure2.ProcedureExecutor: Finished pid=2303, state=SUCCESS; ServerCrashProcedure server=e010125048153.bja,60020,1531137365840, splitWal=true, meta=false in 10.7490sec
> {code}
> 2. A modify table happened later, and 4423e4182457c5b573729be4682cc3a3 was
> reopened on e010125049164.bja,60020,1531136465378
> {code:java}
> 2018-07-09 20:04:39,929 DEBUG [RpcServer.default.FPBQ.Fifo.handler=295,queue=25,port=60000] assignment.RegionTransitionProcedure: Received report OPENED seqId=16024, pid=2351, ppid=2314, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList, region=4423e4182457c5b573729be4682cc3a3, target=e010125049164.bja,60020,1531136465378; rit=OPENING, location=e010125049164.bja,60020,1531136465378
> 2018-07-09 20:04:40,554 INFO [PEWorker-6] assignment.RegionStateStore: pid=2351 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3, regionState=OPEN, openSeqNum=16024, regionLocation=e010125049164.bja,60020,1531136465378
> {code}
> 3. The active master was killed and the backup master took over, but when
> it loaded the meta entry, it clearly showed 4423e4182457c5b573729be4682cc3a3
> on the previously dead server e010125048153.bja,60020,1531137365840. That is
> very strange!
> {code:java}
> 2018-07-09 20:06:17,985 INFO [master/e010125048016:60000] assignment.RegionStateStore: Load hbase:meta entry region=4423e4182457c5b573729be4682cc3a3, regionState=OPEN, lastHost=e010125049164.bja,60020,1531136465378, regionLocation=e010125048153.bja,60020,1531137365840, openSeqNum=16024
> {code}
> 4. The RS was killed
> {code:java}
> 2018-07-09 20:06:20,265 WARN [RpcServer.default.FPBQ.Fifo.handler=297,queue=27,port=60000] assignment.AssignmentManager: Killing e010125049164.bja,60020,1531136465378: rit=OPEN, location=e010125048153.bja,60020,1531137365840, table=IntegrationTestBigLinkedList, region=4423e4182457c5b573729be4682cc3a3reported OPEN on server=e010125049164.bja,60020,1531136465378 but state has otherwise.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
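
The symptom in step 3 is that the loaded meta entry has regionState=OPEN while lastHost and regionLocation name different servers. To look for other regions in the same condition in the attached master logs, a rough Python sketch along the following lines may help. It is not part of HBase: the function name find_location_mismatches is hypothetical, the regular expression is modeled only on the single "Load hbase:meta entry" line quoted in the report, and treating every lastHost/regionLocation difference as suspicious is an assumption, not a confirmed diagnosis.

```python
import re

# Pattern modeled on the one "Load hbase:meta entry" line quoted in step 3.
# Server names contain commas (host,port,startcode), so the host fields are
# matched lazily up to the next known field name.
META_ENTRY = re.compile(
    r"Load hbase:meta entry\s+"
    r"region=(?P<region>\w+),\s*"
    r"regionState=(?P<state>\w+),\s*"
    r"lastHost=(?P<last_host>\S+?),\s*"
    r"regionLocation=(?P<location>\S+?),\s*"
    r"openSeqNum=(?P<seq>\d+)"
)


def find_location_mismatches(log_text):
    """Return OPEN meta entries whose lastHost differs from regionLocation.

    Hypothetical helper: it only reads text, it does not talk to HBase.
    """
    # Drop mail quote markers ("> ") and rejoin wrapped lines so that fields
    # split across lines by the mail client can still be matched.
    lines = [re.sub(r"^\s*>\s?", "", line) for line in log_text.splitlines()]
    normalized = " ".join(" ".join(lines).split())

    mismatches = []
    for m in META_ENTRY.finditer(normalized):
        if m.group("state") == "OPEN" and m.group("last_host") != m.group("location"):
            mismatches.append(
                {
                    "region": m.group("region"),
                    "last_host": m.group("last_host"),
                    "location": m.group("location"),
                    "open_seq_num": int(m.group("seq")),
                }
            )
    return mismatches
```

Feeding it the contents of a master log, for example find_location_mismatches(open("master.log").read()) with a placeholder file name, returns one dictionary per suspicious entry. Whether each hit reflects the same bug would still need to be checked against the procedure history for that region.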