[ 
https://issues.apache.org/jira/browse/HBASE-21843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760764#comment-16760764
 ] 

Duo Zhang commented on HBASE-21843:
-----------------------------------

OK, if this can happen then I think there is a serious bug in our code: once 
the AssignProcedure has succeeded, we should have already updated the meta 
region. But after the restart, the record in meta is still the old one.

So why did your HDFS cluster enter safe mode? Did all the region servers also 
crash and restart? Which region server hosted the meta region when you were 
updating it? Was it still alive after HDFS came back? If it was also dead, did 
the log splitting finish normally? Did we lose data when HDFS came back from 
safe mode?

Thanks.

> AM misses region assignment in catastrophic scenarios where RS assigned to 
> the region in Meta does not have a WAL dir.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21843
>                 URL: https://issues.apache.org/jira/browse/HBASE-21843
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 3.0.0, 2.1.0, 2.2.0
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>         Attachments: HBASE-21843.master.001.patch
>
>
> A bit unusual, but I managed to hit this twice lately, in both distributed 
> and local standalone mode, on VMs. Somehow, after some VM pause/resume, I got 
> into a situation where regions in meta were assigned to a given RS startcode 
> that had no corresponding WAL dir.
> That caused those regions to never get assigned, because the given RS 
> startcode is not found anywhere by RegionServerTracker/ServerManager, so no 
> SCP is created for this RS startcode, leaving the region "open" on a dead 
> server forever, in META.
> This could be sorted out by adding an extra check in loadMeta: if the RS 
> assigned to the region in meta is not online and doesn't have a WAL dir, 
> mark the region as offline.
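
For illustration only, the check proposed in the description might be sketched
roughly as below. This is not the attached patch and does not use HBase's real
classes: MetaLoadCheck, findOrphanedRegions and the plain String/Map/Set types
are hypothetical stand-ins for the region locations read from meta, the set of
online servers, and the servers that still have a WAL dir.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names are HBase APIs.
final class MetaLoadCheck {

    /**
     * Returns the regions whose location in meta points at a server that is
     * neither online nor has a WAL dir. No SCP would ever be scheduled for
     * such a server, so these are the regions to mark as offline.
     */
    static List<String> findOrphanedRegions(
            Map<String, String> regionToServer, // region -> "host,port,startcode" from meta
            Set<String> onlineServers,          // servers currently known to be online
            Set<String> serversWithWalDir) {    // servers that have a WAL dir
        List<String> orphaned = new ArrayList<>();
        for (Map.Entry<String, String> entry : regionToServer.entrySet()) {
            String server = entry.getValue();
            if (server == null) {
                continue; // region has no location recorded in meta
            }
            boolean online = onlineServers.contains(server);
            boolean hasWalDir = serversWithWalDir.contains(server);
            if (!online && !hasWalDir) {
                orphaned.add(entry.getKey());
            }
        }
        return orphaned;
    }
}
```

A server that is offline but still has a WAL dir is deliberately left alone
here, on the assumption that the normal SCP and log splitting path covers it.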



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
