[ 
https://issues.apache.org/jira/browse/HBASE-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781502#comment-13781502
 ] 

Jimmy Xiang edited comment on HBASE-9514 at 9/29/13 9:25 PM:
-------------------------------------------------------------

Here is the list of changes:
1. fixed a bug in AM#assign (line ~2645): when bulk assign fails, each region 
should be assigned again; otherwise, the regions will be stuck in transition;
2. fixed a bug in AM#unassign (line ~2461): if the region is offline, assign it 
again (moved to the finally block, so all scenarios are covered);
3. in RegionStates, if the last hosting region server is online, get the 
server's info to confirm it has the expected start code (may be too 
conservative; I haven't seen it in my tests yet);
4. in AM, when forcing a region's state to offline, if a new plan is forced, 
check meta to make sure the last assignment has not changed (may be too 
conservative; I haven't seen it in my tests yet);
5. enhanced bulk assign a little so that if a region is already assigned, there 
is no need to force assign it.
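
As a rough illustration of items 1 and 5, here is a minimal sketch of a bulk 
assign that skips regions that are already assigned and falls back to 
assigning each remaining region individually when the bulk call fails. This is 
not the code from the patch: BulkAssignSketch, Assigner and assignAll are 
hypothetical names, and regions are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch; not the AssignmentManager code from the patch. */
final class BulkAssignSketch {

    /** Stand-in for whatever actually sends regions to a region server. */
    interface Assigner {
        /** Returns true only when the whole batch was assigned. */
        boolean bulkAssign(List<String> regions);

        /** Assigns a single region. */
        void assign(String region);
    }

    /**
     * Skips regions that are already assigned, tries one bulk call for the
     * rest, and falls back to assigning each pending region individually if
     * the bulk call fails. Returns the regions that were retried one by one.
     */
    static List<String> assignAll(List<String> regions,
                                  Predicate<String> alreadyAssigned,
                                  Assigner assigner) {
        List<String> pending = new ArrayList<>();
        for (String region : regions) {
            if (!alreadyAssigned.test(region)) {
                pending.add(region);
            }
        }
        List<String> retried = new ArrayList<>();
        if (pending.isEmpty() || assigner.bulkAssign(pending)) {
            return retried; // nothing to do, or the bulk call succeeded
        }
        // Bulk assign failed: retry every pending region so that none of
        // them is left stuck in transition.
        for (String region : pending) {
            assigner.assign(region);
            retried.add(region);
        }
        return retried;
    }
}
```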

I have a new patch in testing now (v5.1 attached). The new patch has the 
following changes:
1. added a CM action to log cluster status every 90 seconds so we know details 
about regions in transition;
2. added an hbck check after verification failure so that we know if the cluster 
is consistent, i.e., whether any region is lost/unassigned;
3. added another verify with CM disabled after verification failure so we know 
if we really have data loss.

It seems that there is no data loss now, since the verify in item 3 passes even 
though the test still fails.
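
The diagnostic sequence from items 2 and 3 can be sketched roughly as follows. 
This is only an illustration under assumed names (FailureTriageSketch, Outcome 
and triage are hypothetical, and the verify, consistency-check and 
chaos-disabling steps are passed in as callbacks); it is not the test code 
from the patch.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the post-failure checks; not the patch's test code. */
final class FailureTriageSketch {

    enum Outcome { PASSED, FAILED_BUT_DATA_INTACT, DATA_LOSS }

    /**
     * Runs the verification once. On failure, runs a consistency check
     * (for example hbck), disables the chaos monkey, and verifies again.
     * A passing second verify suggests the data is still there and the
     * first failure had another cause.
     */
    static Outcome triage(BooleanSupplier verify,
                          Runnable consistencyCheck,
                          Runnable disableChaos) {
        if (verify.getAsBoolean()) {
            return Outcome.PASSED;
        }
        consistencyCheck.run(); // reports lost/unassigned regions, if any
        disableChaos.run();     // rule out interference during the re-check
        return verify.getAsBoolean()
                ? Outcome.FAILED_BUT_DATA_INTACT
                : Outcome.DATA_LOSS;
    }
}
```

The situation described above corresponds to FAILED_BUT_DATA_INTACT: the first 
verify fails, but the second one, run with CM disabled, passes.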


> Prevent region from assigning before log splitting is done
> ----------------------------------------------------------
>
>                 Key: HBASE-9514
>                 URL: https://issues.apache.org/jira/browse/HBASE-9514
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>            Priority: Blocker
>             Fix For: 0.96.0
>
>         Attachments: trunk-9514_v1.patch, trunk-9514_v2.patch, 
> trunk-9514_v3.patch, trunk-9514_v5.1.patch, trunk-9514_v5.patch
>
>
> If a region is assigned before log splitting is done by the server shutdown 
> handler, the edits belonging to this region in the hlogs of the dead server 
> will be lost.
> Generally this is not an issue if users don't assign/unassign a region from 
> hbase shell or via hbase admin. These commands are marked for experts only in 
> the hbase shell help too.  However, chaos monkey doesn't care.
> If we can prevent such regions from being assigned at a bad time, it would 
> make things a little safer.
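
The idea in the issue description can be sketched as a small guard that 
refuses to assign a region while the logs of its last hosting server are 
still being split. This is a minimal illustration only, not how the patch 
implements it: LogSplitGuardSketch and all of its methods are hypothetical, 
and servers and regions are represented as plain strings.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical guard; the names here are not from the HBase code base. */
final class LogSplitGuardSketch {

    // Dead servers whose logs have not been fully split yet.
    private final Set<String> serversBeingSplit = ConcurrentHashMap.newKeySet();

    // Region name -> server that last hosted the region.
    private final Map<String, String> lastHost = new ConcurrentHashMap<>();

    void recordLastHost(String region, String server) {
        lastHost.put(region, server);
    }

    /** Called when a server is found dead, before its logs are split. */
    void serverDied(String server) {
        serversBeingSplit.add(server);
    }

    /** Called once log splitting for the dead server has finished. */
    void logSplittingDone(String server) {
        serversBeingSplit.remove(server);
    }

    /**
     * A region may be assigned only if its last host is unknown or is not
     * a dead server that is still waiting for log splitting to finish.
     */
    boolean canAssign(String region) {
        String server = lastHost.get(region);
        return server == null || !serversBeingSplit.contains(server);
    }
}
```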



--
This message was sent by Atlassian JIRA
(v6.1#6144)
