[ 
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744866#comment-14744866
 ] 

Shuaifeng Zhou commented on HBASE-14407:
----------------------------------------

In master log, the error is happened like this(0.98.10):

1, master open region timeout:
{noformat}
2015-09-06 01:35:59,521 DEBUG [hm,60000,1438368907764-GeneralBulkAssigner-19] 
master.AssignmentManager(1768): Bulk assigner openRegion() to 
hs4,60020,1441213185092 has timed out, but the regions might already be opened 
on it.
java.net.SocketTimeoutException: Call to hs4/15.173.0.115:60020 failed because 
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/15.173.0.110:33771 remote=hs4/15.173.0.115:60020]
{noformat}
2, master retried open region, found that region already opened, delete node
{noformat}
2015-09-06 01:36:07,063 DEBUG [hm,60000,1438368907764-GeneralBulkAssigner-19] 
master.AssignmentManager(2293): ALREADY_OPENED 
NB_APP_BEHAVIOR_LABEL_000000,F\x180100\x18,1441015264846.2070d83bfd7c3fa6950c859ce842039e.
 to hs4,60020,1441213185092
{noformat}

3, delete zk node in processAlreadyOpenedRegion, but region state not match 
(because this is a retry, region opened previously )
{noformat}
2015-09-06 01:36:07,073 WARN  [hm,60000,1438368907764-GeneralBulkAssigner-19] 
zookeeper.ZKAssign(458): master:60000-0x24ee432542d01eb, 
quorum=hs5:2181,hs4:2181,hm:2181, baseZNode=/hbase Attempting to delete 
unassigned node 2070d83bfd7c3fa6950c859ce842039e in M_ZK_REGION_OFFLINE state 
but node is in RS_ZK_REGION_OPENED state
2015-09-06 01:36:07,073 INFO  [hm,60000,1438368907764-GeneralBulkAssigner-19] 
master.AssignmentManager(3614): Failed to delete the offline node for 
2070d83bfd7c3fa6950c859ce842039e. The node type may not match
{noformat}
at the same time, will modiry regionStates in master memory:
{code:title=AssignmentManager.java|borderStyle=solid}
  private void processAlreadyOpenedRegion(HRegionInfo region, ServerName sn) {
    // Remove region from in-memory transition and unassigned node from ZK
    // While trying to enable the table the regions of the table were
    // already enabled.
    LOG.debug("ALREADY_OPENED " + region.getRegionNameAsString()
      + " to " + sn);
    String encodedName = region.getEncodedName();
    deleteNodeInStates(encodedName, "offline", sn, 
EventType.M_ZK_REGION_OFFLINE);
    regionStates.regionOnline(region, sn);
  }
{code}

4, handling previous success open region zk event (delayed)
{noformat}
2015-09-06 01:36:07,073 INFO  [hm,60000,1438368907764-GeneralBulkAssigner-19] 
master.RegionStates(826): Transition {2070d83bfd7c3fa6950c859ce842039e 
state=PENDING_OPEN, ts=1441474499424, server=hs4,60020,1441213185092} to 
{2070d83bfd7c3fa6950c859ce842039e state=OPEN, ts=1441474567073, 
server=hs4,60020,1441213185092}
2015-09-06 01:36:07,073 INFO  [hm,60000,1438368907764-GeneralBulkAssigner-19] 
master.RegionStates(371): Onlined 2070d83bfd7c3fa6950c859ce842039e on 
hs4,60020,1441213185092
2015-09-06 01:36:33,960 DEBUG [AM.ZK.Worker-pool2-t5251] 
master.AssignmentManager(926): Handling RS_ZK_REGION_OPENED, 
server=hs4,60020,1441213185092, region=2070d83bfd7c3fa6950c859ce842039e, which 
is more than 15 seconds late, current_state={2070d83bfd7c3fa6950c859ce842039e 
state=OPEN, ts=1441474567073, server=hs4,60020,1441213185092}
{noformat}

5, modify regionStates again, but found that region already opened, Error, 
close region
{noformat}
2015-09-06 01:36:33,961 WARN  [AM.ZK.Worker-pool2-t5251] 
master.AssignmentManager(1061): Received OPENED for 
2070d83bfd7c3fa6950c859ce842039e from hs4,60020,1441213185092 but the region 
isn't PENDING_OPEN/OPENING here: {2070d83bfd7c3fa6950c859ce842039e state=OPEN, 
ts=1441474567073, server=hs4,60020,1441213185092}
2015-09-06 01:36:33,965 DEBUG [AM.ZK.Worker-pool2-t5251] 
master.AssignmentManager(1849): Sent CLOSE to hs4,60020,1441213185092 for 
region 
NB_APP_BEHAVIOR_LABEL_000000,F\x180100\x18,1441015264846.2070d83bfd7c3fa6950c859ce842039e.
{noformat}

> NotServingRegion: hbase region closed forever
> ---------------------------------------------
>
>                 Key: HBASE-14407
>                 URL: https://issues.apache.org/jira/browse/HBASE-14407
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.98.10, 1.2.0, 1.1.2, 1.3.0
>            Reporter: Shuaifeng Zhou
>            Assignee: Shuaifeng Zhou
>            Priority: Critical
>         Attachments: hs4.log, master.log
>
>
>     I found a situation may cause region closed forever, and this situation 
> happend usually on my cluster, version is 0.98.10, but 1.1.2 also have the 
> problem:
>     1, master send region open to regionserver
>     2, rs open a handler do openregion
>     3, rs return resopnse to master
>     3, master not received the response, or timeout, send open region again
>     4, rs already opened the region
>     5, master processAlreadyOpenedRegion, update regionstate open in master 
> memory
>     6, master received zk message region opened(for some reason late, eg: net 
> work), and triger update regionstate open, but find that region already 
> opened, ERROR!
>     7, master send close region, and region be closed forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to