[
https://issues.apache.org/jira/browse/HBASE-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744866#comment-14744866
]
Shuaifeng Zhou commented on HBASE-14407:
----------------------------------------
In master log, the error is happened like this(0.98.10):
1, master open region timeout:
{noformat}
2015-09-06 01:35:59,521 DEBUG [hm,60000,1438368907764-GeneralBulkAssigner-19]
master.AssignmentManager(1768): Bulk assigner openRegion() to
hs4,60020,1441213185092 has timed out, but the regions might already be opened
on it.
java.net.SocketTimeoutException: Call to hs4/15.173.0.115:60020 failed because
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel
to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/15.173.0.110:33771 remote=hs4/15.173.0.115:60020]
{noformat}
2, master retried open region, found that region already opened, delete node
{noformat}
2015-09-06 01:36:07,063 DEBUG [hm,60000,1438368907764-GeneralBulkAssigner-19]
master.AssignmentManager(2293): ALREADY_OPENED
NB_APP_BEHAVIOR_LABEL_000000,F\x180100\x18,1441015264846.2070d83bfd7c3fa6950c859ce842039e.
to hs4,60020,1441213185092
{noformat}
3, delete zk node in processAlreadyOpenedRegion, but region state not match
(because this is a retry, region opened previously )
{noformat}
2015-09-06 01:36:07,073 WARN [hm,60000,1438368907764-GeneralBulkAssigner-19]
zookeeper.ZKAssign(458): master:60000-0x24ee432542d01eb,
quorum=hs5:2181,hs4:2181,hm:2181, baseZNode=/hbase Attempting to delete
unassigned node 2070d83bfd7c3fa6950c859ce842039e in M_ZK_REGION_OFFLINE state
but node is in RS_ZK_REGION_OPENED state
2015-09-06 01:36:07,073 INFO [hm,60000,1438368907764-GeneralBulkAssigner-19]
master.AssignmentManager(3614): Failed to delete the offline node for
2070d83bfd7c3fa6950c859ce842039e. The node type may not match
{noformat}
at the same time, will modiry regionStates in master memory:
{code:title=AssignmentManager.java|borderStyle=solid}
private void processAlreadyOpenedRegion(HRegionInfo region, ServerName sn) {
// Remove region from in-memory transition and unassigned node from ZK
// While trying to enable the table the regions of the table were
// already enabled.
LOG.debug("ALREADY_OPENED " + region.getRegionNameAsString()
+ " to " + sn);
String encodedName = region.getEncodedName();
deleteNodeInStates(encodedName, "offline", sn,
EventType.M_ZK_REGION_OFFLINE);
regionStates.regionOnline(region, sn);
}
{code}
4, handling previous success open region zk event (delayed)
{noformat}
2015-09-06 01:36:07,073 INFO [hm,60000,1438368907764-GeneralBulkAssigner-19]
master.RegionStates(826): Transition {2070d83bfd7c3fa6950c859ce842039e
state=PENDING_OPEN, ts=1441474499424, server=hs4,60020,1441213185092} to
{2070d83bfd7c3fa6950c859ce842039e state=OPEN, ts=1441474567073,
server=hs4,60020,1441213185092}
2015-09-06 01:36:07,073 INFO [hm,60000,1438368907764-GeneralBulkAssigner-19]
master.RegionStates(371): Onlined 2070d83bfd7c3fa6950c859ce842039e on
hs4,60020,1441213185092
2015-09-06 01:36:33,960 DEBUG [AM.ZK.Worker-pool2-t5251]
master.AssignmentManager(926): Handling RS_ZK_REGION_OPENED,
server=hs4,60020,1441213185092, region=2070d83bfd7c3fa6950c859ce842039e, which
is more than 15 seconds late, current_state={2070d83bfd7c3fa6950c859ce842039e
state=OPEN, ts=1441474567073, server=hs4,60020,1441213185092}
{noformat}
5, modify regionStates again, but found that region already opened, Error,
close region
{noformat}
2015-09-06 01:36:33,961 WARN [AM.ZK.Worker-pool2-t5251]
master.AssignmentManager(1061): Received OPENED for
2070d83bfd7c3fa6950c859ce842039e from hs4,60020,1441213185092 but the region
isn't PENDING_OPEN/OPENING here: {2070d83bfd7c3fa6950c859ce842039e state=OPEN,
ts=1441474567073, server=hs4,60020,1441213185092}
2015-09-06 01:36:33,965 DEBUG [AM.ZK.Worker-pool2-t5251]
master.AssignmentManager(1849): Sent CLOSE to hs4,60020,1441213185092 for
region
NB_APP_BEHAVIOR_LABEL_000000,F\x180100\x18,1441015264846.2070d83bfd7c3fa6950c859ce842039e.
{noformat}
> NotServingRegion: hbase region closed forever
> ---------------------------------------------
>
> Key: HBASE-14407
> URL: https://issues.apache.org/jira/browse/HBASE-14407
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 0.98.10, 1.2.0, 1.1.2, 1.3.0
> Reporter: Shuaifeng Zhou
> Assignee: Shuaifeng Zhou
> Priority: Critical
> Attachments: hs4.log, master.log
>
>
> I found a situation may cause region closed forever, and this situation
> happend usually on my cluster, version is 0.98.10, but 1.1.2 also have the
> problem:
> 1, master send region open to regionserver
> 2, rs open a handler do openregion
> 3, rs return resopnse to master
> 3, master not received the response, or timeout, send open region again
> 4, rs already opened the region
> 5, master processAlreadyOpenedRegion, update regionstate open in master
> memory
> 6, master received zk message region opened(for some reason late, eg: net
> work), and triger update regionstate open, but find that region already
> opened, ERROR!
> 7, master send close region, and region be closed forever.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)