[ 
https://issues.apache.org/jira/browse/HBASE-21464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686000#comment-16686000
 ] 

Andrew Purtell commented on HBASE-21464:
----------------------------------------

On the regionserver doing the splitting, we get to the point where it's time to 
update meta at 18:07:35:
{noformat}
2018-11-09 18:07:35,704 DEBUG 
[regionserver/ip-172-31-13-83.us-west-2.compute.internal/172.31.13.83:8120-splits-1541786530557]
 regionserver.SplitTransaction: Split storefiles for 
region 
test,user4112339446054425864,1541786730764.2802f0bfbe9e7d88d530c16539f95cfd. 
Daughter A: 6 storefiles, Daughter B: 6 storefiles.

...

2018-11-09 18:07:35,757 DEBUG 
[regionserver/ip-172-31-13-83.us-west-2.compute.internal/172.31.13.83:8120-splits-1541786530557]
 ipc.BlockingRpcConnection: Connecting to 
ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120

...

2018-11-09 18:08:14,168 INFO  
[regionserver/ip-172-31-13-83.us-west-2.compute.internal/172.31.13.83:8120-splits-1541786530557]
 client.RpcRetryingCaller: Call exception, tries=10, retries=350, started=38412 
ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: 
Region hbase:meta,,1 is not online on 
ip-172-31-5-92.us-west-2.compute.internal,8120,1541786481463
{noformat}
However, META had recently moved. It is no longer on ip-172-31-5-92; it moved to 
ip-172-31-15-225 about a minute earlier, at 18:06:24.

From master:
{noformat}
2018-11-09 18:06:24,690 DEBUG [AM.ZK.Worker-pool5-t64] 
master.AssignmentManager: Znode hbase:meta,,1.1588230740 deleted, state: 
{1588230740 state=OPEN, ts=1541786784688, 
server=ip-172-31-15-225.us-west-2.compute.internal,8120,1541786485409}
{noformat}
From regionserver ip-172-31-15-225:
{noformat}
2018-11-09 18:06:24,686 DEBUG [PostOpenDeployTasks:1588230740] 
regionserver.HRegionServer: Finished post open deploy task for 
hbase:meta,,1.1588230740

...

2018-11-09 18:06:24,688 DEBUG [RS_OPEN_META-ip-172-31-15-225:8120-0] 
handler.OpenRegionHandler: Opened hbase:meta,,1.1588230740 on 
ip-172-31-15-225.us-west-2.compute.internal,8120,1541786485409
{noformat}
The stuck split happens about a minute after META is redeployed and live on 
ip-172-31-15-225.
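
To make the timeline concrete, here is a small sketch that computes the gaps between the log timestamps quoted above. The class and method names (MetaTimeline, gapMillis) are made up for illustration; only the timestamps come from the logs.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MetaTimeline {
    // Timestamp layout used by the log lines quoted in this comment.
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Milliseconds elapsed from the earlier log timestamp to the later one.
    static long gapMillis(String earlier, String later) {
        LocalDateTime a = LocalDateTime.parse(earlier, LOG_TS);
        LocalDateTime b = LocalDateTime.parse(later, LOG_TS);
        return Duration.between(a, b).toMillis();
    }

    public static void main(String[] args) {
        String metaOpened = "2018-11-09 18:06:24,688";    // OpenRegionHandler on ip-172-31-15-225
        String splitConnects = "2018-11-09 18:07:35,757"; // split worker connects to ip-172-31-5-92
        String retryLogged = "2018-11-09 18:08:14,168";   // RpcRetryingCaller, tries=10

        // META had been open on its new server for this long when the split tried to update it.
        System.out.println("meta open -> split connect: "
            + gapMillis(metaOpened, splitConnects) + " ms");
        // Time spent retrying against the old location up to the tries=10 log line.
        System.out.println("split connect -> tries=10:  "
            + gapMillis(splitConnects, retryLogged) + " ms");
    }
}
```

The first gap is roughly 71 seconds, and the second is within a millisecond or so of the "started=38412 ms ago" value in the retry log line, so the retries appear to have been aimed at the old location from the first attempt.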

The relevant code attempting the update is in SplitTransactionImpl.
{code:java}
    if (!testing && useZKForAssignment) {
      if (metaEntries == null || metaEntries.isEmpty()) {
        MetaTableAccessor.splitRegion(server.getConnection(),
          parent.getRegionInfo(), daughterRegions.getFirst().getRegionInfo(),
          daughterRegions.getSecond().getRegionInfo(), server.getServerName(),
          parent.getTableDesc().getRegionReplication());
      } else {
        offlineParentInMetaAndputMetaEntries(server.getConnection(),
          parent.getRegionInfo(), daughterRegions.getFirst().getRegionInfo(),
          daughterRegions.getSecond().getRegionInfo(), server.getServerName(), metaEntries,
          parent.getTableDesc().getRegionReplication());
      }
{code}
(Not relevant here: this next bit tells the master directly when using zk-less 
assignment.)
{code:java}
    } else if (services != null && !useZKForAssignment) {
      if (!services.reportRegionStateTransition(TransitionCode.SPLIT_PONR,
          parent.getRegionInfo(), hri_a, hri_b)) {
        // Passed PONR, let SSH clean it up
        throw new IOException("Failed to notify master that split passed PONR: "
          + parent.getRegionInfo().getRegionNameWithoutKeyAsString());
      }
    }
{code}
So we either call MetaTableAccessor.splitRegion here or 
MetaTableAccessor.mutateMetaTable via offlineParentInMetaAndputMetaEntries.

The question is why the connection (acquired by server.getConnection()) used by 
MetaTableAccessor is not relocating META after the NSRE.
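
For reference, the behavior I would expect from the retrying caller looks roughly like the sketch below: on a not-serving error, drop the cached location and look it up again before the next try. This is only an illustration of the expected pattern, not the actual ConnectionManager or RpcRetryingCaller code. Every name in it (RelocateOnNsreSketch, NotServingException, RegionCall, callWithRelocation) is hypothetical, and the exception is a RuntimeException stand-in rather than the real NotServingRegionException.

```java
import java.util.function.Supplier;

public class RelocateOnNsreSketch {
    /** Hypothetical stand-in for NotServingRegionException (not the HBase class). */
    static class NotServingException extends RuntimeException {
        NotServingException(String msg) { super(msg); }
    }

    /** Hypothetical call against whichever server we believe hosts the region. */
    interface RegionCall<T> {
        T call(String server);
    }

    private final Supplier<String> locator; // authoritative lookup of the current location
    private String cachedServer;            // possibly stale cached location

    RelocateOnNsreSketch(Supplier<String> locator, String initialCachedServer) {
        this.locator = locator;
        this.cachedServer = initialCachedServer;
    }

    <T> T callWithRelocation(RegionCall<T> call, int maxTries) {
        if (maxTries < 1) {
            throw new IllegalArgumentException("maxTries must be >= 1");
        }
        NotServingException last = null;
        for (int tries = 0; tries < maxTries; tries++) {
            if (cachedServer == null) {
                cachedServer = locator.get(); // relocate after the cache was cleared
            }
            try {
                return call.call(cachedServer);
            } catch (NotServingException e) {
                last = e;
                cachedServer = null; // drop the stale location so the next try relocates
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // The cache still points at the old host; the locator knows the new one.
        RelocateOnNsreSketch caller =
            new RelocateOnNsreSketch(() -> "ip-172-31-15-225", "ip-172-31-5-92");
        String servedBy = caller.callWithRelocation(server -> {
            if (!server.equals("ip-172-31-15-225")) {
                throw new NotServingException("region is not online on " + server);
            }
            return server;
        }, 3);
        System.out.println("served by " + servedBy);
    }
}
```

In this sketch the first try fails against the stale host and the second try succeeds after relocation. What the logs show instead is the caller going back to the same stale server on every try.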

There is one change to MetaTableAccessor between 1.4.2, which does not 
reproduce, and 1.4.3, which does reproduce the problem, but looking at it I 
can't see how it would be related.
{noformat}
commit 0d8fee2158e08bc6d0907d4abbe1215eaded6ce3
Author: Pankaj Kumar <pankaj...@huawei.com>
Date:   Thu Dec 7 22:51:01 2017 +0530

    HBASE-19364, Truncate_preserve fails with table when replica region > 1
{noformat}
Looking at ConnectionManager now.

> Splitting blocked with meta NSRE during split transaction
> ---------------------------------------------------------
>
>                 Key: HBASE-21464
>                 URL: https://issues.apache.org/jira/browse/HBASE-21464
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.5.0, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 1.4.8, 1.4.7
>            Reporter: Andrew Purtell
>            Priority: Blocker
>             Fix For: 1.5.0, 1.4.9
>
>
> Splitting is blocked during split transaction. The split worker is trying to 
> update meta but isn't able to relocate it after NSRE:
> {noformat}
> 2018-11-09 17:50:45,277 INFO  
> [regionserver/ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120-splits-1541785709434]
>  client.RpcRetryingCaller: Call exception, tries=13, retries=350, 
> started=88590 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 
> is not online on ip-172-31-13-83.us-west-2.compute.internal,8120,1541785618832
>      at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3088)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>         at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2198)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2396)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> row 
> 'test,,1541785709452.5ba6596f0050c2dab969d152829227c6.44' on table 
> 'hbase:meta' at region=hbase:meta,1.1588230740, 
> hostname=ip-172-31-15-225.us-west-2.compute.internal,8120,1541785640586, 
> seqNum=0
> {noformat}
> Clients, in this case YCSB, are hung with part of the keyspace missing:
> {noformat}
> 2018-11-09 17:51:06,033 DEBUG [hconnection-0x5739e567-shared--pool1-t165] 
> client.ConnectionManager$HConnectionImplementation: locateRegionInMeta 
> parentTable=hbase:meta, metaLocation=, attempt=14 of 35 failed; retrying 
> after sleep of 20158 because: No server address listed in hbase:meta for 
> region 
> test,user307326104267982763,1541785754600.ef90030b05cb02305b75e9bfbc3ee081. 
> containing row user3301635648728421323
> {noformat}
> Balancing is blocked indefinitely because the split transaction is stuck:
> {noformat}
> 2018-11-09 17:49:55,478 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=8100] master.HMaster: 
> Not running balancer because 3 region(s) in transition: 
> [{ef90030b05cb02305b75e9bfbc3ee081 state=SPLITTING_NEW, ts=1541785754606, 
> server=ip-172-31-5-92.us-west-2.compute.internal,8120,1541785626417}, 
> {5ba6596f0050c2dab969d152829227c6 state=SPLITTING, ts=1541785754606, 
> server=ip-172-31-5-92.us-west-2.compute....
> {noformat}
>  


