[ 
https://issues.apache.org/jira/browse/HBASE-21464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705565#comment-16705565
 ] 

Andrew Purtell edited comment on HBASE-21464 at 12/1/18 1:43 AM:
-----------------------------------------------------------------

I don't think recursive region relocation works the way we all expect, namely 
that a NotServingRegionException (NSRE) on meta always ends up in 
ConnectionManager#locateMeta with useCache = false. Taken as a whole, the 
recursive region relocation code is hard to understand and should be 
rewritten. I'm not going to do that today. What I do have is a patch that, in 
my test environment, reliably fixes the issue when meta is moved during split 
activity, while preserving the intents of HBASE-10785 (don't overload 
zookeeper by looking up meta every time) and HBASE-19260 (don't overload 
zookeeper with unnecessary concurrent lookups). There is a new limit on cache 
entry age for meta, hardcoded to 10 seconds (should it be configurable? I 
don't think it matters much...), to prevent getting stuck on a stale meta 
location. Consider it a safety valve we need while continuing to look at this 
problem.
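
To illustrate the age limit described above, here is a minimal sketch of a 
cached location that is refetched once the entry is older than a fixed 
maximum age. It is not the code in the attached patch: the class name 
AgedLocationCache, the injected lookup and clock, and the String location are 
all hypothetical stand-ins. The only value taken from the comment is the 
10 second limit.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical sketch, not the HBASE-21464 patch: caches one location and
 * refetches it when the cached entry is too old or the caller bypasses the cache.
 */
final class AgedLocationCache {
    /** 10 seconds, the hardcoded limit mentioned in the comment. */
    static final long MAX_AGE_MS = 10_000L;

    private final Supplier<String> lookup; // stands in for the expensive ZooKeeper lookup
    private final LongSupplier clock;      // injectable time source, in milliseconds
    private final long maxAgeMs;

    private String cached;     // last location returned by lookup, or null
    private long cachedAtMs;   // clock value when 'cached' was stored

    AgedLocationCache(Supplier<String> lookup, LongSupplier clock, long maxAgeMs) {
        this.lookup = lookup;
        this.clock = clock;
        this.maxAgeMs = maxAgeMs;
    }

    /**
     * Returns the cached location if useCache is true and the entry is younger
     * than maxAgeMs; otherwise performs a fresh lookup and caches the result.
     * The method is synchronized so that only one caller at a time performs
     * the lookup.
     */
    synchronized String locate(boolean useCache) {
        long now = clock.getAsLong();
        boolean fresh = cached != null && (now - cachedAtMs) < maxAgeMs;
        if (useCache && fresh) {
            return cached;
        }
        cached = lookup.get();
        cachedAtMs = now;
        return cached;
    }
}
```

With a limit like this, a caller that keeps asking with useCache = true is 
still guaranteed a fresh lookup at most maxAgeMs after the entry was stored, 
which is the "safety valve" behavior: a stale location cannot be served 
forever, and lookups are still not issued on every call.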

How to reproduce:
 * Run a load client. I use YCSB with 100 threads. The test table is named 
"test".
 * In the HBase shell: while true do sleep 30 ; balancer ; flush 'test' ; 
compact 'test' ; split 'test' ; balancer ; end

You've hit the problem when the shell 'balancer' command always returns 
false. On the master you'll find a split in progress that can't finish. On 
the regionserver attempting the split you'll find the split worker going back 
again and again, looking for meta, to a regionserver that no longer hosts it.


> Splitting blocked with meta NSRE during split transaction
> ---------------------------------------------------------
>
>                 Key: HBASE-21464
>                 URL: https://issues.apache.org/jira/browse/HBASE-21464
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.5.0, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 1.4.8, 1.4.7
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Blocker
>             Fix For: 1.5.0, 1.4.9
>
>         Attachments: HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch, 
> HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch
>
>
> Splitting is blocked during split transaction. The split worker is trying to 
> update meta but isn't able to relocate it after NSRE:
> {noformat}
> 2018-11-09 17:50:45,277 INFO  
> [regionserver/ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120-splits-1541785709434]
>  client.RpcRetryingCaller: Call exception, tries=13, retries=350, 
> started=88590 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 
> is not online on ip-172-31-13-83.us-west-2.compute.internal,8120,1541785618832
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3088)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2198)
>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2396)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)row 
> 'test,,1541785709452.5ba6596f0050c2dab969d152829227c6.44' on table 
> 'hbase:meta' at region=hbase:meta,1.1588230740, 
> hostname=ip-172-31-15-225.us-west-2.compute.internal,8120,1541785640586, 
> seqNum=0{noformat}
> Clients, in this case YCSB, are hung with part of the keyspace missing:
> {noformat}
> 2018-11-09 17:51:06,033 DEBUG [hconnection-0x5739e567-shared--pool1-t165] 
> client.ConnectionManager$HConnectionImplementation: locateRegionInMeta 
> parentTable=hbase:meta, metaLocation=, attempt=14 of 35 failed; retrying 
> after sleep of 20158 because: No server address listed in hbase:meta for 
> region 
> test,user307326104267982763,1541785754600.ef90030b05cb02305b75e9bfbc3ee081. 
> containing row user3301635648728421323{noformat}
> Balancing is blocked indefinitely because the split transaction is stuck:
> {noformat}
> 2018-11-09 17:49:55,478 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=8100] master.HMaster: 
> Not running balancer because 3 region(s) in transition: 
> [{ef90030b05cb02305b75e9bfbc3ee081 state=SPLITTING_NEW, ts=1541785754606, 
> server=ip-172-31-5-92.us-west-2.compute.internal,8120,1541785626417}, 
> {5ba6596f0050c2dab969d152829227c6 state=SPLITTING, ts=1541785754606, 
> server=ip-172-31-5-92.us-west-2.compute....{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
