[ 
https://issues.apache.org/jira/browse/HBASE-21464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-21464:
-----------------------------------
    Description: 
Splitting is blocked during the split transaction. The split worker is trying to 
update meta but is unable to relocate it after a NotServingRegionException (NSRE):
{noformat}
2018-11-09 17:50:45,277 INFO  [regionserver/ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120-splits-1541785709434] client.RpcRetryingCaller: Call exception, tries=13, retries=350, started=88590 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on ip-172-31-13-83.us-west-2.compute.internal,8120,1541785618832
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3088)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2198)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2396)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)row 'test,,1541785709452.5ba6596f0050c2dab969d152829227c6.44' on table 'hbase:meta' at region=hbase:meta,1.1588230740, hostname=ip-172-31-15-225.us-west-2.compute.internal,8120,1541785640586, seqNum=0
{noformat}
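For readers unfamiliar with the pattern, the following is a minimal Python sketch of what a retrying caller is generally expected to do after an NSRE: forget the cached location and look it up again before the next try. It is not HBase code and does not claim to show where the bug is; `NotServingRegionError`, `MetaLocator`, and `call_with_relocation` are hypothetical names made up for illustration.

```python
class NotServingRegionError(Exception):
    """Hypothetical stand-in for an NSRE; not the real HBase exception class."""


class MetaLocator:
    """Hypothetical cache of the server currently believed to host hbase:meta."""

    def __init__(self, lookup):
        self._lookup = lookup   # callable that returns the current host name
        self._cached = None
        self.lookups = 0        # number of real lookups performed

    def locate(self):
        # Only perform a real lookup when nothing is cached.
        if self._cached is None:
            self._cached = self._lookup()
            self.lookups += 1
        return self._cached

    def invalidate(self):
        # Forget the cached location so the next locate() looks it up again.
        self._cached = None


def call_with_relocation(locator, rpc, max_tries=5):
    """Call rpc(host), dropping the cached location after every NSRE."""
    last_error = None
    for _ in range(max_tries):
        host = locator.locate()
        try:
            return rpc(host)
        except NotServingRegionError as err:
            last_error = err
            locator.invalidate()
    raise last_error


if __name__ == "__main__":
    hosts = iter(["rs-old", "rs-new"])   # meta moves from rs-old to rs-new
    locator = MetaLocator(lambda: next(hosts))

    def update_meta(host):
        if host == "rs-old":
            raise NotServingRegionError("hbase:meta is not online on " + host)
        return "meta updated via " + host

    print(call_with_relocation(locator, update_meta))
```

Without the `invalidate()` call, every try in this sketch would go back to the same stale host until the retries were exhausted.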
Clients, in this case YCSB, are hung with part of the keyspace missing:
{noformat}
2018-11-09 17:51:06,033 DEBUG [hconnection-0x5739e567-shared--pool1-t165] client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation=, attempt=14 of 35 failed; retrying after sleep of 20158 because: No server address listed in hbase:meta for region test,user307326104267982763,1541785754600.ef90030b05cb02305b75e9bfbc3ee081. containing row user3301635648728421323
{noformat}
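As a rough sanity check on the "sleep of 20158" above, here is a small Python sketch of a table-driven retry backoff. The multiplier table and the 100 ms base pause are assumptions, taken from what HBase 1.x is believed to use (`HConstants.RETRY_BACKOFF` and the default `hbase.client.pause`); verify them against your version. `pause_time_ms` is a hypothetical helper, not an HBase API.

```python
import random

# Assumed values, not read from the cluster in this report: the multiplier
# table believed to match HBase 1.x HConstants.RETRY_BACKOFF, and the default
# hbase.client.pause of 100 ms.
RETRY_BACKOFF = (1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200)
PAUSE_MS = 100


def pause_time_ms(tries, pause_ms=PAUSE_MS, jitter=0.01, rng=random.random):
    """Return the sleep before the next retry: table lookup plus up to 1% jitter."""
    # Tries beyond the end of the table reuse the last multiplier.
    index = max(0, min(tries, len(RETRY_BACKOFF) - 1))
    base = pause_ms * RETRY_BACKOFF[index]
    return base + int(base * rng() * jitter)


if __name__ == "__main__":
    # Late attempts land between 20000 and 20200 ms under these assumptions.
    print(pause_time_ms(13))
```

Under these assumptions, any attempt past the end of the table sleeps between 20000 and 20200 ms, which is consistent with the 20158 ms in the client log.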
Additional confirmation of the problem from the master: the balancer is blocked 
indefinitely because the split transaction is stuck:
{noformat}
2018-11-09 17:49:55,478 DEBUG [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=8100] master.HMaster: Not running balancer because 3 region(s) in transition: [{ef90030b05cb02305b75e9bfbc3ee081 state=SPLITTING_NEW, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute.internal,8120,1541785626417}, {5ba6596f0050c2dab969d152829227c6 state=SPLITTING, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute....
{noformat}
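When scanning saved master logs for stuck transitions, a small helper like the Python sketch below can pull the region, state, and timestamp out of each "region(s) in transition" entry. The regular expression is an assumption derived only from the single log line above, and `regions_in_transition` is a hypothetical helper name. It deliberately ignores the `server=` field, because the sample line is truncated there.

```python
import re
from datetime import datetime, timezone

# Sample copied from the master log above; the original line is truncated,
# so only two of the three regions in transition are visible.
LINE = (
    "Not running balancer because 3 region(s) in transition: "
    "[{ef90030b05cb02305b75e9bfbc3ee081 state=SPLITTING_NEW, ts=1541785754606, "
    "server=ip-172-31-5-92.us-west-2.compute.internal,8120,1541785626417}, "
    "{5ba6596f0050c2dab969d152829227c6 state=SPLITTING, ts=1541785754606, "
    "server=ip-172-31-5-92.us-west-2.compute...."
)

# Encoded region name (32 hex characters), state, and millisecond timestamp.
RIT = re.compile(r"\{(?P<region>[0-9a-f]{32}) state=(?P<state>\w+), ts=(?P<ts>\d+)")


def regions_in_transition(line):
    """Return (region, state, UTC datetime) for each entry found in the line."""
    found = []
    for match in RIT.finditer(line):
        since = datetime.fromtimestamp(int(match["ts"]) / 1000, tz=timezone.utc)
        found.append((match["region"], match["state"], since))
    return found


if __name__ == "__main__":
    for region, state, since in regions_in_transition(LINE):
        print(region, state, since.strftime("%Y-%m-%d %H:%M:%S"))
```

For the sample line, both visible entries carry the same `ts`, which converts to 2018-11-09 17:49:14 UTC.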
Unfortunately I don't have a lot of time to debug this before heading out for 
the weekend. Will pick it up on Monday. I saved all of the cluster logs.

  was:
ITBLL tests with an internal fork of 1.4.7 looked fine, but the same tests with 
an internal fork of 1.4.8 showed an alarming performance problem and eventual 
test failure. I can reproduce this with the 1.4.8 upstream release. I didn't try 
1.4.7 and will need to do so as a sanity check, but let's assume for now that a 
bad bug was introduced somewhere between 1.4.7 and 1.4.8.

Splitting is blocked when meta relocates during split transaction because the 
splitting thread does not try to relocate meta.

The split worker is trying to update meta but doesn't relocate it even after 
NSRE:
{noformat}
2018-11-09 17:50:45,277 INFO  [regionserver/ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120-splits-1541785709434] client.RpcRetryingCaller: Call exception, tries=13, retries=350, started=88590 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on ip-172-31-13-83.us-west-2.compute.internal,8120,1541785618832
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3088)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2198)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2396)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)row 'test,,1541785709452.5ba6596f0050c2dab969d152829227c6.44' on table 'hbase:meta' at region=hbase:meta,1.1588230740, hostname=ip-172-31-15-225.us-west-2.compute.internal,8120,1541785640586, seqNum=0
{noformat}
Clients, in this case YCSB, are hung with part of the keyspace missing:
{noformat}
2018-11-09 17:51:06,033 DEBUG [hconnection-0x5739e567-shared--pool1-t165] client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation=, attempt=14 of 35 failed; retrying after sleep of 20158 because: No server address listed in hbase:meta for region test,user307326104267982763,1541785754600.ef90030b05cb02305b75e9bfbc3ee081. containing row user3301635648728421323
{noformat}
Additional confirmation of the problem from the master: the balancer is blocked 
indefinitely because the split transaction is stuck:
{noformat}
2018-11-09 17:49:55,478 DEBUG [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=8100] master.HMaster: Not running balancer because 3 region(s) in transition: [{ef90030b05cb02305b75e9bfbc3ee081 state=SPLITTING_NEW, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute.internal,8120,1541785626417}, {5ba6596f0050c2dab969d152829227c6 state=SPLITTING, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute....
{noformat}
Unfortunately I don't have a lot of time to debug this before heading out for 
the weekend. Will pick it up on Monday. I saved all of the cluster logs.


> Splitting blocked with meta NSRE during split transaction
> ---------------------------------------------------------
>
>                 Key: HBASE-21464
>                 URL: https://issues.apache.org/jira/browse/HBASE-21464
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.5.0, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 1.4.8, 1.4.7
>            Reporter: Andrew Purtell
>            Priority: Blocker
>             Fix For: 1.5.0, 1.4.9
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)