Michael Stack created HBASE-23984:
-------------------------------------

             Summary: [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown
                 Key: HBASE-23984
                 URL: https://issues.apache.org/jira/browse/HBASE-23984
             Project: HBase
          Issue Type: Bug
            Reporter: Michael Stack


It's been failing with decent frequency of late during cluster shutdown. Seems 
basic. There is an unassign/move going on. The test only checks that the Master 
can come back up after being killed; it does not check that the move is done. 
If, on the subsequent cluster shutdown, the move can't report to the Master 
because the Master is shutting down, then the move fails, we abort the server, 
and we end up in a wonky loop where we can't close the Region because the 
server is aborting.

At the root, there is a misaccounting when the unassign close fails: we don't 
clean up references in the regionserver-local RIT (regions-in-transition) 
accounting. Deeper than this, the close code is duplicated in three places that 
I can see: in RegionServer, in CloseRegionHandler, and in 
UnassignRegionHandler.

Let me fix this issue and the code dupe.

Details:

From
https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2/5733/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestMasterAbortAndRSGotKilled-output.txt

Here is the unassign handler failing because the Master went down earlier (it's 
probably trying to talk to the old Master location):

{code}
***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Failed to close region ede67f9f661acc1241faf468b081d548 and can not recover *****
Cause:
java.io.IOException: Failed to report close to master: ede67f9f661acc1241faf468b081d548
        at org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:125)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

... then the cluster shutdown tries to close the same Region, but fails because 
we are already aborting on account of the above:

{code}
2020-03-12 08:11:16,600 ERROR [RS_CLOSE_REGION-regionserver/asf905:0-0] helpers.MarkerIgnoringBase(159): ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Unrecoverable exception while closing region hbase:namespace,,1584000652744.78f4ae5beda711a9bebad0b6b8376cc9., still finishing close *****
java.io.IOException: Aborting flush because server is aborted...
        at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2545)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2530)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2504)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2495)
        at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1650)
        at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
        at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:110)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

....

And the RS keeps looping, trying to close the Region, even though we've aborted 
and even though the RS close-Regions code has handling to deal with abort.

The trouble seems to be that when UnassignRegionHandler fails its region close, 
it does not unregister the Region from the regionserver's regions-in-transition 
map with 
rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
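
To make the misaccounting concrete, here is a minimal standalone sketch. It is 
not the HBase code and not the actual patch. The class RitCleanupSketch, the 
CloseStep interface, and both close methods are hypothetical, and a plain map 
keyed by String stands in for the map returned by 
getRegionsInTransitionInRS(). It only illustrates the general pattern: if the 
entry is removed only on the success path, a failed close leaves it behind, 
whereas removing it in a finally block clears it either way.

```java
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in; none of these names come from HBase.
final class RitCleanupSketch {
    // Stand-in for the regionserver-local regions-in-transition map.
    // Boolean.FALSE marks a close in progress, mirroring the remove() call above.
    static final ConcurrentMap<String, Boolean> regionsInTransition =
        new ConcurrentHashMap<>();

    // Hypothetical hook for the part of the close that can fail,
    // for example reporting the close to the Master.
    interface CloseStep {
        void run() throws IOException;
    }

    // Shape of the bug: the entry is only removed when the step succeeds.
    static void closeWithoutCleanup(String encodedName, CloseStep step)
            throws IOException {
        regionsInTransition.put(encodedName, Boolean.FALSE);
        step.run();
        regionsInTransition.remove(encodedName, Boolean.FALSE);
    }

    // One possible remedy: always unregister, whether or not the step throws.
    static void closeWithCleanup(String encodedName, CloseStep step)
            throws IOException {
        regionsInTransition.put(encodedName, Boolean.FALSE);
        try {
            step.run();
        } finally {
            // Two-argument remove only deletes the entry if it is still FALSE.
            regionsInTransition.remove(encodedName, Boolean.FALSE);
        }
    }

    public static void main(String[] args) {
        CloseStep failing = () -> {
            throw new IOException("Failed to report close to master");
        };
        try {
            closeWithoutCleanup("regionA", failing);
        } catch (IOException e) {
            // Expected: the simulated report failure.
        }
        System.out.println("stale entry without cleanup: "
            + regionsInTransition.containsKey("regionA"));
        try {
            closeWithCleanup("regionB", failing);
        } catch (IOException e) {
            // Expected: the simulated report failure.
        }
        System.out.println("stale entry with cleanup: "
            + regionsInTransition.containsKey("regionB"));
    }
}
```

In the sketch, regionA stays registered after the failed close while regionB 
does not. Whether the real fix belongs in a finally block or in the abort path 
is a design choice this sketch does not settle.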






