apurtell commented on pull request #2574:
URL: https://github.com/apache/hbase/pull/2574#issuecomment-716699603


   I have found one test where interrupting by default causes a repeatable 
problem:
   
   TestSyncReplicationActive
   
   
        [ERROR]   TestSyncReplicationActive.testActive:99 
        Expected: a string containing "only marker edit is allowed"
             but: was "Failed after attempts=1, exceptions:
        2020-10-25T02:46:59.367Z, java.io.InterruptedIOException: java.io.InterruptedIOException
            at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.convertInterruptedExceptionToIOException(AbstractFSWAL.java:878)
            at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:866)
            at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:710)
            at org.apache.hadoop.hbase.regionserver.HRegion.sync(HRegion.java:9031)
            at org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(HRegion.java:8624)
            at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4674)
            at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4594)
            at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4522)
            at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4992)
            at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4987)
            at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4983)
            at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3302)
            at org.apache.hadoop.hbase.regionserver.RSRpcServices.put(RSRpcServices.java:3031)
            at org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2994)
            at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45251)
            at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:397)
            at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
            at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
            at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
        Caused by: java.lang.InterruptedException
            at java.lang.Object.wait(Native Method)
            at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:142)
            at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:855)
            ... 17 more
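
   For context, what the client sees above is just the standard pattern of a 
blocking wait converting an interrupt into an InterruptedIOException. A 
minimal sketch of that path (not the actual AbstractFSWAL code; names here 
are illustrative only) looks like:

        import java.io.InterruptedIOException;

        // Sketch only, not the actual AbstractFSWAL code: a handler blocked on a
        // WAL sync future is interrupted, and the interrupt is rethrown to the RPC
        // caller as an InterruptedIOException, which is what the retry report wraps.
        final class SyncInterruptSketch {
          static void blockOnSync(Object syncFuture) throws InterruptedIOException {
            synchronized (syncFuture) {
              try {
                syncFuture.wait();                  // handler parks here, as in SyncFuture.get
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt status set
                InterruptedIOException iioe = new InterruptedIOException(e.getMessage());
                iioe.initCause(e);
                throw iioe;
              }
            }
          }
        }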
   
   Thinking about what to do here, it occurred to me that we do not need to be 
so greedy about interrupting handlers by default. The original motivation for 
interrupting RPCs in flight was to address the case where we get stuck closing 
the region, so we can be less aggressive and wait until we actually seem to be 
stuck. In an early version of this patch the tryLock was attempted in a loop 
that would wait the entire configured wait interval before triggering the 
abort. @bharathv rightly gave feedback that the loop wasn't providing any 
advantage, especially considering we crash the RS if interrupted, but I think 
we should bring it back, like this:
   
        waitTime = <some significant fraction of total wait interval>
        totalWaitTime = 0;
        do {
            start = EnvironmentEdgeManager.currentTime();
            acquired = tryLock(waitTime, TimeUnit.MILLISECONDS);
            totalWaitTime += EnvironmentEdgeManager.currentTime() - start;
            if (!acquired) {
                interruptRegionOperations();
            }
        } while (!acquired && totalWaitTime < <total wait interval>);
   
   This will cause us to begin interrupting region lock holders only after we 
have already waited for some significant fraction of the total wait interval. 
It also has the benefit (IMHO) of potentially issuing more than one interrupt 
to a handler, in case an earlier interrupt was swallowed by code we don't 
control, like a Hadoop library or the HDFS client. 
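
   To illustrate the "swallowed interrupt" case: the pattern below (purely 
hypothetical, not HBase code) catches InterruptedException without rethrowing 
it or restoring the thread's interrupt status, so a handler blocked in such a 
loop never notices the first interrupt and only a later one has any effect.

        import java.util.concurrent.BlockingQueue;
        import java.util.concurrent.TimeUnit;

        // Hypothetical example of third-party code that swallows an interrupt.
        final class SwallowedInterruptSketch {
          static Object pollUntilReady(BlockingQueue<Object> queue) {
            while (true) {
              try {
                Object item = queue.poll(100, TimeUnit.MILLISECONDS);
                if (item != null) {
                  return item;
                }
              } catch (InterruptedException e) {
                // Swallowed: neither rethrown nor Thread.currentThread().interrupt(),
                // so the caller keeps spinning as if nothing happened.
              }
            }
          }
        }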
   
   We do want to interrupt even things like WAL append if the handler is 
holding us up closing the region, but upon reflection I do not believe we 
should be so aggressive as to proactively issue interrupts immediately when we 
want to close. 
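
   To make the timing concrete, here is a runnable toy version of the loop 
above. Everything in it is a stand-in: a plain ReentrantReadWriteLock plays 
the role of the region close lock, interrupting a single recorded "handler" 
thread plays the role of interruptRegionOperations(), and the timeouts are 
arbitrary.

        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.locks.ReentrantReadWriteLock;

        public class DeferredInterruptDemo {
          public static void main(String[] args) throws Exception {
            ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
            Thread handler = new Thread(() -> {
              lock.readLock().lock();        // simulates a handler stuck while holding the region lock
              try {
                Thread.sleep(10_000);
              } catch (InterruptedException e) {
                System.out.println("handler interrupted, releasing lock");
              } finally {
                lock.readLock().unlock();
              }
            });
            handler.start();
            Thread.sleep(100);               // let the handler grab the lock first

            long totalWaitInterval = 3_000;  // total budget before we would give up and abort
            long waitTime = 1_000;           // the "significant fraction" per attempt
            long totalWaitTime = 0;
            boolean acquired = false;
            do {
              long start = System.currentTimeMillis();
              acquired = lock.writeLock().tryLock(waitTime, TimeUnit.MILLISECONDS);
              totalWaitTime += System.currentTimeMillis() - start;
              if (!acquired) {
                System.out.println("still blocked after " + totalWaitTime + "ms, interrupting handler");
                handler.interrupt();         // stand-in for interruptRegionOperations()
              }
            } while (!acquired && totalWaitTime < totalWaitInterval);
            System.out.println(acquired ? "close lock acquired" : "would abort here");
            if (acquired) {
              lock.writeLock().unlock();
            }
          }
        }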

