apurtell commented on pull request #2574: URL: https://github.com/apache/hbase/pull/2574#issuecomment-716699603
I have found one test where interrupting by default causes a repeatable problem: TestSyncReplicationActive.

    [ERROR] TestSyncReplicationActive.testActive:99
    Expected: a string containing "only marker edit is allowed"
         but: was "Failed after attempts=1, exceptions:
    2020-10-25T02:46:59.367Z, java.io.InterruptedIOException: java.io.InterruptedIOException
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.convertInterruptedExceptionToIOException(AbstractFSWAL.java:878)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:866)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:710)
        at org.apache.hadoop.hbase.regionserver.HRegion.sync(HRegion.java:9031)
        at org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(HRegion.java:8624)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4674)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4594)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4522)
        at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4992)
        at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4987)
        at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:4983)
        at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3302)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.put(RSRpcServices.java:3031)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.mutate(RSRpcServices.java:2994)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45251)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:397)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
    Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:142)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:855)
        ... 17 more

Thinking about what to do here, it occurred to me that we do not need to be so greedy about interrupting handlers by default. The original motivation for interrupting RPCs in flight was to address the case where we get stuck closing the region, so we can be less aggressive and wait until we actually seem to be stuck.

In an early version of this patch the tryLock was attempted in a loop that would wait the entire configured wait interval before triggering the abort. @bharathv rightly pointed out that the loop wasn't providing any advantage, especially considering we crash the RS if interrupted, but I think we should bring it back to do this:

    waitTime = <some significant fraction of total wait interval>
    do {
      start = EnvironmentEdgeManager.getCurrentTime();
      acquired = tryLock(waitTime, TimeUnit.MILLISECONDS);
      end = EnvironmentEdgeManager.getCurrentTime();
      totalWaitTime += end - start;
      waitTime -= end - start;
      if (!acquired) {
        interruptRegionOperations();
      }
    } while (!acquired && waitTime > 0);

This will cause us to begin interrupting region lock holders only after we have already waited for some significant fraction of the total wait interval. It also has the benefit (IMHO) of potentially issuing more than one interrupt to a handler if an earlier interrupt was somehow swallowed by code we don't control, such as a Hadoop library or the HDFS client.

We do want to interrupt even things like WAL append if the handler is holding us up closing the region, but upon reflection I do not believe we should be so aggressive as to proactively issue interrupts immediately when we want to close.
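The deferred-interrupt idea in this comment can be illustrated with a small, self-contained Java sketch. It is not the code from the patch: `RegionCloseSketch`, `runOperation`, `lockForClose`, `unlockAfterClose`, `inFlight`, and `interruptRounds` are hypothetical names chosen for illustration, `System.nanoTime()` stands in for `EnvironmentEdgeManager.getCurrentTime()`, and the sketch assumes a total budget split into fixed slices so that a swallowed interrupt can be re-sent, which is one possible reading of the loop rather than the author's exact proposal.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative stand-in for a region; this is NOT HBase's HRegion. */
class RegionCloseSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  // Hypothetical bookkeeping: threads currently inside a region operation.
  private final Set<Thread> inFlight = ConcurrentHashMap.newKeySet();
  // Counts how many times the closer resorted to interrupting.
  final AtomicInteger interruptRounds = new AtomicInteger();

  /** Runs an operation under the read lock, as an RPC handler would. */
  void runOperation(Runnable op) {
    lock.readLock().lock();
    inFlight.add(Thread.currentThread());
    try {
      op.run();
    } finally {
      inFlight.remove(Thread.currentThread());
      lock.readLock().unlock();
    }
  }

  /** Interrupts every thread currently registered as in flight. */
  void interruptRegionOperations() {
    interruptRounds.incrementAndGet();
    for (Thread t : inFlight) {
      t.interrupt();
    }
  }

  /**
   * Tries to take the write lock within totalWaitMs, waiting in slices of
   * sliceMs. No interrupt is sent until a full slice has passed without
   * success; each further failed slice sends another round of interrupts.
   * Returns true with the write lock held, false once the budget is spent.
   */
  boolean lockForClose(long totalWaitMs, long sliceMs) throws InterruptedException {
    long remainingNanos = TimeUnit.MILLISECONDS.toNanos(totalWaitMs);
    long sliceNanos = TimeUnit.MILLISECONDS.toNanos(sliceMs);
    boolean acquired;
    do {
      long start = System.nanoTime();
      acquired = lock.writeLock()
          .tryLock(Math.min(sliceNanos, remainingNanos), TimeUnit.NANOSECONDS);
      remainingNanos -= System.nanoTime() - start;
      if (!acquired) {
        interruptRegionOperations();
      }
    } while (!acquired && remainingNanos > 0);
    return acquired;
  }

  /** Releases the write lock taken by a successful lockForClose. */
  void unlockAfterClose() {
    lock.writeLock().unlock();
  }
}
```

With an idle region, `lockForClose` succeeds on the first `tryLock` and never interrupts anyone. With a handler blocked inside `runOperation`, the first slice times out, the handler is interrupted, and a later slice can then take the write lock.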