apurtell commented on a change in pull request #2574:
URL: https://github.com/apache/hbase/pull/2574#discussion_r512965703
########## File path: hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java ##########

```diff
@@ -4569,13 +4665,29 @@ private void doMiniBatchMutate(BatchOperation<?> batchOp) throws IOException {
       // We should record the timestamp only after we have acquired the rowLock,
       // otherwise, newer puts/deletes/increment/append are not guaranteed to have a newer
       // timestamp
+
+      // Check for thread interrupt status in case we have been signaled from
+      // #interruptRegionOperation.
+      checkInterrupt();
+
       long now = EnvironmentEdgeManager.currentTime();
       batchOp.prepareMiniBatchOperations(miniBatchOp, now, acquiredRowLocks);

       // STEP 3. Build WAL edit
+
+      // Check for thread interrupt status in case we have been signaled from
+      // #interruptRegionOperation.
+      checkInterrupt();
+
       List<Pair<NonceKey, WALEdit>> walEdits = batchOp.buildWALEdits(miniBatchOp);

       // STEP 4. Append the WALEdits to WAL and sync.
+
+      // Check for thread interrupt status in case we have been signaled from
+      // #interruptRegionOperation. This is the last place we can do it "safely" before
+      // WAL appends.
+      checkInterrupt();
+
```

Review comment:

Since we are on the topic of WAL syncs... We could protect against interrupting a sync using a couple of different approaches. The simplest, in the context of this proposed change, would be to remove the current handler thread from the regionLockHolders set whenever we decide interrupting it is no longer the best choice, like a point of no return on the mutation code path:

```java
startRegionOperation(Operation.PUT); // add thread to the set so it is interruptible
try {
  ...
  startWALOperation(); // remove thread from the set so it will not be interrupted
  try {
    ...
  } finally {
    endWALOperation(); // add thread to the set so it is eligible to be interrupted again
  }
  ...
} finally {
  endRegionOperation(); // remove thread from the set as it is no longer actively handling RPC
}
```

However, we already have issues with sync timeouts or regionserver aborts during sync, so we are not introducing anything new here. In other words, if it is problematic that a WAL sync can be aborted before it completes, this already happens for other reasons: timeouts (presumably because HDFS is slow), HDFS-level pipeline recovery failure, or server shutdown.

Thinking about this, though, makes me think of the English expression "opening a can of worms". I don't know of a Chinese equivalent. I'm going to implement the proposal above out of an abundance of caution, and we can decide upon further review whether it makes sense.
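The "point of no return" scheme described above can be sketched as a small standalone class. This is a hypothetical illustration, not HBase's actual implementation: the names `regionLockHolders`, `checkInterrupt`, and `#interruptRegionOperation` follow the comment, but the class name, the `ConcurrentHashMap` representation of the set, and the method signatures are assumptions made for the sketch.

```java
import java.io.InterruptedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the interruptible-operation bookkeeping proposed in
// the comment: handler threads register while doing interruptible region work,
// deregister across the WAL append/sync "point of no return", and re-register
// afterward. Names and structure are illustrative, not HBase's real code.
public class InterruptibleRegionOps {
  // Stand-in for the regionLockHolders set: threads currently safe to interrupt.
  private final Map<Thread, Boolean> regionLockHolders = new ConcurrentHashMap<>();

  public void startRegionOperation() {
    // Add thread to the set so it is interruptible while holding row locks.
    regionLockHolders.put(Thread.currentThread(), Boolean.TRUE);
  }

  public void endRegionOperation() {
    // Remove thread from the set; it is no longer actively handling the RPC.
    regionLockHolders.remove(Thread.currentThread());
  }

  public void startWALOperation() {
    // Remove thread from the set so a sync in progress will not be interrupted.
    regionLockHolders.remove(Thread.currentThread());
  }

  public void endWALOperation() {
    // Add thread back so it is eligible to be interrupted again.
    regionLockHolders.put(Thread.currentThread(), Boolean.TRUE);
  }

  // Analogue of the checkInterrupt() calls in the diff: convert a pending
  // thread interrupt into an IOException before reaching the WAL append.
  public void checkInterrupt() throws InterruptedIOException {
    if (Thread.currentThread().isInterrupted()) {
      throw new InterruptedIOException("Interrupted while handling region operation");
    }
  }

  // Analogue of #interruptRegionOperation: on close/abort, interrupt only the
  // threads that are currently registered as safe to interrupt.
  public void interruptRegionOperations() {
    for (Thread t : regionLockHolders.keySet()) {
      t.interrupt();
    }
  }

  // Test hook: is the calling thread currently eligible for interruption?
  public boolean isInterruptible() {
    return regionLockHolders.containsKey(Thread.currentThread());
  }
}
```

With this shape, `interruptRegionOperations()` called during a WAL sync simply finds the syncing thread absent from the set and leaves it alone, which is the protection the comment is after.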