[ https://issues.apache.org/jira/browse/HBASE-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615418#comment-15615418 ]
Yu Li commented on HBASE-16960:
-------------------------------

Thanks for chiming in [~ram_krish]

Note that at this point we have already encountered an exception and set {{exception!=null}}, and according to the code below all succeeding appends will just return:
{code}
if (this.exception != null) {
  // We got an exception on an earlier attempt at append. Do not let this append
  // go through. Fail it but stamp the sequenceid into this append though failed.
  // We need to do this to close the latch held down deep in WALKey...that is waiting
  // on sequenceid assignment otherwise it will just hang out (The #append method
  // called below does this also internally).
  entry.stampRegionSequenceId();
  // Return to keep processing events coming off the ringbuffer
  return;
}
{code}
So no real append will happen before the {{RingBufferEventHandler}} handles a new sync truck, and when the new sync arrives, it will also go through the lines below and *also* clean up all {{syncFutures}} that haven't been offered to {{SyncRunner}}:
{code}
// We may have picked up an exception above trying to offer sync
if (this.exception != null) {
  cleanupOutstandingSyncsOnException(sequence,
    this.exception instanceof DamagedWALException?
      this.exception:
      new DamagedWALException("On sync", this.exception));
}
{code}
The only difference is that this cleanup will include the new sync itself.

In my understanding, we just return when an append fails and wait for the next sync to clean up the syncs because we must make sure the failed append won't be synced and returned as a success. But the problem in this JIRA is the case where no further sync arrives after the append fails, which leaves an isolated sync that then waits forever. The proposal cleans up the previous non-synced syncFutures when the append fails, so that no isolated one is left behind, and it doesn't break any existing logic.
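To make the scenario concrete, here is a toy model of the two behaviours described above. It is only a sketch and not the real FSHLog code or the attached patch: the class {{WalHandlerSketch}}, the {{pendingSyncs}} list, the {{cleanupOnAppendFailure}} flag and the helper {{isolatedSyncCompletes}} are all hypothetical names, and the real handler batches syncs and hands them to {{SyncRunner}} in ways this model ignores.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Toy model of the handler logic discussed above; NOT the real FSHLog code. */
class WalHandlerSketch {
    // true models the proposal, false models the current behaviour.
    private final boolean cleanupOnAppendFailure;
    // Plays the role of this.exception in the handler.
    private Exception exception;
    // Syncs that arrived but have not been offered to a (hypothetical) SyncRunner.
    private final List<CompletableFuture<Long>> pendingSyncs = new ArrayList<>();

    WalHandlerSketch(boolean cleanupOnAppendFailure) {
        this.cleanupOnAppendFailure = cleanupOnAppendFailure;
    }

    /** Handles an append event; failNow simulates an I/O error in the writer. */
    void onAppend(boolean failNow) {
        if (exception != null) {
            return; // an earlier append failed: skip this one, as in the first snippet
        }
        if (failNow) {
            exception = new IOException("simulated append failure");
            if (cleanupOnAppendFailure) {
                cleanupOutstandingSyncs(); // proposal: do not wait for a later sync
            }
        }
    }

    /** Handles a sync event and returns the future its caller would block on. */
    CompletableFuture<Long> onSync() {
        CompletableFuture<Long> future = new CompletableFuture<>();
        pendingSyncs.add(future);
        if (exception != null) {
            cleanupOutstandingSyncs(); // existing path: also fails this new sync
        }
        return future;
    }

    /** Fails every pending sync with the recorded exception. */
    private void cleanupOutstandingSyncs() {
        for (CompletableFuture<Long> future : pendingSyncs) {
            future.completeExceptionally(exception);
        }
        pendingSyncs.clear();
    }

    /** Scenario from this JIRA: a sync, then a failed append, then no further sync. */
    static boolean isolatedSyncCompletes(boolean cleanupOnAppendFailure) {
        WalHandlerSketch handler = new WalHandlerSketch(cleanupOnAppendFailure);
        CompletableFuture<Long> isolated = handler.onSync();
        handler.onAppend(true);
        return isolated.isDone(); // false means the waiter would hang
    }

    public static void main(String[] args) {
        System.out.println("current logic, isolated sync completes: "
            + isolatedSyncCompletes(false));
        System.out.println("proposal, isolated sync completes: "
            + isolatedSyncCompletes(true));
    }
}
```

In this model the isolated sync future is never completed under the current logic, which corresponds to the hang, while with cleanup at append-failure time it is completed exceptionally right away. If a later sync does arrive, both variants fail all pending futures, so the existing path is unchanged.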

Actually, [~aoxiang] and I also have further questions about whether the current implementation can guarantee the semantic that "failed appends won't get synced successfully", and we're still digging into it. We will open another JIRA if we find a solution.

> RegionServer hang when aborting
> -------------------------------
>
>                 Key: HBASE-16960
>                 URL: https://issues.apache.org/jira/browse/HBASE-16960
>             Project: HBase
>          Issue Type: Bug
>            Reporter: binlijin
>            Assignee: binlijin
>         Attachments: HBASE-16960.patch, RingBufferEventHandler.png,
> RingBufferEventHandler_exception.png, SyncFuture.png,
> SyncFuture_exception.png, rs1081.jstack
>
>
> We have several times seen a regionserver hang when aborting, which takes all
> regions on this regionserver out of service, and then all affected
> applications stop working.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)