[jira] [Commented] (HBASE-16960) RegionServer hang when aborting

Yu Li (JIRA) Fri, 28 Oct 2016 06:32:28 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-16960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15615418#comment-15615418
 ]


Yu Li commented on HBASE-16960:
-------------------------------

Thanks for chiming in [~ram_krish]

Note that at this point we have already hit an exception, so {{exception!=null}}, 
and according to the code below all subsequent appends will simply return:
{code}
            if (this.exception != null) {
              // We got an exception on an earlier attempt at append. Do not let this append
              // go through. Fail it but stamp the sequenceid into this append though failed.
              // We need to do this to close the latch held down deep in WALKey...that is waiting
              // on sequenceid assignment otherwise it will just hang out (The #append method
              // called below does this also internally).
              entry.stampRegionSequenceId();
              // Return to keep processing events coming off the ringbuffer
              return;
            }
{code}
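
To illustrate why the failed append still stamps the sequence id, here is a 
minimal sketch, assuming the waiter blocks on a latch until the id is assigned. 
{{ToyKey}}, {{stamp}}, {{awaitSequenceId}} and {{failedAppend}} are hypothetical 
names made up for this example, not HBase classes or methods, and the timeout 
exists only so the demo terminates:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class SequenceIdLatchSketch {
    /** Hypothetical stand-in for a WAL key whose sequence id is assigned later. */
    static class ToyKey {
        private final CountDownLatch assigned = new CountDownLatch(1);
        private volatile long sequenceId = -1;

        /** Assigns the id and releases anyone waiting for it. */
        void stamp(long id) {
            sequenceId = id;
            assigned.countDown();
        }

        /** Waits up to timeoutMs; returns -1 if the key was never stamped in time. */
        long awaitSequenceId(long timeoutMs) {
            try {
                return assigned.await(timeoutMs, TimeUnit.MILLISECONDS) ? sequenceId : -1;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
    }

    /** Models a failed append that either stamps the key anyway or skips stamping. */
    static long failedAppend(boolean stampOnFailure) {
        ToyKey key = new ToyKey();
        if (stampOnFailure) {
            key.stamp(42L); // the append failed, but the latch is still released
        }
        return key.awaitSequenceId(100);
    }

    public static void main(String[] args) {
        System.out.println("stamped on failure: " + failedAppend(true));
        System.out.println("not stamped: " + failedAppend(false));
    }
}
```

Without the timeout, the unstamped case would wait forever, which is the hang 
the code comment warns about.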

So no real append will happen before a new sync truck is handled by the 
{{RingBufferEventHandler}}, and when the new sync arrives, it will also reach 
the lines below and *also* clean up all {{syncFutures}} that haven't been 
offered to {{SyncRunner}}:
{code}
        // We may have picked up an exception above trying to offer sync
        if (this.exception != null) {
          cleanupOutstandingSyncsOnException(sequence,
            this.exception instanceof DamagedWALException?
              this.exception:
              new DamagedWALException("On sync", this.exception));
        }
{code}
And the only difference is that this cleanup will include this new sync itself.

In my understanding, when an append fails we just return and wait for the next 
sync to clean up the syncs, because we must make sure the failed append is never 
synced and reported as a success. The problem in this JIRA is the case where no 
further sync arrives after the append fails, which leaves an isolated sync 
waiting forever. The proposal tries to clean up the previous non-synced 
syncFutures so that no isolated one is left behind, without breaking any 
existing logic.
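
A toy model of the isolated sync may help here. This is only a sketch under 
stated assumptions: {{ToyHandler}}, {{pendingSyncs}} and the 
{{cleanupOnAppendFailure}} flag are hypothetical names for illustration, not the 
HBase code or the attached patch, and cleaning up at the moment the append fails 
is my reading of the proposal. The normal path of offering syncs to a sync 
runner is not modeled.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class IsolatedSyncSketch {
    /** Hypothetical stand-in for the ring buffer handler; NOT the HBase class. */
    static class ToyHandler {
        // Assumed reading of the proposal: fail pending syncs when an append fails.
        private final boolean cleanupOnAppendFailure;
        // Syncs received but not yet handed to a sync runner.
        private final List<CompletableFuture<Void>> pendingSyncs = new ArrayList<>();
        private Exception exception;

        ToyHandler(boolean cleanupOnAppendFailure) {
            this.cleanupOnAppendFailure = cleanupOnAppendFailure;
        }

        void onSync(CompletableFuture<Void> future) {
            pendingSyncs.add(future);
            if (exception != null) {
                cleanup(); // a later sync fails all pending syncs, itself included
            }
        }

        void onAppend(boolean writeFails) {
            if (exception != null) {
                return; // an earlier append failed: do not let this one through
            }
            if (writeFails) {
                exception = new IOException("simulated append failure");
                if (cleanupOnAppendFailure) {
                    cleanup();
                }
            }
        }

        private void cleanup() {
            for (CompletableFuture<Void> f : pendingSyncs) {
                f.completeExceptionally(exception);
            }
            pendingSyncs.clear();
        }
    }

    /** One sync arrives, then an append fails, and no further sync ever comes. */
    static boolean isolatedSyncCompletes(boolean cleanupOnAppendFailure) {
        ToyHandler handler = new ToyHandler(cleanupOnAppendFailure);
        CompletableFuture<Void> sync = new CompletableFuture<>();
        handler.onSync(sync);
        handler.onAppend(true);
        return sync.isDone();
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + isolatedSyncCompletes(false));
        System.out.println("with cleanup: " + isolatedSyncCompletes(true));
    }
}
```

In the first case the future is never completed, so a caller blocked on it would 
wait indefinitely. In the second it is completed exceptionally right away.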

Actually [~aoxiang] and I also came across more questions about whether the 
current implementation can guarantee the semantic that "failed appends won't 
get synced successfully", and we're still digging into it. We will open another 
JIRA if we find a solution.

> RegionServer hang when aborting
> -------------------------------
>
>                 Key: HBASE-16960
>                 URL: https://issues.apache.org/jira/browse/HBASE-16960
>             Project: HBase
>          Issue Type: Bug
>            Reporter: binlijin
>            Assignee: binlijin
>         Attachments: HBASE-16960.patch, RingBufferEventHandler.png, 
> RingBufferEventHandler_exception.png, SyncFuture.png, 
> SyncFuture_exception.png, rs1081.jstack
>
>
> We have several times seen a regionserver hang while aborting, which takes all 
> regions on this regionserver out of service and causes all affected 
> applications to stop working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
