[ https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharath Vissapragada updated HBASE-25984:
-----------------------------------------
    Release Note: Fixes a WAL lockup caused by premature reuse of sync 
futures by the WAL consumers. The lockup hangs the WAL system, blocking 
appends and syncs and thus preventing the RPC handlers from making progress. 
Without this fix, the only workaround is to force-abort the region server.
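
As an illustration of the failure mode named in the release note, below is a 
minimal sketch of a reusable future with a lost-wakeup hazard: if the future 
is re-armed for a new txid before the previous waiter observes completion, 
the waiter parks forever. All names here are hypothetical; this is not the 
actual org.apache.hadoop.hbase.regionserver.wal.SyncFuture implementation.

{code:java}
// Hypothetical reusable future, stripped down to show the lost-wakeup hazard.
public class ReusableFuture {
  private long txid = 0;      // txid the current waiter is blocked on
  private long doneTxid = 0;  // highest txid this future was completed for

  // Completer side (WAL sync machinery): mark done and wake the waiter.
  public synchronized void done(long completedTxid) {
    this.doneTxid = completedTxid;
    notifyAll();
  }

  // Waiter side (RPC handler): park until our txid is synced.
  public synchronized long get() throws InterruptedException {
    while (!isDone()) {
      wait();
    }
    return doneTxid;
  }

  // True once the currently armed txid has been synced.
  public synchronized boolean isDone() {
    return txid != 0 && doneTxid >= txid;
  }

  // Reuse for the next sync. If this runs after done() but before the old
  // waiter re-checks its condition, the waiter sees the future as not done
  // again and parks forever: the completion was silently overwritten.
  public synchronized void reset(long newTxid) {
    this.txid = newTxid;
    this.doneTxid = 0;
  }
}
{code}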

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> ------------------------------------------------------
>
>                 Key: HBASE-25984
>                 URL: https://issues.apache.org/jira/browse/HBASE-25984
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>            Reporter: Bharath Vissapragada
>            Assignee: Bharath Vissapragada
>            Priority: Critical
>              Labels: deadlock, hang
>             Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
>         Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based), and under heavy 
> load we noticed the WAL system locking up due to a subtle race in the code 
> that reuses sync futures. The bug applies to the FSHLog implementations on 
> all branches.
> Symptoms:
> On heavily loaded clusters we noticed region servers hanging abruptly with 
> filled-up handler queues and stuck MVCC, indicating that appends/syncs were 
> making no progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - STUCK for : 296000 millis. MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>   java.lang.Object.wait(Native Method)
>   org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
>   org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
>   .....
> {noformat}
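>
> The reuse comes from each handler caching a sync future per thread instead 
> of allocating a fresh one per sync. A hedged sketch of that pattern, reusing 
> the hypothetical ReusableFuture above (the real FSHLog keeps a per-thread 
> SyncFuture cache; the names here are illustrative):
> {code:java}
> // Hypothetical per-handler future caching; illustrative names only.
> final class HandlerFutures {
>   private static final ThreadLocal<ReusableFuture> CACHED =
>       ThreadLocal.withInitial(ReusableFuture::new);
>
>   // Each sync re-arms the same object for a new txid. If the WAL side
>   // still holds a reference from the previous sync, this re-arm races
>   // with the pending done() call on that same object.
>   static ReusableFuture armForTxid(long txid) {
>     ReusableFuture f = CACHED.get();
>     f.reset(txid);
>     return f;
>   }
> }
> {code}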
> Log rolling is stuck because it cannot attain a safe point.
> {noformat}
>   java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
>   org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>   org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
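>
> The safe point is a two-latch handshake between the log roller and the ring 
> buffer consumer. A simplified sketch of the zigzag pattern (loosely modeled 
> on FSHLog$SafePointZigZagLatch, not the exact code):
> {code:java}
> import java.util.concurrent.CountDownLatch;
>
> // Simplified two-latch "zigzag" handshake used during log rolling.
> final class ZigZagLatch {
>   private final CountDownLatch attainedLatch = new CountDownLatch(1);
>   private final CountDownLatch releasedLatch = new CountDownLatch(1);
>
>   // Roller side: block until the consumer parks at the safe point. This
>   // is the await() the log roll is stuck in above.
>   void waitSafePoint() throws InterruptedException {
>     attainedLatch.await();
>   }
>
>   // Consumer side: entered only once it believes all outstanding syncs
>   // are done; it then parks until the roller has swapped the writer.
>   void safePointAttained() throws InterruptedException {
>     attainedLatch.countDown();
>     releasedLatch.await();
>   }
>
>   // Roller side: writer swapped, release the consumer.
>   void releaseSafePoint() {
>     releasedLatch.countDown();
>   }
> }
> {code}
> The cycle closes here: the roller cannot pass waitSafePoint() until the 
> consumer calls safePointAttained().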
> Meanwhile, the ring buffer consumer thinks that there are still outstanding 
> syncs that need to finish.
> {noformat}
>   org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
>   org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
>   org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
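>
> A hedged sketch of that consumer-side gate (illustrative; the real logic is 
> in FSHLog$RingBufferEventHandler.attainSafePoint), reusing the ReusableFuture 
> and ZigZagLatch sketches above:
> {code:java}
> // Illustrative consumer-side safe point gate; not the actual FSHLog code.
> final class RingBufferConsumer {
>   private ReusableFuture[] syncFutures = new ReusableFuture[0];
>
>   private boolean isOutstandingSyncs() {
>     for (ReusableFuture f : syncFutures) {
>       if (!f.isDone()) {
>         return true;  // a sync we issued has not been reported done
>       }
>     }
>     return false;
>   }
>
>   // If a future was prematurely re-armed by its handler, isDone() flips
>   // back to false and this loop never exits: the hang in the stack above.
>   void attainSafePoint(ZigZagLatch zigzag) throws InterruptedException {
>     while (isOutstandingSyncs()) {
>       Thread.yield();
>     }
>     zigzag.safePointAttained();
>   }
> }
> {code}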
> On the other hand, the SyncRunner threads are idle, just waiting for work, 
> implying that there are no pending SyncFutures left to run.
> {noformat}
>   sun.misc.Unsafe.park(Native Method)
>   java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>   java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>   org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
>   java.lang.Thread.run(Thread.java:748)
> {noformat}
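>
> A hedged sketch of that worker loop (illustrative; the real loop is 
> FSHLog$SyncRunner.run): the runner completes only the futures it takes from 
> its queue, so once the queue drains, nothing is left to mark the consumer's 
> "outstanding" futures done.
> {code:java}
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.LinkedBlockingQueue;
>
> // Illustrative SyncRunner-style worker, reusing the ReusableFuture sketch
> // above; not the actual FSHLog code.
> final class SyncRunner extends Thread {
>   private final BlockingQueue<ReusableFuture> queue = new LinkedBlockingQueue<>();
>
>   void offer(ReusableFuture f) {
>     queue.add(f);
>   }
>
>   @Override
>   public void run() {
>     try {
>       while (!isInterrupted()) {
>         ReusableFuture f = queue.take();  // parked here in the stack above
>         long syncedTxid = syncWriter();   // stand-in for the hflush/hsync
>         f.done(syncedTxid);               // wake the waiting handler
>       }
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>     }
>   }
>
>   private long syncWriter() {
>     return 1L;  // placeholder; the real runner syncs the WAL writer
>   }
> }
> {code}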
> Overall, the WAL system is deadlocked and makes no progress until the region 
> server is aborted. I got to the bottom of this issue and have a patch that 
> fixes it (more details in the comments due to the word limit on the 
> description).


