[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-07-26 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387508#comment-17387508
 ] 

Michael Stack commented on HBASE-25984:
---

Ignore my previous comment. Bisect (or more likely the pilot) identified the 
wrong issue... it is not this that is cause of failed test.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-07-24 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386780#comment-17386780
 ] 

Michael Stack commented on HBASE-25984:
---

git bisect flags this pr as why we have 100% fail running 
TestPostIncrementAndAppendBeforeWAL (See bottom of 
[https://ci-hadoop.apache.org/view/HBase/job/HBase/job/HBase-Find-Flaky-Tests/job/branch-2.3/lastSuccessfulBuild/artifact/output/dashboard.html)]
 Let me see if can fix.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365954#comment-17365954
 ] 

Hudson commented on HBASE-25984:


Results for branch branch-1
[build #135 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/135/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/135//General_Nightly_Build_Report/]


(/) {color:green}+1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/135//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-1/135//JDK8_Nightly_Build_Report_(Hadoop2)/]




(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365837#comment-17365837
 ] 

Hudson commented on HBASE-25984:


Results for branch branch-2.4
[build #144 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/144/]:
 (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/144/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/144/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/144/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/144/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 1.7.1, 2.4.5
>
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch 

[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365648#comment-17365648
 ] 

Hudson commented on HBASE-25984:


Results for branch branch-2
[build #279 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/279/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/279/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/279/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/279/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/279/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(x) {color:red}-1 client integration test{color}
-- Something went wrong with this stage, [check relevant console 
output|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2/279//console].


> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress 

[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-18 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365629#comment-17365629
 ] 

Hudson commented on HBASE-25984:


Results for branch branch-2.3
[build #239 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/239/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/239/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/239/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/239/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.3/239/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the 

[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-17 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365060#comment-17365060
 ] 

Bharath Vissapragada commented on HBASE-25984:
--

Back port PRs are WIP (PR - jira auto link isn't working, it will catch up 
soon).

https://github.com/apache/hbase/pull/3392
https://github.com/apache/hbase/pull/3393
https://github.com/apache/hbase/pull/3394

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-17 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365021#comment-17365021
 ] 

Andrew Kyle Purtell commented on HBASE-25984:
-

PR is merged. What is the status of this JIRA? Backports in progress? 

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-17 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364948#comment-17364948
 ] 

Hudson commented on HBASE-25984:


Results for branch master
[build #325 on 
builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/master/325/]:
 (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/master/325/General_20Nightly_20Build_20Report/]






(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/master/325/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/master/325/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-09 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360389#comment-17360389
 ] 

Andrew Kyle Purtell commented on HBASE-25984:
-

FWIW I would prefer we use ThreadLocals only when there is no reasonable 
alternative. I don't think that bar is reached here, because as the PR 
demonstrates, a shared cache can work. I'm familiar with this work and some 
microbenchmarks done on the result (for branch-1, though) where the shared 
cache approach does not hurt performance and in fact produces a small 
performance benefit. Let me hold back on further comment until we have 
microbenchmarks comparing thread local vs shared cache approaches for async WAL.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-09 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360383#comment-17360383
 ] 

Bharath Vissapragada commented on HBASE-25984:
--

After reading the AsyncFSWAL implementation, I think the overwrites of the 
futures are possible but it does not cause a deadlock because of the way the 
safe point is attained. I uploaded a draft patch that removes the usage of 
ThreadLocals for both the WAL implementations. It only matters for the FSHLog 
implementation but in general it seems risky to use ThreadLocals that are prone 
to overwrites. [~zhangduo] Any thoughts?

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-09 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360044#comment-17360044
 ] 

Viraj Jasani commented on HBASE-25984:
--

[~bharathv] I think you can create PR. Support to run QA with uploaded patch is 
no longer available IIRC.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
> Attachments: HBASE-25984-unit-test.patch
>
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-08 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359579#comment-17359579
 ] 

Bharath Vissapragada commented on HBASE-25984:
--

This has been a recurring problem for us under load deadlocking a bunch of 
region servers and severely dropping the ingestion throughput until our probes 
trigger a force abort of the RS pods and the throughput recovers. I cannot 
share the heap dump for obvious reasons but can share any other details that 
reviewers might be interested in.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
> {noformat}
> On the other hand, SyncRunner threads are idle and just waiting for work 
> implying that there are no pending SyncFutures that need to be run
> {noformat}
>sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1297)
> java.lang.Thread.run(Thread.java:748)
> {noformat}
> Overall the WAL system is dead locked and could make no progress until it was 
> aborted. I got to the bottom of this issue and have a patch that can fix it 
> (more details in the comments due to word limit in the description).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-25984) FSHLog WAL lockup with sync future reuse [RS deadlock]

2021-06-08 Thread Bharath Vissapragada (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359575#comment-17359575
 ] 

Bharath Vissapragada commented on HBASE-25984:
--

Obtained a heap dump and poked around quite a bit and observed that in-memory 
state is corrupted. After analyzing it and eyeballing the code (long story 
short), came up with the following theory...

This is the current implementation of SyncFuture#reset()
{noformat}
1  synchronized SyncFuture reset(final long sequence, Span span) {
2if (t != null && t != Thread.currentThread()) throw new 
IllegalStateException();
3t = Thread.currentThread();
4if (!isDone()) throw new IllegalStateException("" + sequence + " " + 
Thread.currentThread());
5this.doneSequence = NOT_DONE;
6this.ringBufferSequence = sequence;
7this.span = span;
8this.throwable = null;
9return this;
10 }
{noformat}
We can see, there are guards against overwriting un-finished sync futures with 
checks on ‘current thread’ (L2) and !isDone() (L4). These are not tight checks, 
consider the following sequence of interleaved actions that can result in a 
deadlock..
 # RPC handler#1 → WAL sync - SyncFuture#1 (from ThreadLocalCache) → Add to 
RingBuffer - state: NOT_DONE
 # RingBufferConsumer#onEvent() → SyncFuture#1 dispatch → old state: NOT_DONE, 
new state: DONE (via releaseSyncFuture() in SyncRunner.run())
 # RPCHandler#1 (reused for another region operation) → ThreadLocal 
SyncFuture#1 NOT_DONE (via reset()) (*this goes through because isDone() now 
returns true and L2 always returns true because we reuse the ThreadLocal 
instance*).
 #  RingBufferConsumer#onEvent() - attainSafePoint() -> works on the 
overwritten sync future state (from step 3)
 #   DEADLOCK === (as the sync future remains in the state forever)


The problem here is that once a SyncFuture is marked DONE, it is eligible for 
reuse. Even though ring buffer still has references to it and is checking on it 
to attain a safe point, in the background another handler can just overwrite it 
resulting in ring buffer operating on a wrong future and deadlocking the 
system. Very subtle bug.. I'm able to reproduce this in a carefully crafted 
unit test (attached). I have a fix solves this problem, will create a PR for 
review soon.

> FSHLog WAL lockup with sync future reuse [RS deadlock]
> --
>
> Key: HBASE-25984
> URL: https://issues.apache.org/jira/browse/HBASE-25984
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, wal
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.5
>Reporter: Bharath Vissapragada
>Assignee: Bharath Vissapragada
>Priority: Critical
>  Labels: deadlock, hang
>
> We use FSHLog as the WAL implementation (branch-1 based) and under heavy load 
> we noticed the WAL system gets locked up due to a subtle bug involving racy 
> code with sync future reuse. This bug applies to all FSHLog implementations 
> across branches.
> Symptoms:
> On heavily loaded clusters with large write load we noticed that the region 
> servers are hanging abruptly with filled up handler queues and stuck MVCC 
> indicating appends/syncs not making any progress.
> {noformat}
>  WARN  [8,queue=9,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=172383686, writePoint=172383690, 
> regionName=1ce4003ab60120057734ffe367667dca}
>  WARN  [6,queue=2,port=60020] regionserver.MultiVersionConcurrencyControl - 
> STUCK for : 296000 millis. 
> MultiVersionConcurrencyControl{readPoint=171504376, writePoint=171504381, 
> regionName=7c441d7243f9f504194dae6bf2622631}
> {noformat}
> All the handlers are stuck waiting for the sync futures and timing out.
> {noformat}
>  java.lang.Object.wait(Native Method)
> 
> org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:183)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.blockOnSync(FSHLog.java:1509)
> .
> {noformat}
> Log rolling is stuck because it was unable to attain a safe point
> {noformat}
>java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1799)
>  
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:900)
> {noformat}
> and the Ring buffer consumer thinks that there are some outstanding syncs 
> that need to finish..
> {noformat}
>   
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2031)
> 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
> 
>