[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-08 Thread Yechao Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880112#comment-16880112
 ] 

Yechao Chen commented on HBASE-22665:
-

check the the code and jstack ,

 

the wal log roller stuck ,so the .AbstractFSWAL.shutdown wait the 
AbstractFSWAL.rollWriter

with the lock rollwriterLock.lock 

 

it seems like that:

1、AbstractFSWAL.rollWriter  called rollWriterLock.lock();

2、AsyncFSWAL.doReplaceWriter called waitForSafePoint();

3、waitForSafePoint() can't finished 

4、AbstractFSWAL.shutdown called rollWriterLock.lock();(waiting)

5、The rs process can't be aborted

 

 

 

 

 

with {color:#FF}at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:628){color}

"regionserver/hbase-slave-216-99:16020.logRoller" #297 daemon prio=5 os_prio=0 
tid=0x7f202a4952c0 nid=0x34c2 waiting on condition [0x7f0fdd19f000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x7f18d60b93a8> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
 {color:#FF}at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:628){color}
 at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:656)
 at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:124)
 at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:699)
 at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:759)
 at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:184)
 at java.lang.Thread.run(Thread.java:745)

 

"regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x7f18a49b2bb8> (a 
java.util.concurrent.locks.ReentrantLock$FairSync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
 at 
java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
 at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
 at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815)
 at 
org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
 at 
org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
 at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
 at 
org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
 at 
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117)
 at java.lang.Thread.run(Thread.java:745)

 

 

!image-2019-07-08-16-07-37-664.png!

 

!image-2019-07-08-16-08-26-777.png!

!image-2019-07-08-16-14-43-455.png!

 

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:

[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-08 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880170#comment-16880170
 ] 

Duo Zhang commented on HBASE-22665:
---

I think the WAL has already been broken before shutting down... As this message

{noformat}
Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get 
sync result after 30 ms for txid=36380334, WAL system stuck?
{noformat}

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-08 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880331#comment-16880331
 ] 

Wellington Chevreuil commented on HBASE-22665:
--

{quote}RegionServer aborted failed when AbstractFSWAL.shutdown hang
{quote}
So the RS process hangs forever and never completes shutdown, [~chenyechao]?

{quote}I think the WAL has already been broken before shutting down... As this 
message{quote}
In this case, shouldn't we had handled that on 
[AsyncFSWAL.syncFailed|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L304]?
 I guess that would allow _waitForSafePoint_ to finish, and, consequently,  
_rollWriter_. which would release _rollWriterLock_ for _shutdown_. 

[~chenyechao], would you have the full RS log covering the period when this was 
observed?

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-08 Thread Yechao Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880403#comment-16880403
 ] 

Yechao Chen commented on HBASE-22665:
-

Thanks for reply [~Apache9][~wchevreuil]

 
{quote}So the RS process hangs forever and never completes shutdown?
{quote}

Yes,the RS process hangs forever and never finish shutdown before kill it 

bq. In this case, shouldn't we had handled that on AsyncFSWAL.syncFailed? I 
guess that would allow waitForSafePoint to finish, and, consequently, 
rollWriter. which would release rollWriterLock for shutdown.
bq. 
bq. Yechao Chen, would you have the full RS log covering the period when this 
was observed?

the log has uploaded, about the time "16:11:15.074" (the log date time was lost 
by error log  config  ,just ignore it) 
the log print error "HRegionServer - * ABORTING region server... "
before that ,there is a lot of wal "Slow sync" or "sync failed" ,and "Large 
batch operation detected " and gc pause is long (aoout 10 seconds many times),  
data hotspot in the region  of this rs.

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-08 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880645#comment-16880645
 ] 

Wellington Chevreuil commented on HBASE-22665:
--

Thanks [~chenyechao]!

My theory here is that the sync failure from "16:06:06" shown in the error 
below let to _AsyncFSWAL.unackedAppends_ size be left indefinitely > 0, as 
[_AsyncFSWAL.syncFailed_|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L291]
 never clears it out. Then, when _LogRoller_ triggers the log roll, reaching 
[_AsyncFSWAL.waitForSafePoint_|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L627],
 which triggers new [consumer 
thread|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L487]
 that will awake back the awaiting condition only if 
_AsyncFSWAL.unackedAppends_ is empty.

[~Apache9], would you think we should clean out _AsyncFSWAL.unackedAppends_ 
when handling sync failed? Or maybe add extra check in this 
_[AsyncFSWAL.consume|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L487]_
 method?

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-08 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880897#comment-16880897
 ] 

Duo Zhang commented on HBASE-22665:
---

{quote}
Duo Zhang, would you think we should clean out AsyncFSWAL.unackedAppends when 
handling sync failed?
{quote}

We haven't done this? IIRC we will move the entries from unackedAppends to 
toWriteAppends in syncFailed...

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-09 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881077#comment-16881077
 ] 

Wellington Chevreuil commented on HBASE-22665:
--

{quote}We haven't done this? IIRC we will move the entries from unackedAppends 
to toWriteAppends in syncFailed...
{quote}
Apparently, in 
[_syncFailed_|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L314]
 now, we just iterate over current _unackedAppends_, adding its elements to  
_toWriteAppends_, but we never clean _unackedAppends_. From what I can see in 
the code, we only remove elements from _unackedAppends_ in 
[_finishSync_|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L416].
 

Looks like an edge case, where we syncFailed and no sync has ever been 
triggered after that (maybe syncers are aborted? I need to analyse that further 
yet), then log roll went up, LogRoller entered that _waitForSafePoint_ method, 
it triggered the consumer thread in WAITING_ROLL state, reaching [this 
condition|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L487],
 which is false, thus never notifying LogRoller thread.

I'm not sure we can simply clear _unackedAppends_ in _syncFailed_, as some 
other methods are relying on this condition to decide what to do. Also, that 
could allow a log roll to happen when we still have entries to be synced on a 
present file. 

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881118#comment-16881118
 ] 

Duo Zhang commented on HBASE-22665:
---

Oh, you are right.

Let me think how to fix this. Need to add more comments.

Thanks a lot.

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-09 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881338#comment-16881338
 ] 

Wellington Chevreuil commented on HBASE-22665:
--

I guess we can assume it's fine to sign the _readyForRollingCond_ if:
{code:java}
writerBroken(currentEpochAndState) and waitingRoll(currentEpochAndState){code}
After all, the given writer won't be able to _sync_ anymore, right? Right now, 
we are just checking on the [_writerBroken_ condition 
first|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L479],
 ending consumer thread without doing nothing. Thus, if no more writes happen, 
_LogRoller_ thread may never move on.

In the hypothetical situation I wondered about in my previous messages, where 
1) sync failed, then 2) LogRoller kicked out, we have two state values that 
currently will prevent _waitForSafePoint_ from being awakened: 1) 
_writerBroken_ is true and 2) _unackedAppends_ not empty.

We may get this fixed by change [this 
block|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L479]
 as follows:
{code:java}
if (writerBroken(currentEpochAndState)&&!waitingRoll(currentEpochAndState)) {
  return;
}
if (waitingRoll(currentEpochAndState)) {
  if (writer.getLength() > fileLengthAtLastSync) {
// issue a sync
sync(writer);
  } else {
if (unackedAppends.isEmpty()||writerBroken(currentEpochAndState)) {
  readyForRolling = true;
  readyForRollingCond.signalAll();
}
  }
  return;
}
{code}

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881637#comment-16881637
 ] 

Duo Zhang commented on HBASE-22665:
---

Yes, it should be something like this [~wchevreuil]. But I want to write a UT 
to reproduce it first... Give me some time.

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882178#comment-16882178
 ] 

Duo Zhang commented on HBASE-22665:
---

Oh, the waitingRoll state is modified inside the consumeLock so there is no 
race...

Let's assume the writer is broken.
If syncFailed is executed first, then it will mark the writer as broken, so in 
the waitForSafePoint method, it will return at the writerBroken check.
If waitForSafePoint is executed first, then it will set the epochAndState to 
waitingRoll, so in the syncFailed method, we will wake up the threads waiting 
on readyForRollingCond.



> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882191#comment-16882191
 ] 

Duo Zhang commented on HBASE-22665:
---

The assumption here is that, after the waitForSafePoint call there will always 
be a consume call.

If the syncFailed is happened at the beginning, the waitForSafePoint will 
return without waiting on the condition.

If the syncFailed is happened in the middle, it will wake up the 
waitForSafePoint method, and the consume method will write to the new writer.

If the syncFailed is happened at the end, the consume method will return after 
the writerBroken check, and then the syncFailed method will wake up the 
waitForSafePoint method.

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882198#comment-16882198
 ] 

Duo Zhang commented on HBASE-22665:
---

And I also see some IllegalStateException in the log, which means we call close 
before calling flush, which is really strange...

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-10 Thread Yechao Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882211#comment-16882211
 ] 

Yechao Chen commented on HBASE-22665:
-

more log was updated by rs.log_part2.zip which from 00:00-17:32(has container 
log.part1)

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-10 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882290#comment-16882290
 ] 

Wellington Chevreuil commented on HBASE-22665:
--

Good summary, [~Apache9]! Agree with your assumptions here, but what if log 
roll reaches _waitForSafepoint_ when _writerBroken_ is false, but no syncs ever 
happen anymore? Maybe something like:

1) _syncFailed_ -> writerBroken is set to true -> no waiting for rolling yet -> 
log roll is triggered by _syncFailed_ 
[here|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L320];

2) log roll triggered by syncFailed will get _waitForSafepoint,_ but since 
_writerBroken_ is true, it doesn't wait;

3) log roll eventually reaches [this 
point|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L673]
 where _writerBroken_ state gets set back to false;

4) A periodic log roll is triggered, now this one will reach _waitForSafepoint_ 
with _writerBroken_ set to false, entering the waiting condition and with no 
additional syncs, never leaving it;

 

Unfortunately, this code path does not add much logging messages that would 
give more solid evidences on this.

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1, 
> rs.log_part2.zip
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882799#comment-16882799
 ] 

Duo Zhang commented on HBASE-22665:
---

After 3), the new writer is up and we will schedule a new consume task to write 
the pending entries out, this is why we add the unackedAppends back to 
toWriteAppends, as we need to write them to the new writer. This will lead to a 
new sync.

Anyway, will try to write an UT to fail the sync multiple times(maybe 1000) and 
also trigger log rollings, to see if we can finish without hanging.

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1, 
> rs.log_part2.zip
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-11 Thread binlijin (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882803#comment-16882803
 ] 

binlijin commented on HBASE-22665:
--

 Add the unackedAppends back to toWriteAppends, will make some edit log write 
to two hlogs, and when do log split/log recovery there may be encounter error 
log like :

2019-05-29 02:42:19,705 ERROR [RS_OPEN_REGION-regionserver/hbaseperf-9:16020-5] 
regionserver.HRegion: 8630a91ed22487e2e82b1d59b20ce4a4 : Found decreasing 
SeqId. PreId=9442752 key=mk:me/8630a91ed22487e2e82b1d59b20ce4a4/9442748; 
edit=[#edits: 1 = 
<546FA181C16-625D-4B98-A32D-BAA7F82B619E41559068398948/o:log/1559068398948/Put/vlen=1352/seqid=0;
 >]



> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1, 
> rs.log_part2.zip
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882821#comment-16882821
 ] 

Duo Zhang commented on HBASE-22665:
---

This is expected behavior. We can not just throw exception to upper layer as we 
do not know if it has been written out, so the only way is to retry forever...

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1, 
> rs.log_part2.zip
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-11 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882852#comment-16882852
 ] 

Wellington Chevreuil commented on HBASE-22665:
--

{quote}After 3), the new writer is up and we will schedule a new consume task 
to write the pending entries out, this is why we add the unackedAppends back to 
toWriteAppends, as we need to write them to the new writer. This will lead to a 
new sync.
{quote}
Yep, and even if this new sync fails, as the stream is already damaged, per the 
log below, only way we could reach this deadlock is if it enters [this 
condition|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L297]
 on _syncFailed_, but I think the epoch would always be the same.

{code:java}
16:06:06.466 [AsyncFSWAL-0] WARN  
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL - sync failed
java.io.IOException: stream already broken
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
 ~[hbase-server-2.1.2-**.jar:2.1.2-**]
at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
 ~[hbase-server-2.1.2-**.jar:2.1.2-**]
at 
org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:142)
 ~[hbase-server-2.1.2-**.jar:2.1.2-**]
at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:351) 
~[hbase-server-2.1.2-**.jar:2.1.2-**]
at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:534)
 ~[hbase-server-2.1.2-**.jar:2.1.2-**]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[?:1.8.0_112]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[?:1.8.0_112]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
{code}

{quote}Anyway, will try to write an UT to fail the sync multiple times(maybe 
1000) and also trigger log rollings, to see if we can finish without 
hanging.{quote}
Sounds like a hard to reproduce one. May need to forcibly close underlying OS 
as well?

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1, 
> rs.log_part2.zip
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message 

[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2019-07-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883018#comment-16883018
 ] 

Duo Zhang commented on HBASE-22665:
---

At least I must find a another bug with the new UT...

{noformat}
Exception in thread "AsyncFSWAL-0" java.lang.AssertionError
at 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.append(AbstractFSWAL.java:1009)
at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync(AsyncFSWAL.java:442)
at 
org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:567)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}

> RegionServer abort failed when AbstractFSWAL.shutdown hang
> --
>
> Key: HBASE-22665
> URL: https://issues.apache.org/jira/browse/HBASE-22665
> Project: HBase
>  Issue Type: Bug
> Environment: HBase 2.1.2
> Hadoop 3.1.x
> centos 7.4
>Reporter: Yechao Chen
>Priority: Major
> Attachments: HBASE-22665-UT.patch, image-2019-07-08-16-07-37-664.png, 
> image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, 
> jstack_20190625, jstack_20190704_1, jstack_20190704_2, rs.log.part1, 
> rs.log_part2.zip
>
>
> We use hbase 2.1.2,when the rs with heavy qps and rs abort with error like 
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to 
> get sync result after 30 ms for txid=36380334, WAL system stuck?"
>  
> RegionServer aborted failed when AbstractFSWAL.shutdown hang
>  
> jstack info always show the regionserver hang with "AbstractFSWAL.shutdown"
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 
> tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f18a49b2bb8> (a 
> java.util.concurrent.locks.ReentrantLock$FairSync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>  at 
> java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
>  {color:#FF}at 
> java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285){color}
> {color:#FF} at 
> org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color}
>  at 
> org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
>  at 
> org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
>  at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
>  {color:#FF}at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color}
> {color:#FF} at java.lang.Thread.run(Thread.java:745){color}
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang

2020-11-17 Thread Andrey Elenskiy (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233860#comment-17233860
 ] 

Andrey Elenskiy commented on HBASE-22665:
-

Hitting exactly the same issue on 2.2.4. The issue was exposed when the host 
was under massive load causing various pauses and OOMs across the OS.
 All the handlers are stuck on SyncFuture.get() while AsyncFSWal thread is just 
waiting on condition. Also, regionserver didn't trigger 
AbstractFSWAL.shutdown() initially and was stuck without it. Only once I 
connected with jdb to check the state, the shutdown() was called and it was 
stuck exiting until a timeout killed the process (prior to that it was stuck 
appending). Here is how the handlers look like:
{code:java}
Thread 48 (RpcServer.default.FPBQ.Fifo.handler=7,queue=1,port=16201):
 State: TIMED_WAITING
 Blocked count: 1193588
 Waited count: 2105228
 Stack:
 java.lang.Object.wait(Native Method)
 org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:142)
 
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:722)
 org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:637)
 org.apache.hadoop.hbase.regionserver.HRegion.sync(HRegion.java:8588)
 org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(HRegion.java:7946)
 
org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4130)
 org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4072)
 org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4003)
 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:1042)
 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicBatchOp(RSRpcServices.java:974)
 
org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:937)
 
org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2755)
 
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42290)
 org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
 org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
 org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
 org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318){code}
How AsynFSWal looks like:
{code:java}
Thread 164 (AsyncFSWAL-0):
 State: WAITING
 Blocked count: 6198
 Waited count: 32255423
 Waiting on 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@bf1c663
 Stack:
 sun.misc.Unsafe.park(Native Method)
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 java.lang.Thread.run(Thread.java:748){code}
There were lots of logs like this right before it got stuck:

 
{code:java}
2020-11-17 14:37:53,902 WARN [AsyncFSWAL-0] wal.AsyncFSWAL: sync failed
java.io.IOException: Connection to 192.168.2.23/192.168.2.23:15010 closed
 at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$channelInactive$2(FanOutOneBlockAsyncDFSOutput.java:289)
 at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:236)
 at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:99)
 at 
org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.channelInactive(FanOutOneBlockAsyncDFSOutput.java:288)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
 at 
org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
 at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
 at 
org.apache.hbase.thirdparty.io.ne