[jira] [Commented] (HBASE-22665) RegionServer abort failed when AbstractFSWAL.shutdown hang
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880112#comment-16880112 ] Yechao Chen commented on HBASE-22665:
-------------------------------------
Checking the code and the jstack output, the WAL log roller is stuck, so AbstractFSWAL.shutdown waits for AbstractFSWAL.rollWriter to release rollWriterLock. The sequence appears to be:
1. AbstractFSWAL.rollWriter calls rollWriterLock.lock().
2. AsyncFSWAL.doReplaceWriter calls waitForSafePoint().
3. waitForSafePoint() never finishes.
4. AbstractFSWAL.shutdown calls rollWriterLock.lock() and waits forever.
5. The RS process cannot be aborted, stuck at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:628).

{noformat}
"regionserver/hbase-slave-216-99:16020.logRoller" #297 daemon prio=5 os_prio=0 tid=0x7f202a4952c0 nid=0x34c2 waiting on condition [0x7f0fdd19f000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x7f18d60b93a8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
	at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.waitForSafePoint(AsyncFSWAL.java:628)
	at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:656)
	at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doReplaceWriter(AsyncFSWAL.java:124)
	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:699)
	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:759)
	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:184)
	at java.lang.Thread.run(Thread.java:745)

"regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x7f18a49b2bb8> (a java.util.concurrent.locks.ReentrantLock$FairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815)
	at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
	at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
	at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

!image-2019-07-08-16-07-37-664.png! !image-2019-07-08-16-08-26-777.png! !image-2019-07-08-16-14-43-455.png!
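The two stack traces above can be modeled with a short, self-contained sketch. All names here are hypothetical simplifications, not HBase code: the key structural point is that the roller thread holds one lock (rollWriterLock) while awaiting a Condition bound to a *different* lock, so `await` releases only the second lock and the first one stays held, blocking the shutdown thread indefinitely.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Minimal model of the hang (hypothetical names, not HBase code).
public class WalShutdownHang {
    static final ReentrantLock rollWriterLock = new ReentrantLock(true);
    static final ReentrantLock consumeLock = new ReentrantLock();
    static final Condition readyForRolling = consumeLock.newCondition();

    // "LogRoller" side: takes rollWriterLock, then parks on a condition of
    // consumeLock. awaitUninterruptibly releases consumeLock only, so
    // rollWriterLock stays held for as long as the condition is unsignaled.
    static void rollWriter() {
        rollWriterLock.lock();
        try {
            consumeLock.lock();
            try {
                readyForRolling.awaitUninterruptibly(); // never signaled
            } finally {
                consumeLock.unlock();
            }
        } finally {
            rollWriterLock.unlock();
        }
    }

    // "Shutdown" side: just tries to take rollWriterLock, with a timeout
    // here so the demo terminates instead of hanging like the real RS.
    static boolean tryShutdown(long timeoutMs) throws InterruptedException {
        if (rollWriterLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
            rollWriterLock.unlock();
            return true;
        }
        return false; // lock still held by the stuck roller
    }

    public static void main(String[] args) throws Exception {
        Thread roller = new Thread(WalShutdownHang::rollWriter, "logRoller");
        roller.setDaemon(true);
        roller.start();
        Thread.sleep(200); // let the roller park
        System.out.println("shutdown acquired lock: " + tryShutdown(500));
    }
}
```

Running this prints that shutdown could not acquire the lock, mirroring the jstack picture: the roller parked in a ConditionObject, shutdown parked on the ReentrantLock$FairSync.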
> RegionServer abort failed when AbstractFSWAL.shutdown hang
> ----------------------------------------------------------
>
>                 Key: HBASE-22665
>                 URL: https://issues.apache.org/jira/browse/HBASE-22665
>             Project: HBase
>          Issue Type: Bug
>         Environment: HBase 2.1.2
>                      Hadoop 3.1.x
>                      centos 7.4
>            Reporter: Yechao Chen
>            Priority: Major
>         Attachments: image-2019-07-08-16-07-37-664.png, image-2019-07-08-16-08-26-777.png, image-2019-07-08-16-14-43-455.png, jstack_20190625, jstack_20190704_1, jstack_20190704_2
>
> We use HBase 2.1.2. When the RS is under heavy QPS, it aborts with an error like:
> "Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 30 ms for txid=36380334, WAL system stuck?"
> The RegionServer abort then fails because AbstractFSWAL.shutdown hangs.
> jstack always shows the RegionServer hanging in AbstractFSWAL.shutdown:
> "regionserver/hbase-slave-216-99:16020" #25 daemon prio=5 os_prio=0 tid=0x7f204282c600 nid=0x34aa waiting on condition [0x7f0fe044d000]
>    java.lang.Thread.State: WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for <0x7f18a49b2bb8> (a java.util.concurrent.locks.ReentrantLock$FairSync)
> 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
> 	at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
> 	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
> 	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815)
> 	at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168)
> 	at org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221)
> 	at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117)
> 	at java.lang.Thread.run(Thread.java:745)
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880170#comment-16880170 ] Duo Zhang commented on HBASE-22665:
-----------------------------------
I think the WAL was already broken before shutting down, as this message shows:
{noformat}
Caused by: org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 30 ms for txid=36380334, WAL system stuck?
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880331#comment-16880331 ] Wellington Chevreuil commented on HBASE-22665:
----------------------------------------------
{quote}RegionServer aborted failed when AbstractFSWAL.shutdown hang
{quote}
So the RS process hangs forever and never completes shutdown, [~chenyechao]?
{quote}I think the WAL has already been broken before shutting down... As this message
{quote}
In that case, shouldn't we have handled it in [AsyncFSWAL.syncFailed|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L304]? I guess that would allow _waitForSafePoint_ to finish and, consequently, _rollWriter_, which would release _rollWriterLock_ for _shutdown_.
[~chenyechao], would you have the full RS log covering the period when this was observed?
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880403#comment-16880403 ] Yechao Chen commented on HBASE-22665:
-------------------------------------
Thanks for the reply [~Apache9] [~wchevreuil]
{quote}So the RS process hangs forever and never completes shutdown?
{quote}
Yes, the RS process hangs forever and never finishes shutdown until we kill it.
{quote}[~chenyechao], would you have the full RS log covering the period when this was observed?
{quote}
The log has been uploaded. Around "16:11:15.074" (the log dates were lost due to a bad log config, just ignore that), it prints the error "HRegionServer - * ABORTING region server...". Before that there are many WAL "Slow sync" and "sync failed" messages as well as "Large batch operation detected" warnings; GC pauses were long (about 10 seconds, many times), and there was a data hotspot in a region on this RS.
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880645#comment-16880645 ] Wellington Chevreuil commented on HBASE-22665:
----------------------------------------------
Thanks [~chenyechao]! My theory is that the sync failure at "16:06:06" left _AsyncFSWAL.unackedAppends_ non-empty indefinitely, as [_AsyncFSWAL.syncFailed_|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L291] never clears it out. Then _LogRoller_ triggers the log roll and reaches [_AsyncFSWAL.waitForSafePoint_|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L627], which triggers a new [consumer|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L487] run that will wake the awaiting condition only if _AsyncFSWAL.unackedAppends_ is empty.
[~Apache9], do you think we should clear out _AsyncFSWAL.unackedAppends_ when handling a failed sync? Or maybe add an extra check in the _[AsyncFSWAL.consume|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L487]_ method?
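The wait/signal pair described in this theory can be sketched as a minimal model. The names mirror the AsyncFSWAL fields for readability, but this is a hypothetical simplification, not HBase code; the demo uses a timed wait so it terminates instead of hanging.

```java
import java.util.ArrayDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Model: the roller waits for readyForRolling; the consumer signals it
// only when unackedAppends is empty. A leftover entry means no signal.
public class SafePointModel {
    static final ReentrantLock consumeLock = new ReentrantLock();
    static final Condition readyForRollingCond = consumeLock.newCondition();
    static final ArrayDeque<Long> unackedAppends = new ArrayDeque<>();
    static boolean readyForRolling;

    // Consumer side, modeled on the WAITING_ROLL branch of consume().
    static void consumerPass() {
        consumeLock.lock();
        try {
            if (unackedAppends.isEmpty()) { // stays false after a lost ack
                readyForRolling = true;
                readyForRollingCond.signalAll();
            }
        } finally {
            consumeLock.unlock();
        }
    }

    // Roller side, modeled on waitForSafePoint (timeout added for the demo).
    static boolean waitForSafePoint(long timeoutMs) throws InterruptedException {
        consumeLock.lock();
        try {
            long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            while (!readyForRolling && nanos > 0) {
                nanos = readyForRollingCond.awaitNanos(nanos);
            }
            return readyForRolling;
        } finally {
            consumeLock.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        unackedAppends.add(36380334L); // a sync failed and was never acked
        new Thread(SafePointModel::consumerPass).start();
        System.out.println("safe point reached: " + waitForSafePoint(300));
    }
}
```

With the leftover entry the consumer pass skips the signal and the roller times out; clear the queue before the consumer pass and it wakes immediately.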
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880897#comment-16880897 ] Duo Zhang commented on HBASE-22665:
-----------------------------------
{quote}[~Apache9], would you think we should clean out AsyncFSWAL.unackedAppends when handling sync failed?
{quote}
Haven't we done this already? IIRC we move the entries from unackedAppends to toWriteAppends in syncFailed...
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881077#comment-16881077 ] Wellington Chevreuil commented on HBASE-22665:
----------------------------------------------
{quote}We haven't done this? IIRC we will move the entries from unackedAppends to toWriteAppends in syncFailed...
{quote}
Apparently, in [_syncFailed_|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L314] we currently just iterate over _unackedAppends_, adding its elements to _toWriteAppends_, but we never clear _unackedAppends_. From what I can see in the code, we only remove elements from _unackedAppends_ in [_finishSync_|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L416].
It looks like an edge case: a sync failed and no sync was ever triggered after that (maybe the syncers are aborted? I still need to analyse that further). Then a log roll came along, the _LogRoller_ thread entered _waitForSafePoint_, which triggered the consumer thread in WAITING_ROLL state, reaching [this condition|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L487]. That condition is false, so the _LogRoller_ thread is never notified.
I'm not sure we can simply clear _unackedAppends_ in _syncFailed_, as other methods rely on it to decide what to do. That could also allow a log roll to happen while we still have entries to be synced in the current file.
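The copy-without-clear behaviour described here is easy to see in isolation. This is a hedged sketch with simplified, hypothetical shapes (real syncFailed works on WAL entry objects, not bare txids); the point is only that iterating a deque to requeue its entries does not drain it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the suspected bug: entries are copied back to the to-write
// queue on a failed sync, but the unacked queue itself is never drained.
public class SyncFailedSketch {
    static void onSyncFailed(Deque<Long> unackedAppends, Deque<Long> toWriteAppends) {
        // Iterating copies the entries for a retry...
        for (Long txid : unackedAppends) {
            toWriteAppends.add(txid);
        }
        // ...but nothing here removes them, so unackedAppends.isEmpty()
        // (the consumer's "safe to roll" test) can stay false forever.
    }

    public static void main(String[] args) {
        Deque<Long> unacked = new ArrayDeque<>();
        Deque<Long> toWrite = new ArrayDeque<>();
        unacked.add(36380334L);
        onSyncFailed(unacked, toWrite);
        System.out.println("unacked still non-empty: " + !unacked.isEmpty());
    }
}
```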
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881118#comment-16881118 ] Duo Zhang commented on HBASE-22665:
-----------------------------------
Oh, you are right. Let me think about how to fix this. We need to add more comments there too. Thanks a lot.
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881338#comment-16881338 ] Wellington Chevreuil commented on HBASE-22665:
----------------------------------------------
I guess we can assume it's fine to signal _readyForRollingCond_ if:
{code:java}
writerBroken(currentEpochAndState) && waitingRoll(currentEpochAndState){code}
After all, the given writer won't be able to _sync_ anymore, right? Right now we just check the [_writerBroken_ condition first|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L479], ending the consumer thread without doing anything. Thus, if no more writes happen, the _LogRoller_ thread may never move on.
In the hypothetical situation from my previous messages, where 1) a sync failed and then 2) _LogRoller_ kicked in, two state values currently prevent _waitForSafePoint_ from being awakened: 1) _writerBroken_ is true and 2) _unackedAppends_ is not empty.
We may fix this by changing [this block|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L479] as follows:
{code:java}
if (writerBroken(currentEpochAndState) && !waitingRoll(currentEpochAndState)) {
  return;
}
if (waitingRoll(currentEpochAndState)) {
  if (writer.getLength() > fileLengthAtLastSync) {
    // issue a sync
    sync(writer);
  } else {
    // also signal when the writer is broken: it will never ack the
    // remaining unackedAppends, so waiting for them to drain would hang
    if (unackedAppends.isEmpty() || writerBroken(currentEpochAndState)) {
      readyForRolling = true;
      readyForRollingCond.signalAll();
    }
  }
  return;
}
{code}
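The decision logic of the proposed change can be checked in isolation as a pure predicate. This is a hypothetical, simplified extraction for illustration (the real code reads epoch/state bits and writer lengths, not booleans):

```java
// Simplified predicate modeling the proposed consume() logic: under a
// pending roll, the roller may be awakened either when everything is
// acked or when the writer is already broken (it will never ack).
public class RollSignalCheck {
    static boolean shouldSignalRoller(boolean waitingRoll, boolean writerBroken,
                                      boolean unackedEmpty, boolean bytesPendingSync) {
        if (writerBroken && !waitingRoll) {
            return false; // nothing to do until a roll is requested
        }
        if (waitingRoll) {
            if (bytesPendingSync) {
                return false; // issue a sync instead; signal on a later pass
            }
            return unackedEmpty || writerBroken; // the fix: also signal when broken
        }
        return false;
    }

    public static void main(String[] args) {
        // The pre-fix hang scenario: roll pending, writer broken,
        // leftover unacked appends, nothing left to sync.
        System.out.println(shouldSignalRoller(true, true, false, false));
    }
}
```

In the hang scenario the predicate now returns true, so _waitForSafePoint_ would be signaled instead of parking forever.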
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881637#comment-16881637 ] Duo Zhang commented on HBASE-22665: --- Yes, it should be something like this [~wchevreuil]. But I want to write a UT to reproduce it first... Give me some time.
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882178#comment-16882178 ] Duo Zhang commented on HBASE-22665: --- Oh, the waitingRoll state is modified inside the consumeLock, so there is no race... Let's assume the writer is broken. If syncFailed is executed first, it will mark the writer as broken, so the waitForSafePoint method will return at the writerBroken check. If waitForSafePoint is executed first, it will set the epochAndState to waitingRoll, so in the syncFailed method we will wake up the threads waiting on readyForRollingCond.
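The writerBroken/waitingRoll checks discussed above read a single epochAndState int under the consumeLock. A rough sketch of such bit-packing (the exact bit layout here is an assumption for illustration, not copied from AsyncFSWAL):

```java
public class EpochAndStateSketch {
    // Assumed layout: lowest bit = waitingRoll, next bit = writerBroken,
    // remaining bits = the epoch number. Check AsyncFSWAL for the real one.
    static boolean waitingRoll(int epochAndState) { return (epochAndState & 1) != 0; }
    static boolean writerBroken(int epochAndState) { return ((epochAndState >>> 1) & 1) != 0; }
    static int epoch(int epochAndState) { return epochAndState >>> 2; }

    public static void main(String[] args) {
        int state = 7 << 2;                 // epoch 7, neither flag set
        System.out.println(waitingRoll(state) + " " + writerBroken(state));
        state |= 1;                          // waitForSafePoint marks waitingRoll
        state |= 2;                          // syncFailed marks writerBroken
        System.out.println(waitingRoll(state) + " " + writerBroken(state) + " " + epoch(state));
    }
}
```

Packing both flags and the epoch into one volatile int is what lets a single read under the lock observe a consistent snapshot, which is why the race Duo rules out above cannot occur.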
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882191#comment-16882191 ] Duo Zhang commented on HBASE-22665: --- The assumption here is that after the waitForSafePoint call there will always be a consume call. If syncFailed happens at the beginning, waitForSafePoint will return without waiting on the condition. If syncFailed happens in the middle, it will wake up the waitForSafePoint method, and the consume method will write to the new writer. If syncFailed happens at the end, the consume method will return after the writerBroken check, and then the syncFailed method will wake up the waitForSafePoint method.
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882198#comment-16882198 ] Duo Zhang commented on HBASE-22665: --- And I also see some IllegalStateException in the log, which means we call close before calling flush, which is really strange...
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882211#comment-16882211 ] Yechao Chen commented on HBASE-22665: - More logs were uploaded as rs.log_part2.zip, covering 00:00-17:32 (in addition to rs.log.part1).
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882290#comment-16882290 ] Wellington Chevreuil commented on HBASE-22665: -- Good summary, [~Apache9]! I agree with your assumptions here, but what if log roll reaches _waitForSafePoint_ when _writerBroken_ is false, and no syncs ever happen again? Maybe something like:
1) _syncFailed_ -> writerBroken is set to true -> not waiting for rolling yet -> log roll is triggered by _syncFailed_ [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L320];
2) the log roll triggered by syncFailed reaches _waitForSafePoint_, but since _writerBroken_ is true, it doesn't wait;
3) the log roll eventually reaches [this point|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L673], where the _writerBroken_ state gets set back to false;
4) a periodic log roll is triggered; this one reaches _waitForSafePoint_ with _writerBroken_ set to false, entering the waiting condition and, with no additional syncs, never leaving it.
Unfortunately, this code path does not add many logging messages that would give more solid evidence of this.
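Step 4 in the scenario above is essentially a lost wakeup: the last signal fires before the waiter arrives, the ready flag is reset, and no later signal ever comes. Stripped of the WAL specifics (all names here are hypothetical):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class LostWakeupSketch {
    static final ReentrantLock lock = new ReentrantLock();
    static final Condition readyForRollingCond = lock.newCondition();
    static boolean readyForRolling = false;   // guarded by lock

    public static void main(String[] args) throws Exception {
        // The "signal" happens before the waiter exists (analogous to the last
        // sync completing before the periodic roll calls waitForSafePoint).
        lock.lock();
        try {
            readyForRolling = true;
            readyForRollingCond.signalAll();  // nobody is waiting yet: signal is lost
            readyForRolling = false;          // step 3: state reset back to "not ready"
        } finally {
            lock.unlock();
        }

        CountDownLatch started = new CountDownLatch(1);
        Thread roller = new Thread(() -> {
            lock.lock();
            try {
                started.countDown();
                while (!readyForRolling) {    // no future sync will ever signal again
                    readyForRollingCond.awaitUninterruptibly();
                }
            } finally {
                lock.unlock();
            }
        }, "periodic-roll");
        roller.setDaemon(true);
        roller.start();
        started.await();
        roller.join(300);
        System.out.println("roller still parked: " + roller.isAlive());
    }
}
```

The proposed fix in the first comment attacks exactly this: signal readyForRollingCond from the consume path whenever waitingRoll is set and there is nothing left to sync, so the waiter cannot miss its wakeup.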
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882799#comment-16882799 ] Duo Zhang commented on HBASE-22665: --- After 3), the new writer is up and we will schedule a new consume task to write the pending entries out; this is why we add the unackedAppends back to toWriteAppends, as we need to write them to the new writer. This will lead to a new sync. Anyway, I will try to write a UT that fails the sync multiple times (maybe 1000) and also triggers log rolls, to see if we can finish without hanging.
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882803#comment-16882803 ] binlijin commented on HBASE-22665: -- Adding the unackedAppends back to toWriteAppends will make some edits be written to two hlogs, and during log split/log recovery we may encounter an error like:
{noformat}
2019-05-29 02:42:19,705 ERROR [RS_OPEN_REGION-regionserver/hbaseperf-9:16020-5] regionserver.HRegion: 8630a91ed22487e2e82b1d59b20ce4a4 : Found decreasing SeqId. PreId=9442752 key=mk:me/8630a91ed22487e2e82b1d59b20ce4a4/9442748; edit=[#edits: 1 = <546FA181C16-625D-4B98-A32D-BAA7F82B619E41559068398948/o:log/1559068398948/Put/vlen=1352/seqid=0; >]
{noformat}
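The decreasing-SeqId message above is recoverable precisely because every edit carries its sequence id, so replay can detect and drop the stale duplicate. A simplified illustration of that idea (this is not HBase's actual split/recovery code):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SeqIdDedupSketch {
    static class Edit {
        final String row; final long seqId; final String value;
        Edit(String row, long seqId, String value) { this.row = row; this.seqId = seqId; this.value = value; }
    }

    // Keep only the highest-seqId version of each row; a replayed duplicate
    // arriving with a lower (decreasing) seqId is detected and skipped.
    static Map<String, Edit> replay(List<Edit> edits) {
        Map<String, Edit> applied = new LinkedHashMap<>();
        for (Edit e : edits) {
            Edit prev = applied.get(e.row);
            if (prev != null && prev.seqId >= e.seqId) {
                System.out.println("Found decreasing SeqId. PreId=" + prev.seqId + " edit seqId=" + e.seqId);
                continue; // duplicate from the older hlog: drop it
            }
            applied.put(e.row, e);
        }
        return applied;
    }

    public static void main(String[] args) {
        // The same edit (seqId 9442748) shows up in two hlogs after re-queueing.
        List<Edit> edits = Arrays.asList(
            new Edit("mk:me", 9442748L, "v1"),
            new Edit("mk:me", 9442752L, "v2"),
            new Edit("mk:me", 9442748L, "v1"));  // replayed duplicate, lower seqId
        System.out.println(replay(edits).get("mk:me").seqId);  // 9442752
    }
}
```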
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882821#comment-16882821 ] Duo Zhang commented on HBASE-22665: --- This is expected behavior. We cannot just throw the exception to the upper layer, as we do not know whether the edit has been written out, so the only way is to retry forever...
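The retry-forever behavior described here — putting unacked appends back onto the to-write queue after the broken writer is replaced — can be sketched as follows (queue and method names are simplified stand-ins for AsyncFSWAL's toWriteAppends/unackedAppends; the single simulated failure replaces a real broken stream):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RequeueOnSyncFailureSketch {
    static int failuresLeft = 1;   // simulate one broken-writer episode

    // A "sync" that fails once, standing in for the damaged output stream.
    static void sync(List<String> writer, List<String> batch) {
        if (failuresLeft-- > 0) throw new RuntimeException("stream already broken");
        writer.addAll(batch);
    }

    public static void main(String[] args) {
        ArrayDeque<String> toWriteAppends = new ArrayDeque<>(Arrays.asList("e1", "e2", "e3"));
        List<String> unackedAppends = new ArrayList<>();
        List<String> writer = new ArrayList<>();

        while (!toWriteAppends.isEmpty() || !unackedAppends.isEmpty()) {
            while (!toWriteAppends.isEmpty()) {
                unackedAppends.add(toWriteAppends.poll());   // appended but not acked
            }
            try {
                sync(writer, unackedAppends);
                unackedAppends.clear();                      // acked: safe to drop
            } catch (RuntimeException e) {
                // syncFailed: roll to a new writer and re-queue the unacked
                // entries, since we cannot know whether they were persisted.
                toWriteAppends.addAll(unackedAppends);
                unackedAppends.clear();
                writer = new ArrayList<>();                  // "new writer"
            }
        }
        System.out.println(writer);   // [e1, e2, e3]
    }
}
```

This is also why binlijin's duplicate edits across two hlogs are unavoidable: an unacked append may or may not have reached the old file, so the safe choice is to write it again to the new one.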
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882852#comment-16882852 ] Wellington Chevreuil commented on HBASE-22665: -- {quote}After 3), the new writer is up and we will schedule a new consume task to write the pending entries out, this is why we add the unackedAppends back to toWriteAppends, as we need to write them to the new writer. This will lead to a new sync.{quote} Yep, and even if this new sync fails because the stream is already damaged (per the log below), the only way we could reach this deadlock is if it enters [this condition|https://github.com/apache/hbase/blob/branch-2.1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java#L297] in _syncFailed_, but I think the epoch would always be the same.
{code:java}
16:06:06.466 [AsyncFSWAL-0] WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL - sync failed
java.io.IOException: stream already broken
  at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424) ~[hbase-server-2.1.2-**.jar:2.1.2-**]
  at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513) ~[hbase-server-2.1.2-**.jar:2.1.2-**]
  at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:142) ~[hbase-server-2.1.2-**.jar:2.1.2-**]
  at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:351) ~[hbase-server-2.1.2-**.jar:2.1.2-**]
  at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:534) ~[hbase-server-2.1.2-**.jar:2.1.2-**]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
  at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
{code}
{quote}Anyway, will try to write a UT to fail the sync multiple times (maybe 1000) and also trigger log rollings, to see if we can finish without hanging.{quote} Sounds like a hard one to reproduce. May need to forcibly close the underlying OS as well?
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:815){color} > at > org.apache.hadoop.hbase.wal.AbstractFSWALProvider.shutdown(AbstractFSWALProvider.java:168) > at > org.apache.hadoop.hbase.wal.RegionGroupingProvider.shutdown(RegionGroupingProvider.java:221) > at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:239) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1445) > {color:#FF}at > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1117){color} > {color:#FF} at java.lang.Thread.run(Thread.java:745){color} > > > > -- This message
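The lock interaction behind the hang can be reduced to a small, self-contained sketch (illustrative only; `rollWriterLock`, `consumeLock` and `readyForRolling` mirror the AbstractFSWAL/AsyncFSWAL field names, but this is not the real WAL code). A stand-in "log roller" takes `rollWriterLock` and then parks uninterruptibly on a condition that is never signalled, standing in for `waitForSafePoint()`; a stand-in `shutdown()` then can never acquire `rollWriterLock`:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class WalShutdownHangSketch {

    /**
     * Starts a stand-in "log roller" that holds rollWriterLock and parks forever
     * in a stand-in waitForSafePoint(), then reports whether a stand-in
     * shutdown() can acquire rollWriterLock within a short timeout.
     */
    static boolean shutdownCanAcquire() {
        ReentrantLock rollWriterLock = new ReentrantLock(true); // fair lock, matching the jstack
        ReentrantLock consumeLock = new ReentrantLock();
        Condition readyForRolling = consumeLock.newCondition();
        CountDownLatch rollerHoldsLock = new CountDownLatch(1);

        Thread roller = new Thread(() -> {
            rollWriterLock.lock();                          // 1. rollWriter() takes rollWriterLock
            try {
                consumeLock.lock();
                try {
                    rollerHoldsLock.countDown();
                    readyForRolling.awaitUninterruptibly(); // 2. waitForSafePoint(): never signalled,
                } finally {                                 //    releases consumeLock but NOT rollWriterLock
                    consumeLock.unlock();
                }
            } finally {
                rollWriterLock.unlock();                    // never reached
            }
        }, "logRoller");
        roller.setDaemon(true);
        roller.start();

        try {
            rollerHoldsLock.await();
            // 3. shutdown() would call rollWriterLock.lock() and park forever;
            //    here we use a timed tryLock so the sketch terminates.
            return rollWriterLock.tryLock(200, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("shutdown acquired rollWriterLock: " + shutdownCanAcquire());
        // prints: shutdown acquired rollWriterLock: false
    }
}
```

The key point the sketch shows: `awaitUninterruptibly()` only releases the lock owning the condition it waits on, so the roller keeps `rollWriterLock` held for as long as the safe point never arrives, and the RS abort path stays parked on the same `ReentrantLock$FairSync` seen in the jstack above.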
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883018#comment-16883018 ] Duo Zhang commented on HBASE-22665:
---
At least I found another bug with the new UT...
{noformat}
Exception in thread "AsyncFSWAL-0" java.lang.AssertionError
	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.append(AbstractFSWAL.java:1009)
	at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.appendAndSync(AsyncFSWAL.java:442)
	at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:567)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{noformat}
(Attachment added: HBASE-22665-UT.patch)
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
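For context on the epoch check Wellington pointed at earlier (AsyncFSWAL.java#L297), the guard pattern can be sketched as below. This is illustrative only, not the actual HBase implementation: the real AsyncFSWAL packs the epoch and state bits into a single volatile int, while the sketch uses a plain `AtomicInteger`. A sync-failure callback captures the epoch of the writer it was issued against, and the handler drops the failure if a log roll has already bumped the epoch:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class EpochGuardSketch {
    // Sketch stand-in for the packed epochAndState field.
    private final AtomicInteger epoch = new AtomicInteger();

    /** Epoch a caller captures when it hands a sync off to the current writer. */
    int currentEpoch() {
        return epoch.get();
    }

    /** Called when a log roll replaces the writer: all older callbacks become stale. */
    void writerReplaced() {
        epoch.incrementAndGet();
    }

    /**
     * The guard: a sync failure is honoured only if it belongs to the current
     * writer; failures from an already-replaced writer are dropped as stale.
     * Returns true if the failure was handled, false if dropped.
     */
    boolean syncFailed(int callbackEpoch) {
        if (callbackEpoch != epoch.get()) {
            return false; // stale: the writer that failed was already rolled away
        }
        // ... the real code would fail the pending syncs and request a roll here ...
        return true;
    }

    public static void main(String[] args) {
        EpochGuardSketch wal = new EpochGuardSketch();
        int captured = wal.currentEpoch();   // epoch seen by an in-flight sync
        wal.writerReplaced();                // a log roll bumps the epoch
        System.out.println(wal.syncFailed(captured));            // prints false
        System.out.println(wal.syncFailed(wal.currentEpoch()));  // prints true
    }
}
```

Wellington's point in the thread is that in the hang scenario no roll completes in between, so the callback's epoch would always equal the current one and this guard would not be what swallows the failure.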
[ https://issues.apache.org/jira/browse/HBASE-22665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233860#comment-17233860 ] Andrey Elenskiy commented on HBASE-22665:
--
Hitting exactly the same issue on 2.2.4. The issue surfaced when the host was under massive load, causing various pauses and OOMs across the OS. All the handlers are stuck in SyncFuture.get() while the AsyncFSWAL thread is just waiting on a condition. Also, the regionserver didn't trigger AbstractFSWAL.shutdown() initially and was stuck without it. Only once I connected with jdb to check the state was shutdown() called, and it then got stuck exiting until a timeout killed the process (prior to that it was stuck appending).

Here is how the handlers look:
{code:java}
Thread 48 (RpcServer.default.FPBQ.Fifo.handler=7,queue=1,port=16201):
  State: TIMED_WAITING
  Blocked count: 1193588
  Waited count: 2105228
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:142)
    org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:722)
    org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:637)
    org.apache.hadoop.hbase.regionserver.HRegion.sync(HRegion.java:8588)
    org.apache.hadoop.hbase.regionserver.HRegion.doWALAppend(HRegion.java:7946)
    org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4130)
    org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4072)
    org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4003)
    org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:1042)
    org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicBatchOp(RSRpcServices.java:974)
    org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:937)
    org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2755)
    org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:42290)
    org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
    org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
    org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{code}
How the AsyncFSWAL thread looks:
{code:java}
Thread 164 (AsyncFSWAL-0):
  State: WAITING
  Blocked count: 6198
  Waited count: 32255423
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@bf1c663
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    java.lang.Thread.run(Thread.java:748)
{code}
There were lots of logs like this right before it got stuck:
{code:java}
2020-11-17 14:37:53,902 WARN [AsyncFSWAL-0] wal.AsyncFSWAL: sync failed
java.io.IOException: Connection to 192.168.2.23/192.168.2.23:15010 closed
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.lambda$channelInactive$2(FanOutOneBlockAsyncDFSOutput.java:289)
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.failed(FanOutOneBlockAsyncDFSOutput.java:236)
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.access$300(FanOutOneBlockAsyncDFSOutput.java:99)
	at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput$AckHandler.channelInactive(FanOutOneBlockAsyncDFSOutput.java:288)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
	at org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:242)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:228)
	at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:221)
	at org.apache.hbase.thirdparty.io.ne
{code}