[ https://issues.apache.org/jira/browse/HBASE-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721223#comment-14721223 ]
stack commented on HBASE-14317:
-------------------------------

The ringbuffer processor is blocked waiting on outstanding syncs to come in:

{code}
"regionserver/r12s16.sjc.aristanetworks.com/172.24.32.16:9104.append-pool1-t1" #140 prio=5 os_prio=0 tid=0x00007fbf5cc61800 nid=0xb2 in Object.wait() [0x00007fbf3a115000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:460)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.attainSafePoint(FSHLog.java:2024)
        - locked <0x0000000548756b60> (a java.lang.Object)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1999)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1910)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}

All processing of the ringbuffer is held up until we attain safe point, i.e. all syncers must come home. (This is by design: we are trying to roll logs, so no more edits are allowed in.)

The same 'hang' is to be found over in HBASE-13974; look in its jstack1.txt.

The 'fix' over in HBASE-13974 releases threads that are waiting on their sequenceid to come home; they are in the ring buffer behind the current point-of-processing/blockage. It looks like the blockage would persist after the HBASE-13974 timeout 'fires'.

The [~eclark] patch attached here, where we time out the root block, would be a better workaround IMO until a proper fix.

Still trying to manufacture the block 'naturally'.
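To make the shape of the blockage concrete, here is a minimal sketch of a handler that waits for all outstanding syncers before proceeding, with a timeout on that root wait. It is not the FSHLog code and not the attached patch: the class name SafePointSketch, the method attainSafePoint, and the use of a CountDownLatch are all assumptions made for illustration only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: hypothetical names, not the FSHLog implementation.
public class SafePointSketch {

    /**
     * Waits until every outstanding syncer has checked in, or until the
     * timeout elapses. Returns true if the safe point was attained and
     * false if we gave up (or were interrupted) while still waiting.
     */
    static boolean attainSafePoint(CountDownLatch outstandingSyncs, long timeoutMs) {
        try {
            return outstandingSyncs.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        // Healthy case: both syncers come home, so the safe point is reached.
        CountDownLatch healthy = new CountDownLatch(2);
        for (int i = 0; i < 2; i++) {
            new Thread(healthy::countDown).start();
        }
        System.out.println("healthy attained: " + attainSafePoint(healthy, 1000));

        // Stuck case: one syncer never returns (think: sync to a dead DN).
        // Without the timeout this wait would block the handler indefinitely.
        CountDownLatch stuck = new CountDownLatch(2);
        new Thread(stuck::countDown).start();
        System.out.println("stuck attained: " + attainSafePoint(stuck, 200));
    }
}
```

In the stuck case the caller gets a false return instead of waiting forever, which is the point at which a workaround could abandon the roll or fail the pending syncs. What the right recovery action is remains the open question here.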
> Stuck FSHLog: bad disk (HDFS-8960) and can't roll WAL
> -----------------------------------------------------
>
>                 Key: HBASE-14317
>                 URL: https://issues.apache.org/jira/browse/HBASE-14317
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.1
>            Reporter: stack
>            Priority: Critical
>         Attachments: 14317.test.txt, HBASE-14317.patch, [Java] RS stuck on WAL sync to a dead DN - Pastebin.com.html, raw.php, subset.of.rs.log
>
>
> hbase-1.1.1 and hadoop-2.7.1
> We try to roll logs because can't append (See HDFS-8960) but we get stuck.
> See attached thread dump and associated log. What is interesting is that
> syncers are waiting to take syncs to run and at same time we want to flush so
> we are waiting on a safe point but there seems to be nothing in our ring
> buffer; did we go to roll log and not add safe point sync to clear out
> ringbuffer?
> Needs a bit of study. Try to reproduce.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)