[ https://issues.apache.org/jira/browse/HBASE-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingjie Lai updated HBASE-13782:
--------------------------------
    Attachment: hbase-site.xml
                hbase-rs.log

Attached RS log and config files.

> RS stuck after FATAL ``FSHLog: Could not append.''
> --------------------------------------------------
>
>                 Key: HBASE-13782
>                 URL: https://issues.apache.org/jira/browse/HBASE-13782
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 1.0.1
>         Environment: hbase version: 1.0.0-cdh5.4.0
>                      hadoop version: 2.6.0-cdh5.4.0
>            Reporter: Mingjie Lai
>            Priority: Critical
>         Attachments: hbase-rs.log, hbase-site.xml
>
>
> hbase version: 1.0.0-cdh5.4.0
> hadoop version: 2.6.0-cdh5.4.0
> Environment: a 40-node hadoop cluster shared between a 10-node hbase cluster and a 30-node yarn cluster.
> We started to see that one RS had stopped serving any client requests since 2015-05-26 01:05:33, while all the other RSs were okay. I checked the RS log and found FATAL entries written when org.apache.hadoop.hbase.regionserver.wal.FSHLog tried to append() and sync():
> {code}
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Archiving hdfs://nameservice1/hbase/WALs/hbase08.company.com,60020,1431985722474/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966 to hdfs://nameservice1/hbase/oldWALs/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966
> 2015-05-26 01:05:33,701 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of wal
> {code}
> Since the HDFS cluster is shared with a YARN cluster, some IO-heavy jobs were running at the time and exhausted the xceiver threads on some of the DNs at that exact moment. I think that is why the RS got ``java.io.IOException: Bad connect ack with firstBadLink''.
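> For reference, the DN xceiver ceiling is set by dfs.datanode.max.transfer.threads (formerly dfs.datanode.max.xcievers). Below is a minimal sketch of reading the effective limit from a client-side Configuration, assuming the Hadoop 2.x DFSConfigKeys constants; the class name is made up for illustration:
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hdfs.DFSConfigKeys;
>
> public class XceiverLimitCheck {
>   public static void main(String[] args) {
>     // Picks up hdfs-site.xml/core-site.xml from the classpath.
>     Configuration conf = new Configuration();
>     // Each concurrent block read/write on a DN consumes one xceiver thread;
>     // "Bad connect ack with firstBadLink" names the saturated DN in the pipeline.
>     int limit = conf.getInt(DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_KEY,
>         DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_DEFAULT);
>     System.out.println("dfs.datanode.max.transfer.threads = " + limit);
>   }
> }
> {code}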
> The problem is that the RS got stuck without any response from then on: flushQueueLength grew to the ceiling and stayed there. The only log entries since then are from periodicFlusher:
> {code}
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,3992+80bb1,1432526964367.c4906e519c1f8206a284c66a8eda2159. after a delay of 11000
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0814+0416,1432541066864.cf42d5ab47e051d69e516971e82e84be. after a delay of 7874
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2022+7a571,1432528246524.299c1d4bb28fda2a4d9f248c6c22153c. after a delay of 22740
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2635+b9b677,1432540367215.749efc885317a2679e2ea39bb0845fbe. after a delay of 3162
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0401+985e,1432527151473.eb97576381fce10a9616efd471103920. after a delay of 9142
> {code}
> It looks like there is an RS-level deadlock triggered by the handling of the FATAL append exception. In the end I had to restart the RS service to rescue the regions from the stuck RS. The append error handler in FSHLog only requests a log roll and rethrows:
> {code}
>       } catch (Exception e) {
>         LOG.fatal("Could not append. Requesting close of wal", e);
>         requestLogRoll();
>         throw e;
>       }
>       numEntries.incrementAndGet();
>     }
> {code}
> Maybe the RS should just kill itself (abort) after the FATAL exception, since it can no longer append its WAL to HDFS?
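> A minimal sketch of that idea, assuming the append error path had an org.apache.hadoop.hbase.Abortable handle on the hosting RegionServer (HRegionServer implements Abortable); the class and method names below are hypothetical, not the actual 1.0.x code:
> {code}
> import java.io.IOException;
>
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> import org.apache.hadoop.hbase.Abortable;
>
> // Hypothetical policy object: what the WAL append error path could do if it
> // could reach the RegionServer. The real 1.0.x FSHLog only requests a log roll.
> public class WalAppendFailurePolicy {
>   private static final Log LOG = LogFactory.getLog(WalAppendFailurePolicy.class);
>
>   private final Abortable server; // e.g., the hosting HRegionServer
>
>   public WalAppendFailurePolicy(Abortable server) {
>     this.server = server;
>   }
>
>   // Called when the HDFS write behind the WAL fails.
>   public void onAppendFailure(IOException cause) {
>     LOG.fatal("Could not append. Requesting close of wal", cause);
>     // A WAL that cannot be appended means edits can no longer be made durable.
>     // Aborting lets the master reassign the regions instead of leaving the RS
>     // wedged with a full flush queue, as happened here.
>     server.abort("WAL append failed; aborting to avoid a stuck RS", cause);
>   }
> }
> {code}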