[ https://issues.apache.org/jira/browse/HBASE-13782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingjie Lai updated HBASE-13782:
--------------------------------
    Attachment: hbase-site.xml
                hbase-rs.log

Attached RS log and config files.

> RS stuck after FATAL ``FSHLog: Could not append.''
> --------------------------------------------------
>
>                 Key: HBASE-13782
>                 URL: https://issues.apache.org/jira/browse/HBASE-13782
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 1.0.1
>         Environment: hbase version: 1.0.0-cdh5.4.0
>                      hadoop version: 2.6.0-cdh5.4.0
>            Reporter: Mingjie Lai
>            Priority: Critical
>         Attachments: hbase-rs.log, hbase-site.xml
>
>
> hbase version: 1.0.0-cdh5.4.0
> hadoop version: 2.6.0-cdh5.4.0
> Environment: a 40-node hadoop cluster shared between a 10-node hbase cluster and a 30-node yarn cluster.
> We started to see that one RS had stopped serving any client requests since 2015-05-26 01:05:33, while all the other RSs were okay. I checked the RS log and found FATAL entries written when org.apache.hadoop.hbase.regionserver.wal.FSHLog tried to append() and sync():
> {code}
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Could not append. Requesting close of wal
> java.io.IOException: Bad connect ack with firstBadLink as 10.28.1.17:50010
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1472)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1373)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:600)
> 2015-05-26 01:05:33,700 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Archiving hdfs://nameservice1/hbase/WALs/hbase08.company.com,60020,1431985722474/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966 to hdfs://nameservice1/hbase/oldWALs/hbase08.company.com%2C60020%2C1431985722474.default.1432602140966
> 2015-05-26 01:05:33,701 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error syncing, request close of wal
> {code}
> Since the HDFS cluster is shared with a YARN cluster, some IO-heavy jobs were running at the time and exhausted the xceiver threads on some of the DNs at that exact moment. I think that is why the RS got ``java.io.IOException: Bad connect ack with firstBadLink''.
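> For reference, the DN xceiver ceiling is set by dfs.datanode.max.transfer.threads (formerly dfs.datanode.max.xcievers). Below is a minimal sketch of reading the effective limit from a client-side Configuration, assuming the Hadoop 2.x DFSConfigKeys constants; the class name is made up for illustration:
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hdfs.DFSConfigKeys;
>
> public class XceiverLimitCheck {
>   public static void main(String[] args) {
>     // Picks up hdfs-site.xml/core-site.xml from the classpath.
>     Configuration conf = new Configuration();
>     // Each concurrent block read/write on a DN consumes one xceiver thread;
>     // "Bad connect ack with firstBadLink" names the saturated DN in the pipeline.
>     int limit = conf.getInt(DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_KEY,
>         DFSConfigKeys.DFS_DATANODE_MAX_RECEIVER_THREADS_DEFAULT);
>     System.out.println("dfs.datanode.max.transfer.threads = " + limit);
>   }
> }
> {code}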
> The problem is that the RS got stuck without any response from then on: flushQueueLength grew to the ceiling and stayed there. The only log entries since then are from periodicFlusher:
> {code}
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,3992+80bb1,1432526964367.c4906e519c1f8206a284c66a8eda2159. after a delay of 11000
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0814+0416,1432541066864.cf42d5ab47e051d69e516971e82e84be. after a delay of 7874
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2022+7a571,1432528246524.299c1d4bb28fda2a4d9f248c6c22153c. after a delay of 22740
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,2635+b9b677,1432540367215.749efc885317a2679e2ea39bb0845fbe. after a delay of 3162
> 2015-05-26 02:06:26,742 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/hbase08.company.com/10.28.1.6:60020.periodicFlusher requesting flush for region myns:mytable,0401+985e,1432527151473.eb97576381fce10a9616efd471103920. after a delay of 9142
> {code}
> It looks like there is an RS-level deadlock triggered by the handling of the FATAL append exception. In the end I had to restart the RS service to rescue the regions from the stuck RS. The append error handler in FSHLog only requests a log roll and rethrows:
> {code}
>       } catch (Exception e) {
>         LOG.fatal("Could not append. Requesting close of wal", e);
>         requestLogRoll();
>         throw e;
>       }
>       numEntries.incrementAndGet();
>     }
> {code}
> Maybe the RS should just kill itself (abort) after the FATAL exception, since it can no longer append its WAL to HDFS?
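> A minimal sketch of that idea, assuming the append error path had an org.apache.hadoop.hbase.Abortable handle on the hosting RegionServer (HRegionServer implements Abortable); the class and method names below are hypothetical, not the actual 1.0.x code:
> {code}
> import java.io.IOException;
>
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> import org.apache.hadoop.hbase.Abortable;
>
> // Hypothetical policy object: what the WAL append error path could do if it
> // could reach the RegionServer. The real 1.0.x FSHLog only requests a log roll.
> public class WalAppendFailurePolicy {
>   private static final Log LOG = LogFactory.getLog(WalAppendFailurePolicy.class);
>
>   private final Abortable server; // e.g., the hosting HRegionServer
>
>   public WalAppendFailurePolicy(Abortable server) {
>     this.server = server;
>   }
>
>   // Called when the HDFS write behind the WAL fails.
>   public void onAppendFailure(IOException cause) {
>     LOG.fatal("Could not append. Requesting close of wal", cause);
>     // A WAL that cannot be appended means edits can no longer be made durable.
>     // Aborting lets the master reassign the regions instead of leaving the RS
>     // wedged with a full flush queue, as happened here.
>     server.abort("WAL append failed; aborting to avoid a stuck RS", cause);
>   }
> }
> {code}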