[ https://issues.apache.org/jira/browse/HBASE-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037223#comment-16037223 ]
Sean Busbey commented on HBASE-18137:
-------------------------------------

Yeah, that sounds like a good approach. Is the automatic handling opt-in or opt-out?

> Replication gets stuck for empty WALs
> -------------------------------------
>
>                 Key: HBASE-18137
>                 URL: https://issues.apache.org/jira/browse/HBASE-18137
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1
>            Reporter: Ashu Pachauri
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.1.11, 1.2.7
>
>
> Replication assumes that only the last WAL of a recovered queue can be empty.
> But intermittent DFS issues may cause empty WALs to be created (without the
> PWAL magic), and a WAL roll to happen without a regionserver crash. This
> leaves recovered queues with empty WALs in the middle, which causes
> replication to get stuck:
> {code}
> TRACE regionserver.ReplicationSource: Opening log <wal_file>
> WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue> Got:
> java.io.EOFException
>       at java.io.DataInputStream.readFully(DataInputStream.java:197)
>       at java.io.DataInputStream.readFully(DataInputStream.java:169)
>       at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
>       at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
>       at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {code}
> The WAL in question was completely empty but there were other WALs in the
> recovered queue which were newer and non-empty.
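
To make the failure mode concrete, here is a minimal sketch of a recovered queue with an empty WAL in the middle. It is not HBase code and not the patch for this issue: the `Wal` class, `open`, `drain`, and the `skipEmpty` flag are all hypothetical names chosen for illustration, and the flag merely stands in for whatever opt-in or opt-out switch the automatic handling might end up with.

```java
import java.io.EOFException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical model of a recovered replication queue; not HBase code.
class EmptyWalSkipSketch {

    /** Stand-in for a WAL file: only its name and length matter here. */
    static final class Wal {
        final String name;
        final long length;

        Wal(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    /** Mimics a reader that fails on a file with no header, like the EOFException in the trace. */
    static void open(Wal wal) throws EOFException {
        if (wal.length == 0) {
            throw new EOFException("no header in " + wal.name);
        }
    }

    /**
     * Processes the queue in order and returns the names of the WALs that were read.
     * With skipEmpty=false, an empty WAL stays at the head and blocks everything behind it.
     * With skipEmpty=true (an assumed switch), zero-length WALs are dropped and reading continues.
     */
    static List<String> drain(Deque<Wal> queue, boolean skipEmpty) {
        List<String> read = new ArrayList<>();
        while (!queue.isEmpty()) {
            Wal head = queue.peek();
            try {
                open(head);
                read.add(head.name);
                queue.poll();
            } catch (EOFException e) {
                if (skipEmpty && head.length == 0) {
                    queue.poll(); // drop the empty file and move on
                } else {
                    break; // a real source would keep retrying; the sketch just stops
                }
            }
        }
        return read;
    }

    /** A queue shaped like the one in the report: an empty WAL between two non-empty ones. */
    static Deque<Wal> sampleQueue() {
        return new ArrayDeque<>(Arrays.asList(
                new Wal("wal.1", 128), new Wal("wal.2", 0), new Wal("wal.3", 256)));
    }

    public static void main(String[] args) {
        System.out.println(drain(sampleQueue(), false)); // prints [wal.1]
        System.out.println(drain(sampleQueue(), true));  // prints [wal.1, wal.3]
    }
}
```

In the first call the newer, non-empty `wal.3` is never reached, which matches the stuck state described in the report. The second call shows why the opt-in versus opt-out choice matters: skipping is only safe if a zero-length file really contains no edits.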