[ https://issues.apache.org/jira/browse/HBASE-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037218#comment-16037218 ]

Andrew Purtell commented on HBASE-18137:
----------------------------------------

bq. the workaround was to get a WAL file with just the header and then manually 
replace the 0-length files with the header-but-no-edits file.

Reading this, it occurs to me that we may want to handle this with a small 
enhancement to hbck that applies the same workaround as a new fix strategy. 
Start by ensuring we emit warnings when a replication queue stalls. Take over 
or close HBASE-12125 in lieu of this issue. Implement. Then operators can fix 
replication queues that have stalled due to HDFS-level issues. There can also 
be an option for less conservative folks who prefer automatic handling of this 
condition. (There's the off chance the file's missing block will be found once 
an offline datanode is brought back online.)
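
To make the idea concrete, here is a rough sketch of what such a fix strategy 
could do, using only the HDFS FileSystem API: scan the queue's WAL directory 
for zero-length files and overwrite each with a copy of a known-good 
header-only WAL, per the manual workaround quoted above. This is not hbck 
code; the class and method names are made up for illustration.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class EmptyWalFixer {
  /**
   * Find zero-length WAL files under walDir and overwrite each with a copy
   * of headerOnlyWal, a WAL containing just the header and no edits.
   * With dryRun=true we only report, which would be the conservative default.
   */
  public static void fixEmptyWals(Configuration conf, Path walDir,
      Path headerOnlyWal, boolean dryRun) throws Exception {
    FileSystem fs = walDir.getFileSystem(conf);
    for (FileStatus stat : fs.listStatus(walDir)) {
      if (stat.isFile() && stat.getLen() == 0) {
        System.out.println("Zero-length WAL found: " + stat.getPath());
        if (!dryRun) {
          // Overwrite in place so the replication reader sees a valid
          // header and can move past this file to the newer WALs.
          FileUtil.copy(fs, headerOnlyWal, fs, stat.getPath(),
              false /* deleteSource */, true /* overwrite */, conf);
        }
      }
    }
  }
}
{code}

The dry-run mode maps to the conservative behavior described above; the 
automatic-handling option would simply disable it.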

> Replication gets stuck for empty WALs
> -------------------------------------
>
>                 Key: HBASE-18137
>                 URL: https://issues.apache.org/jira/browse/HBASE-18137
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1
>            Reporter: Ashu Pachauri
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.1.11, 1.2.7
>
>
> Replication assumes that only the last WAL of a recovered queue can be empty. 
> But intermittent DFS issues may cause empty WALs to be created (without the 
> PWAL magic) and a WAL roll to happen without a regionserver crash. This 
> leaves recovered queues with empty WALs in the middle, which causes 
> replication to get stuck:
> {code}
> TRACE regionserver.ReplicationSource: Opening log <wal_file>
> WARN regionserver.ReplicationSource: <peer_cluster_id>-<recovered_queue> Got:
> java.io.EOFException
>       at java.io.DataInputStream.readFully(DataInputStream.java:197)
>       at java.io.DataInputStream.readFully(DataInputStream.java:169)
>       at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
>       at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1843)
>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:168)
>       at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:177)
>       at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:66)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:312)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
>       at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
>       at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
> {code}
> The WAL in question was completely empty, but there were other WALs in the 
> recovered queue that were newer and non-empty.
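
For reference, the distinction that matters in the description above is 
between a zero-byte file (no header at all, which is what trips the 
EOFException) and a header-only WAL. One way to tell them apart is to look 
for the PWAL magic at the start of the file. A minimal sketch follows; the 
helper name is hypothetical, and note that legacy SequenceFile-format WALs 
like the one in the stack trace start with "SEQ" instead.

{code}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalHeaderCheck {
  // Magic bytes at the start of a ProtoBuf-format WAL, per the description.
  private static final byte[] PWAL_MAGIC =
      "PWAL".getBytes(StandardCharsets.US_ASCII);

  /** Returns true if the file begins with the PWAL magic, i.e. has a header. */
  public static boolean hasWalHeader(Configuration conf, Path wal)
      throws Exception {
    FileSystem fs = wal.getFileSystem(conf);
    // A zero-length or truncated file cannot contain the magic.
    if (fs.getFileStatus(wal).getLen() < PWAL_MAGIC.length) {
      return false;
    }
    byte[] buf = new byte[PWAL_MAGIC.length];
    try (FSDataInputStream in = fs.open(wal)) {
      in.readFully(buf); // FSDataInputStream extends DataInputStream
    }
    return Arrays.equals(buf, PWAL_MAGIC);
  }
}
{code}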



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
