[ https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677318#comment-16677318 ]
Sean Busbey commented on HBASE-20604:
-------------------------------------

{code}
@@ -416,7 +420,15 @@ public class ProtobufLogReader extends ReaderBase {
         if (LOG.isTraceEnabled()) {
           LOG.trace("Encountered a malformed edit, seeking back to last good position in file, from "+ inputStream.getPos()+" to " + originalPosition, eof);
         }
-        seekOnFs(originalPosition);
+        // If stuck at the same place and we got and exception, lets go back at the beginning.
+        if (inputStream.getPos() == originalPosition && resetPosition) {
+          if (LOG.isTraceEnabled()) {
+            LOG.trace("Seeking to the beginning of the WAL, current position " + originalPosition + " is the same as the original position.");
+          }
+          seekOnFs(0);
+        } else {
+          seekOnFs(originalPosition);
+        }
{code}

The {{LOG.trace}} block just before this addition should move inside the {{else}} clause that's added, because currently in the "reset to start" case we log two TRACE messages for a single seek: the "seeking back to last good position" message and then the "seeking to the beginning" message.

After the above, the {{LOG.trace}} message emitted when we seek to the start should state, alongside the why ("original and current positions match"), that we got a malformed edit.

With those two changes and the long line from checkstyle corrected, I'm +1.

> ProtobufLogReader#readNext can incorrectly loop to the same position in the
> stream until the WAL is rolled
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-20604
>                 URL: https://issues.apache.org/jira/browse/HBASE-20604
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal
>    Affects Versions: 3.0.0
>            Reporter: Esteban Gutierrez
>            Assignee: Esteban Gutierrez
>            Priority: Critical
>         Attachments: HBASE-20604.002.patch, HBASE-20604.patch
>
>
> Every time we call {{ProtobufLogReader#readNext}} we consume the input stream
> associated to the {{FSDataInputStream}} from the WAL that we are reading.
> Under certain conditions, e.g. when using the encryption at rest
> ({{CryptoInputStream}}), the stream can return partial data, which can cause a
> premature EOF that causes {{inputStream.getPos()}} to return to the same
> original position, causing {{ProtobufLogReader#readNext}} to retry the
> reads until the WAL is rolled.
> The side effect of this issue is that {{ReplicationSource}} can get stuck
> until the WAL is rolled, causing replication delays of up to an hour in some
> cases.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
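
For reference, the restructuring requested in the review comment could look roughly like the sketch below. It is not the committed patch: {{SeekBackSketch}}, {{seekTargetAfterMalformedEdit}} and the {{TRACE}} list are hypothetical stand-ins so the branch logic can run on its own, the returned value stands in for the argument to {{seekOnFs(...)}}, and the message wording is only one way to satisfy the two requests (one trace message per branch, with the reset message also mentioning the malformed edit).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the seek-back branch of ProtobufLogReader#readNext. */
public class SeekBackSketch {
  /** Trace messages are collected here instead of going through a real logger. */
  public static final List<String> TRACE = new ArrayList<>();

  /**
   * Picks the seek target after a malformed edit and records exactly one message.
   * currentPosition stands in for inputStream.getPos(); the return value stands in
   * for the argument that would be passed to seekOnFs(...).
   */
  public static long seekTargetAfterMalformedEdit(long currentPosition,
      long originalPosition, boolean resetPosition) {
    if (currentPosition == originalPosition && resetPosition) {
      // Stuck at the same place: say why we reset and that the edit was malformed.
      TRACE.add("Encountered a malformed edit and current position " + currentPosition
          + " is the same as the original position, seeking to the beginning of the WAL");
      return 0L;
    } else {
      // The pre-existing message now lives only in the branch it describes.
      TRACE.add("Encountered a malformed edit, seeking back to last good position"
          + " in file, from " + currentPosition + " to " + originalPosition);
      return originalPosition;
    }
  }

  public static void main(String[] args) {
    System.out.println(seekTargetAfterMalformedEdit(128L, 128L, true));
    System.out.println(seekTargetAfterMalformedEdit(256L, 128L, true));
    TRACE.forEach(System.out::println);
  }
}
```

In the real reader each message would still be guarded by {{LOG.isTraceEnabled()}} and the original message would keep passing the {{eof}} exception; both are left out here only to keep the sketch self-contained.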