[ https://issues.apache.org/jira/browse/HBASE-6719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13448578#comment-13448578 ]
terry zhang commented on HBASE-6719:
------------------------------------

I think we need to handle the IOException carefully, and it is better not to skip the HLog unless it is really corrupted. We can log this failure as fatal and skip the HLog (by deleting the HLog's ZK node manually) only if we have to.

> [replication] Data will lose if open a Hlog failed more than maxRetriesMultiplier
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-6719
>                 URL: https://issues.apache.org/jira/browse/HBASE-6719
>             Project: HBase
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.94.1
>            Reporter: terry zhang
>            Assignee: terry zhang
>            Priority: Critical
>             Fix For: 0.94.2
>
>         Attachments: hbase-6719.patch
>
>
> Please take a look at the code below:
> {code:title=ReplicationSource.java|borderStyle=solid}
> protected boolean openReader(int sleepMultiplier) {
> {
>   ...
>   catch (IOException ioe) {
>     LOG.warn(peerClusterZnode + " Got: ", ioe);
>     // TODO Need a better way to determinate if a file is really gone but
>     // TODO without scanning all logs dir
>     if (sleepMultiplier == this.maxRetriesMultiplier) {
>       LOG.warn("Waited too long for this file, considering dumping");
>       // Opening the file has failed maxRetriesMultiplier times (default 10)
>       return !processEndOfFile();
>     }
>   }
>   return true;
>   ...
> }
>
> protected boolean processEndOfFile() {
>   if (this.queue.size() != 0) {
>     // Skips this HLog: data loss
>     this.currentPath = null;
>     this.position = 0;
>     return true;
>   } else if (this.queueRecovered) {
>     // Terminates the failover replication source thread: data loss
>     this.manager.closeRecoveredQueue(this);
>     LOG.info("Finished recovering the queue");
>     this.running = false;
>     return true;
>   }
>   return false;
> }
> {code}
> Sometimes HDFS runs into a problem while the HLog file itself is actually fine. The
> HLog is then skipped, so after HDFS comes back some data is lost and cannot be
> recovered in the slave cluster.
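
As a rough illustration of the more careful handling suggested in the comment at the top of this message, the retry decision could be separated from the skip decision. This is only a sketch under stated assumptions: `HlogOpenPolicy`, `Action`, `decide` and the `confirmedUnreadable` flag are hypothetical names, not part of HBase or of the attached patch, and how a caller would actually confirm that a file is gone or corrupted is left open.

```java
/** Hypothetical helper for illustration only; not part of HBase. */
final class HlogOpenPolicy {

    /** What the replication source could do after a failed attempt to open a log. */
    enum Action {
        RETRY,          // keep sleeping and retrying the same log
        SKIP_LOG,       // move on to the next log in the queue
        HOLD_AND_ALERT  // keep the log queued and report the failure loudly
    }

    private HlogOpenPolicy() {}

    /**
     * @param sleepMultiplier      current retry multiplier
     * @param maxRetriesMultiplier retry ceiling (the issue mentions a default of 10)
     * @param confirmedUnreadable  assumption: true only when the caller has verified
     *                             that the log is really gone or corrupted
     */
    static Action decide(int sleepMultiplier, int maxRetriesMultiplier,
                         boolean confirmedUnreadable) {
        if (confirmedUnreadable) {
            // Skipping is acceptable only when the file cannot be read at all.
            return Action.SKIP_LOG;
        }
        if (sleepMultiplier < maxRetriesMultiplier) {
            return Action.RETRY;
        }
        // Retries are exhausted but the file may still be fine (for example,
        // HDFS is temporarily unavailable), so do not drop it.
        return Action.HOLD_AND_ALERT;
    }
}
```

With a policy like this, reaching maxRetriesMultiplier on its own would never cause a log to be dropped. An operator could still skip the log by hand, as described in the comment, once the failure has been reported.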