[ https://issues.apache.org/jira/browse/HBASE-6758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457998#comment-13457998 ]

Devaraj Das commented on HBASE-6758:
------------------------------------

[~yuzhih...@gmail.com] Hey thanks for taking the patch for a spin. Talk about races! Here it seems like the splitter didn't complete within the expected time, and the replication didn't happen for some data. Here are the relevant log snippets (look for "considering dumping" where the file got dropped before the splitter completed). But in this case, the issue can be addressed by increasing the number of retries (which is already configurable). The patch attached here doesn't attempt to solve this problem.

{noformat}
2012-09-17 18:13:03,665 WARN [ReplicationExecutor-0.replicationSource,2-sea-lab-0,41831,1347930742751] regionserver.ReplicationSource(555): 2-sea-lab-0,41831,1347930742751 Got:
java.io.IOException: File from recovered queue is nowhere to be found
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:537)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:304)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:41196/user/hduser/hbase/.oldlogs/sea-lab-0%2C41831%2C1347930742751.1347930771911
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
	at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:796)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:58)
	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:166)
	at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:689)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:503)
	... 1 more
2012-09-17 18:13:03,665 WARN [ReplicationExecutor-0.replicationSource,2-sea-lab-0,41831,1347930742751] regionserver.ReplicationSource(559): Waited too long for this file, considering dumping
2012-09-17 18:13:03,665 INFO [ReplicationExecutor-0.replicationSource,2-sea-lab-0,41831,1347930742751] regionserver.ReplicationSourceManager(365): Done with the recovered queue 2-sea-lab-0,41831,1347930742751
2012-09-17 18:13:04,305 DEBUG [main-EventThread] wal.HLogSplitter(657): Archived processed log hdfs://localhost:41196/user/hduser/hbase/.logs/sea-lab-0,41831,1347930742751-splitting/sea-lab-0%2C41831%2C1347930742751.1347930771911 to hdfs://localhost:41196/user/hduser/hbase/.oldlogs/sea-lab-0%2C41831%2C1347930742751.1347930771911
2012-09-17 18:13:04,306 INFO [main-EventThread] master.SplitLogManager(392): Done splitting /1/splitlog/hdfs%3A%2F%2Flocalhost%3A41196%2Fuser%2Fhduser%2Fhbase%2F.logs%2Fsea-lab-0%2C41831%2C1347930742751-splitting%2Fsea-lab-0%252C41831%252C1347930742751.1347930771911
{noformat}

> [replication] The replication-executor should make sure the file that it is
> replicating is closed before declaring success on that file
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6758
>                 URL: https://issues.apache.org/jira/browse/HBASE-6758
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>         Attachments: 6758-1-0.92.patch, 6758-2-0.92.patch,
> TEST-org.apache.hadoop.hbase.replication.TestReplication.xml
>
>
> I have seen cases where the replication-executor would lose data to replicate
> since the file hasn't been closed yet. Upon closing, the new data becomes
> visible. Before that happens the ZK node shouldn't be deleted in
> ReplicationSourceManager.logPositionAndCleanOldLogs. Changes need to be made
> in ReplicationSource.processEndOfFile as well (currentPath related).
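
On the comment's point that raising the retry count gives the splitter more time before the source drops the file: the sketch below is a rough, hypothetical model of that trade-off, not HBase code. It assumes the source sleeps for a base interval multiplied by the attempt number before each retry (linear back-off); the class and method names are invented for illustration, and the actual back-off in ReplicationSource should be checked against the branch in use.

```java
// Hypothetical model, not HBase code: estimates how long a source keeps
// retrying before it gives up on a file, ASSUMING linear back-off
// (sleep = base * attempt). All names here are illustrative only.
final class RetryBudget {
    private RetryBudget() {}

    /** Sleep taken before the given 1-based attempt under the assumed back-off. */
    static long sleepBeforeAttemptMillis(long baseSleepMillis, int attempt) {
        return baseSleepMillis * attempt;
    }

    /** Total time slept across attempts 1..maxRetries. */
    static long totalWaitMillis(long baseSleepMillis, int maxRetries) {
        long total = 0;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            total += sleepBeforeAttemptMillis(baseSleepMillis, attempt);
        }
        return total;
    }

    /**
     * Smallest retry count whose total wait is at least requiredMillis,
     * or -1 if that is not reached within limit retries.
     */
    static int retriesNeeded(long baseSleepMillis, long requiredMillis, int limit) {
        for (int retries = 1; retries <= limit; retries++) {
            if (totalWaitMillis(baseSleepMillis, retries) >= requiredMillis) {
                return retries;
            }
        }
        return -1;
    }
}
```

Under these assumptions, a 1000 ms base with 10 retries waits 55 seconds in total, so a splitter that needs a minute would call for 11 retries. In the log above the file was archived about 640 ms after the source gave up, which is why a slightly larger retry budget would have been enough in this run.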