[jira] [Commented] (HBASE-6758) [replication] The replication-executor should make sure the file that it is replicating is closed before declaring success on that file

Devaraj Das (JIRA) Tue, 18 Sep 2012 10:56:09 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-6758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457998#comment-13457998
 ]


Devaraj Das commented on HBASE-6758:
------------------------------------

[~yuzhih...@gmail.com] Hey, thanks for taking the patch for a spin.

Talk about races! Here it looks like the log splitter didn't finish within the 
time the replication source was willing to wait, so some data was never 
replicated. 

The relevant log snippets are below (look for "considering dumping": that is 
where the replication source dropped the file before the splitter had finished 
archiving it). In this case the issue can be addressed by increasing the number 
of retries, which is already configurable. The patch attached here doesn't 
attempt to solve this problem.
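
To illustrate why a higher retry count helps here, the following is a minimal, self-contained sketch of a bounded retry loop with a growing sleep. It is not the ReplicationSource code: the class name, `waitForFile`, `appearsAfter`, and the parameter names are all hypothetical, and the "file" is simulated by a counter that only reports success after a fixed number of polls, standing in for a splitter that archives the log late.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class RecoveredLogRetrySketch {

    /**
     * Polls fileExists at most maxRetries times, sleeping baseSleepMs * attempt
     * after each failed poll. Returns true as soon as the file is seen, false
     * if every poll fails (the "considering dumping" case).
     * Hypothetical helper, not an HBase API.
     */
    static boolean waitForFile(BooleanSupplier fileExists, int maxRetries, long baseSleepMs)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (fileExists.getAsBoolean()) {
                return true;
            }
            Thread.sleep(baseSleepMs * attempt);
        }
        return false;
    }

    /** Simulates a file that becomes visible on the given poll number (assumption for the demo). */
    static BooleanSupplier appearsAfter(int polls) {
        AtomicInteger seen = new AtomicInteger();
        return () -> seen.incrementAndGet() >= polls;
    }

    public static void main(String[] args) throws InterruptedException {
        // Too few retries: the source gives up before the splitter is done.
        System.out.println(waitForFile(appearsAfter(5), 3, 1L));   // prints false
        // More retries: the same late file is picked up.
        System.out.println(waitForFile(appearsAfter(5), 10, 1L));  // prints true
    }
}
```

With the retry budget raised from 3 to 10, the same simulated delay no longer causes the file to be dropped, which is the effect the configuration change above is meant to have.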

{noformat}

2012-09-17 18:13:03,665 WARN  
[ReplicationExecutor-0.replicationSource,2-sea-lab-0,41831,1347930742751] 
regionserver.ReplicationSource(555): 2-sea-lab-0,41831,1347930742751 Got:
java.io.IOException: File from recovered queue is nowhere to be found
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:537)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:304)
Caused by: java.io.FileNotFoundException: File does not exist: 
hdfs://localhost:41196/user/hduser/hbase/.oldlogs/sea-lab-0%2C41831%2C1347930742751.1347930771911
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
        at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:796)
        at 
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
        at 
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
        at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:58)
        at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:166)
        at 
org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:689)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:503)
        ... 1 more

2012-09-17 18:13:03,665 WARN  
[ReplicationExecutor-0.replicationSource,2-sea-lab-0,41831,1347930742751] 
regionserver.ReplicationSource(559): Waited too long for this file, considering 
dumping

2012-09-17 18:13:03,665 INFO  
[ReplicationExecutor-0.replicationSource,2-sea-lab-0,41831,1347930742751] 
regionserver.ReplicationSourceManager(365): Done with the recovered queue 
2-sea-lab-0,41831,1347930742751

2012-09-17 18:13:04,305 DEBUG [main-EventThread] wal.HLogSplitter(657): 
Archived processed log 
hdfs://localhost:41196/user/hduser/hbase/.logs/sea-lab-0,41831,1347930742751-splitting/sea-lab-0%2C41831%2C1347930742751.1347930771911
 to 
hdfs://localhost:41196/user/hduser/hbase/.oldlogs/sea-lab-0%2C41831%2C1347930742751.1347930771911

2012-09-17 18:13:04,306 INFO  [main-EventThread] master.SplitLogManager(392): 
Done splitting 
/1/splitlog/hdfs%3A%2F%2Flocalhost%3A41196%2Fuser%2Fhduser%2Fhbase%2F.logs%2Fsea-lab-0%2C41831%2C1347930742751-splitting%2Fsea-lab-0%252C41831%252C1347930742751.1347930771911

{noformat}
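
The size of the race window can be read off the timestamps above: the "considering dumping" line and the "Archived processed log" line. A small sketch of that arithmetic, assuming only that the timestamps follow the `yyyy-MM-dd HH:mm:ss,SSS` layout seen in the snippet (the class and method names are made up for illustration):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGapSketch {

    // Timestamp layout as it appears in the log snippet above.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the number of milliseconds from the earlier to the later timestamp. */
    static long gapMillis(String earlier, String later) {
        LocalDateTime a = LocalDateTime.parse(earlier, FMT);
        LocalDateTime b = LocalDateTime.parse(later, FMT);
        return Duration.between(a, b).toMillis();
    }

    public static void main(String[] args) {
        String gaveUp = "2012-09-17 18:13:03,665";   // "Waited too long for this file"
        String archived = "2012-09-17 18:13:04,305"; // "Archived processed log"
        System.out.println(gapMillis(gaveUp, archived)); // prints 640
    }
}
```

So in this run the file showed up in .oldlogs 640 ms after the replication source had given up on it.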
                
> [replication] The replication-executor should make sure the file that it is 
> replicating is closed before declaring success on that file
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6758
>                 URL: https://issues.apache.org/jira/browse/HBASE-6758
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>         Attachments: 6758-1-0.92.patch, 6758-2-0.92.patch, 
> TEST-org.apache.hadoop.hbase.replication.TestReplication.xml
>
>
> I have seen cases where the replication-executor would lose data to replicate 
> since the file hasn't been closed yet. Upon closing, the new data becomes 
> visible. Before that happens the ZK node shouldn't be deleted in 
> ReplicationSourceManager.logPositionAndCleanOldLogs. Changes need to be made 
> in ReplicationSource.processEndOfFile as well (currentPath related).

