[ https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeffrey Zhong updated HBASE-8207:
---------------------------------

    Attachment: hbase-8207.patch

[~jeason] Thanks for the patch! Changing the delimiter increases the burden on 
existing users to apply the patch. I'm fine with changing it for 0.96 and 
onwards, though diverging code bases add more work for support and developers 
in the future.

I came up with a patch which keeps the existing "-". The reason we can do that 
is that we only have one place that consumes (parses) the znode string, so we 
can handle the situation there.

Please let me know what you think.

In the patch, we also include two small fixes (a rough sketch of both follows 
below):

1) When replication is waiting for log splitting to complete, there is no sleep 
between retries, so we keep hitting the HDFS NameNode.
2) Replication checks the possible file locations in reverse order, while in 
reality the likely log splitting location comes from the first failed RS. After 
the fix, we cut trips to the NameNode too.
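
Roughly, both fixes have the following shape (hypothetical names, not the 
actual ReplicationSource code): probe the candidate WAL locations starting from 
the first failed RS, and sleep between rounds instead of hammering the 
NameNode.

{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch of the two fixes (hypothetical names, not the actual
 * ReplicationSource code): probe candidate WAL locations starting from the
 * first failed RS, and back off between rounds while log splitting is still
 * in progress.
 */
public class RecoveredWalLocator {

  public static Path locate(FileSystem fs, Path logDir, String walName,
      List<String> deadRegionServersInFailureOrder,
      int maxRetries, long sleepMs) throws IOException, InterruptedException {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      // Fix 2: probe in failure order, since the WAL most likely moved to the
      // splitting dir of the first RS that died.
      for (String deadRs : deadRegionServersInFailureOrder) {
        for (String dir : new String[] { deadRs, deadRs + "-splitting" }) {
          Path candidate = new Path(logDir, dir + "/" + walName);
          if (fs.exists(candidate)) {   // one NameNode round trip per check
            return candidate;
          }
        }
      }
      // Fix 1: back off before the next round of existence checks.
      Thread.sleep(sleepMs);
    }
    return null; // caller decides whether to give up after maxRetries
  }
}
{code}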

I'm still thinking about how to include this in our tests so that we can catch 
this error earlier.

Thanks,
-Jeffrey

                
> Replication could have data loss when machine name contains hyphen "-"
> ----------------------------------------------------------------------
>
>                 Key: HBASE-8207
>                 URL: https://issues.apache.org/jira/browse/HBASE-8207
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.95.0, 0.94.6
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>            Priority: Critical
>             Fix For: 0.95.0, 0.98.0, 0.94.7
>
>         Attachments: failed.txt, hbase-8207.patch, HBASE-8212-94.patch
>
>
> In the recent TestReplication* test case failures, I'm finally able to find 
> the cause (or one of the causes) of the intermittent failures.
> When a machine name contains "-", it breaks the function 
> ReplicationSource.checkIfQueueRecovered and causes the following issue: the 
> deadRegionServers list is way off, so replication doesn't wait for log 
> splitting to finish for a WAL file and moves on to the next one (data loss).
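> For illustration only (a hypothetical snippet, not the actual HBase code): 
> naively splitting the recovered-queue znode name on "-" shreds a hyphenated 
> hostname into fragments, which is where the bogus paths below come from.
> {code}
> // Hypothetical illustration of the root cause, not the actual HBase code.
> public class HyphenSplitDemo {
>   public static void main(String[] args) {
>     String znode = "2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125";
>     // Splitting the znode name on "-" breaks the hostname apart:
>     for (String piece : znode.split("-")) {
>       System.out.println(piece);  // 2, ip, 10, 197, 0, 156.us, west, 1.compute.internal,52170,...
>     }
>   }
> }
> {code}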
> You can see that replication uses those weird paths constructed from 
> deadRegionServers to check for a file's existence:
> {code}
> 2013-03-26 21:26:51,385 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,386 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,387 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,389 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,391 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,394 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,396 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,398 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> {code}
> This happened in the recent test failure in 
> http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
> Search for 
> {code}
> File does not exist: 
> hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540
> {code}
> After 10 retries, the replication source gave up and moved on to the next 
> file. Data loss happens.
> Since lots of EC2 machine names contain "-", including our Jenkins servers, 
> this is a high-impact issue.
