[ 
https://issues.apache.org/jira/browse/HBASE-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615963#comment-13615963
 ] 

Jean-Marc Spaggiari commented on HBASE-8207:
--------------------------------------------

Anything that is not allowed in a host name might be acceptable, like a pipe, 
% or, as [~jxiang] proposed, a non-printable delimiter. But if we need to 
print it in the logs, then the pipe option might be better? Also, if this is 
stored in ZK, then we might indeed need to accept both the existing "-" and the 
new delimiter for the reads, and use only the new delimiter for the writes...
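
To make the delimiter idea concrete, here is a minimal sketch. It is not the actual ReplicationSource code: the class QueueNameSketch, its method names, and the choice of "|" as the new delimiter are all hypothetical, and it assumes the queue name has the form peer id + delimiter + dead server name(s). It shows what a plain split on "-" does to the server name from the logs, and how a read path could accept both delimiters while the write path emits only the new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class QueueNameSketch {
    // Hypothetical new delimiter; '|' is not a legal host name character.
    static final String NEW_DELIMITER = "|";

    // Legacy-style parsing (simplified): every "-" is treated as a separator.
    // The first part is the peer id, the rest are taken as dead region servers.
    static List<String> splitOnHyphen(String queueName) {
        String[] parts = queueName.split("-");
        return new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
    }

    // Same idea, but splitting on the hypothetical new delimiter.
    static List<String> splitOnNewDelimiter(String queueName) {
        String[] parts = queueName.split(Pattern.quote(NEW_DELIMITER));
        return new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
    }

    // Read path: prefer the new delimiter, fall back to "-" for old queue names.
    static List<String> readDeadServers(String queueName) {
        return queueName.contains(NEW_DELIMITER)
                ? splitOnNewDelimiter(queueName)
                : splitOnHyphen(queueName);
    }

    // Write path: only ever emit the new delimiter.
    static String writeQueueName(String peerId, List<String> deadServers) {
        StringBuilder sb = new StringBuilder(peerId);
        for (String server : deadServers) {
            sb.append(NEW_DELIMITER).append(server);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String server =
            "ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125";

        // Hyphen split: the single server name is broken into seven fragments.
        System.out.println(splitOnHyphen("2-" + server));

        // New delimiter: the server name survives the round trip intact.
        String queueName = writeQueueName("2", Arrays.asList(server));
        System.out.println(readDeadServers(queueName));
    }
}
```

With the hyphen split, the fragments include "west", "156.us", "0" and "1.compute.internal,52170,1364333181125", which are the bogus directory names showing up in the "Possible location" log lines. Note that the fallback in readDeadServers still mis-parses old queue names whose host contains "-"; the sketch only shows the read/write split, not a repair for names that are already stored.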
                
> Replication could have data loss when machine name contains hyphen "-"
> ----------------------------------------------------------------------
>
>                 Key: HBASE-8207
>                 URL: https://issues.apache.org/jira/browse/HBASE-8207
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.95.0, 0.94.6
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>            Priority: Critical
>             Fix For: 0.95.0, 0.98.0, 0.94.7
>
>         Attachments: failed.txt
>
>
> While looking into the recent TestReplication* test failures, I was finally 
> able to find the cause (or one of the causes) of the intermittent failures.
> When a machine name contains "-", it breaks the function 
> ReplicationSource.checkIfQueueRecovered. This causes the following issue:
> the deadRegionServers list is way off, so replication doesn't wait for log 
> splitting to finish for a WAL file and moves on to the next one (data loss).
> You can see that replication uses the weird paths constructed from 
> deadRegionServers to check for a file's existence:
> {code}
> 2013-03-26 21:26:51,385 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,386 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/1.compute.internal,52170,1364333181125-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,387 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/west/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,389 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/west-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,391 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,394 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/156.us-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,396 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/0/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> 2013-03-26 21:26:51,398 INFO  
> [ReplicationExecutor-0.replicationSource,2-ip-10-197-0-156.us-west-1.compute.internal,52170,1364333181125]
>  regionserver.ReplicationSource(524): Possible location 
> hdfs://localhost:52882/user/ec2-user/hbase/.logs/0-splitting/ip-10-197-0-156.us-west-1.compute.internal%252C52170%252C1364333181125.1364333199540
> {code}
> This happened in the recent test failure in 
> http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/21/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/?auto_refresh=false
> Search for 
> {code}
> File does not exist: 
> hdfs://localhost:52882/user/ec2-user/hbase/.oldlogs/ip-10-197-0-156.us-west-1.compute.internal%2C52170%2C1364333181125.1364333199540
> {code}
> After 10 retries, the replication source gave up and moved on to the next 
> file. Data loss happens. 
> Since lots of EC2 machine names contain "-", including our Jenkins servers, 
> this is a high-impact issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
