[ 
https://issues.apache.org/jira/browse/HDFS-5442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849350#comment-13849350
 ] 

Lohit Vijayarenu commented on HDFS-5442:
----------------------------------------

Thanks for sharing the design document. This looks to be a very good start.
A few initial comments. It might be good to break the work up into two major
features:
1. BlockAllocation policy for cross-datacenter writes (which I understand is
the synchronous replication from the design document)
2. Asynchronous replication
This would give users the flexibility to choose either feature based on their
use case and infrastructure support.
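
To make that split concrete, here is a minimal sketch of what two
independently switchable features could look like through the standard
Configuration API. Both property names are hypothetical, mine rather than
the design document's:

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch: the two proposed features toggled independently.
    // NOTE: both property names are hypothetical, for illustration only.
    public class CrossDcFeatureToggles {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Feature 1: synchronous cross-datacenter block allocation.
            conf.setBoolean("dfs.crossdc.sync.replication.enabled", true);
            // Feature 2: asynchronous replication, usable on its own.
            conf.setBoolean("dfs.crossdc.async.replication.enabled", false);

            System.out.println("sync  enabled: "
                + conf.getBoolean("dfs.crossdc.sync.replication.enabled", false));
            System.out.println("async enabled: "
                + conf.getBoolean("dfs.crossdc.async.replication.enabled", false));
        }
    }
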
A few more very high-level comments:

- There seems to be an assumption in a few places that the entire namespace
is replicated. This might not be desirable in many cases. Enabling this
feature per directory, or for a list of directories, would be very useful
(a configuration sketch follows this list).
- There seems to be an assumption of one primary cluster and one secondary
cluster. Can this be chained into something like A->B and B->C? Or even the
use case of A->B or B->A? Calling these topologies out with configuration
options would be very useful for cluster admins (see the same sketch below).
- Another place that needs more information is the primary cluster NN
tracking DataNode information from the secondary cluster (via the secondary
cluster NN). This needs more thought to see if it is really scalable. I
assume this would mean DataNodes now have globally unique identifiers. How
are DataNode failures handled and communicated back to the primary NN? How
are DataNodes allocated for reads? How is space accounted for within
clusters? How are block ids kept unique across different clusters, and so
on? Having more details on these would be very useful (a sketch of one way
to get globally unique block ids also follows this list).
- Minor: it might be worth changing Primary/Secondary to Source/Destination
cluster. It is a little confusing when also thinking about Primary/Secondary
NameNodes in the same document.
- Adding a few failure and recovery cases would be useful. For example, in
synchronous replication, what happens when the secondary cluster is slow or
down? How would data be re-replicated? (One possible fallback is sketched
after this list.)
- How would the ReplicationManager, or changing the replication factor of
files, work in general with this policy?
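
To illustrate the per-directory and chaining comments above, a minimal
configuration sketch. Every property name below is hypothetical; none of
these keys exist in HDFS today, they only show the shape such options could
take:

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical sketch for cluster B in a chained A -> B -> C topology.
    public class CrossDcTopologySketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Replicate only selected directories, not the whole namespace.
            conf.set("dfs.crossdc.replicated.paths", "/warehouse,/logs/critical");
            // B pulls from A (upstream) and pushes to C (downstream), so
            // chains, and even A->B together with B->A, can be described.
            conf.set("dfs.crossdc.upstream", "hdfs://clusterA-nn:8020");
            conf.set("dfs.crossdc.downstream", "hdfs://clusterC-nn:8020");
        }
    }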
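
On the globally unique identifier question, one possible shape (again only
my sketch, not something from the design document) is to qualify every local
block id with a cluster id, so ids from different clusters can never collide:

    // Sketch: a block id qualified by a cluster id. Purely illustrative.
    public class GlobalBlockId {
        private final int clusterId;     // assigned once per cluster by admins
        private final long localBlockId; // the existing per-cluster block id

        public GlobalBlockId(int clusterId, long localBlockId) {
            this.clusterId = clusterId;
            this.localBlockId = localBlockId;
        }

        // Encodes as "clusterId:blockId", e.g. "2:1073741825".
        @Override
        public String toString() {
            return clusterId + ":" + localBlockId;
        }

        public static GlobalBlockId parse(String s) {
            String[] parts = s.split(":", 2);
            return new GlobalBlockId(
                Integer.parseInt(parts[0]), Long.parseLong(parts[1]));
        }
    }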
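
For the slow or down secondary cluster case, one possible behavior (my
assumption, not a proposal from the document) is a bounded wait on the
remote acknowledgement, after which the write succeeds locally and the block
is queued for asynchronous re-replication:

    import java.util.concurrent.*;

    // Sketch of a bounded-wait fallback for synchronous replication.
    // Illustrative only; no such hook exists in HDFS today.
    public class SyncWriteFallbackSketch {
        private final BlockingQueue<Long> reReplicationQueue =
            new LinkedBlockingQueue<>();

        void completeWrite(long blockId, Future<Boolean> remoteAck) {
            try {
                // Wait a bounded time for the remote datacenter's ack.
                remoteAck.get(5, TimeUnit.SECONDS);
            } catch (TimeoutException | ExecutionException e) {
                // Secondary slow or down: queue for async re-replication.
                reReplicationQueue.add(blockId);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                reReplicationQueue.add(blockId);
            }
        }
    }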

> Zero loss HDFS data replication for multiple datacenters
> --------------------------------------------------------
>
>                 Key: HDFS-5442
>                 URL: https://issues.apache.org/jira/browse/HDFS-5442
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Avik Dey
>         Attachments: Disaster Recovery Solution for Hadoop.pdf
>
>
> Hadoop is architected to operate efficiently at scale for normal hardware 
> failures within a datacenter. Hadoop is not designed today to handle 
> datacenter failures. Although HDFS is not designed for nor deployed in 
> configurations spanning multiple datacenters, replicating data from one 
> location to another is common practice for disaster recovery and global 
> service availability. There are current solutions available for batch 
> replication using data copy/export tools. However, while providing some 
> backup capability for HDFS data, they do not provide the capability to 
> recover all your HDFS data from a datacenter failure and be up and running 
> again with a fully operational Hadoop cluster in another datacenter in a 
> matter of minutes. For disaster recovery from a datacenter failure, we should 
> provide a fully distributed, zero-data-loss, low-latency, high-throughput,
> and secure HDFS data replication solution for a multi-datacenter setup.
> Design and code for Phase-1 to follow soon.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
