There isn't any multi-DC support in Hadoop core right now. Block placement and work scheduling are rack-aware, on the assumption that there is a bandwidth cost for working across the backbone, but the network is still modelled as gigabit rate all the way up the tree. (The topology itself is nothing more than a host-to-label mapping; a toy sketch of that follows.)
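(For context: rack awareness is usually wired in via a topology script, net.topology.script.file.name on Hadoop 2.x, and all it produces is a host -> "/dc/rack" label. The snippet below is a toy sketch of what that mapping amounts to, not Hadoop's API; host and DC names are hypothetical. Note that the tree carries no bandwidth, latency or failure figures at all, which is why HDFS and the schedulers can't reason about a WAN hop.)

import java.util.HashMap;
import java.util.Map;

// Toy sketch only: a static host -> "/dc/rack" lookup, i.e. what a topology
// script returns. Nothing here (or in the placement code that consumes it)
// models WAN latency, loss or partitions between /dc1 and /dc2; the labels
// only say "further apart in the hierarchy".
public class ToyTopology {
  private static final Map<String, String> RACK_OF = new HashMap<String, String>();
  static {
    RACK_OF.put("node1.dc1.example.com", "/dc1/rack1");
    RACK_OF.put("node2.dc1.example.com", "/dc1/rack2");
    RACK_OF.put("node1.dc2.example.com", "/dc2/rack1");
  }

  public static String resolve(String host) {
    String rack = RACK_OF.get(host);
    return rack != null ? rack : "/default-rack";
  }

  public static void main(String[] args) {
    System.out.println(resolve("node1.dc2.example.com")); // prints /dc2/rack1
  }
}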
Even if your cross-site bandwidth is that high, other assumptions in the Hadoop code surface:

- latency of cross-rack communications is low and not significantly different from intra-rack comms
- protocols between machines can assume that packet failures are the kind that surface on a LAN, not a WAN (lower failure rate, less buffered/bursty traffic)
- external infrastructure (such as DNS and NTP) is consistent everywhere
- network partitions are rare, and significant enough to react to by re-replicating data - an expensive operation, and overkill for a transient WAN outage

ZooKeeper will be extra-brittle here, as group membership protocols tend to have aggressive timeouts to detect loss of members fast. I'd be really surprised if it worked well across more than one site.

Wandisco do provide multi-DC support for Hadoop platforms: http://www.wandisco.com/hadoop/non-stop-hadoop - I don't know what that does about applications running on the cluster, or that depend on ZK.

I suspect that anyone working across sites is going to have to run ZK & Accumulo on each, and treat operations that span sites as separate queries whose results need to be merged in - something that would need to go into the query engines. For now you could build a workflow that ran an MR job on each site, then a final reduce on one of the sites (a rough sketch follows the quoted message below).

-Steve

On 22 January 2014 19:01, Jeff N <[email protected]> wrote:

> @Adam
> I am currently interested in the latter half of your second question. My
> main interest lies in determining how to optimize data processing. If I
> have two data centers that are geographically far apart and I am working
> on a local machine but I need data from the second data center, how do I
> have the processing occur on the second data center? The constraints to
> this problem include a lack of knowledge of which HDFS node holds the
> data, only that it is within the network topology I currently reside in.
> Furthermore, it pertains to Map/Reduce jobs that utilize the
> AccumuloInputFormat. Is it possible to have the distant data center
> process my Mapper and send me the resulting data set instead of
> processing the Mapper locally and making numerous network queries?
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/Rack-and-Datacenter-Awareness-tp7193p7225.html
> Sent from the Developers mailing list archive at Nabble.com.
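(Rough sketch of the per-site MR + merge workflow mentioned above, not a drop-in implementation. Instance names, ZK quorums, table name, credentials and output paths are hypothetical, and the input-format calls are the Accumulo 1.5-era mapreduce API - check them against your version. The idea is that each job is submitted to its own site's cluster, so the mappers run next to the data and only the much smaller results cross the WAN for the final merge.)

import java.io.IOException;

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PerSiteScan {

  // Emits one (rowId, value) pair per Accumulo entry; the real per-site
  // filtering/aggregation logic would go here.
  public static class ScanMapper extends Mapper<Key, Value, Text, Text> {
    @Override
    protected void map(Key k, Value v, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(k.getRow().toString()), new Text(v.toString()));
    }
  }

  // Builds a map-only job against one site's Accumulo + ZooKeeper. The
  // Configuration passed in is expected to point at that site's own
  // YARN/HDFS (fs.defaultFS, yarn.resourcemanager.address), so the job
  // actually runs where the data lives.
  static Job buildSiteJob(Configuration siteConf, String instance,
                          String zkQuorum, String outDir) throws Exception {
    Job job = Job.getInstance(siteConf, "per-site scan: " + instance);
    job.setJarByClass(PerSiteScan.class);
    job.setInputFormatClass(AccumuloInputFormat.class);
    AccumuloInputFormat.setZooKeeperInstance(job, instance, zkQuorum);
    AccumuloInputFormat.setConnectorInfo(job, "scanner",
        new PasswordToken("secret"));           // hypothetical credentials
    AccumuloInputFormat.setInputTableName(job, "mytable");
    AccumuloInputFormat.setScanAuthorizations(job, new Authorizations());
    job.setMapperClass(ScanMapper.class);
    job.setNumReduceTasks(0);                   // map-only per-site pass
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(outDir));
    return job;
  }

  public static void main(String[] args) throws Exception {
    // Site-local configs: in practice loaded from each site's *-site.xml.
    Configuration dc1 = new Configuration();
    Configuration dc2 = new Configuration();

    buildSiteJob(dc1, "accumulo-dc1", "zk1.dc1:2181", "/tmp/scan-dc1")
        .waitForCompletion(true);
    buildSiteJob(dc2, "accumulo-dc2", "zk2.dc2:2181", "/tmp/scan-dc2")
        .waitForCompletion(true);

    // Final step (not shown): distcp the smaller result set to one site and
    // run a single merge/reduce job there over both output directories.
  }
}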
