Like you said, it depends both on the kind of network you have and the type of your workload.
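To put rough numbers on that dependence, here is a back-of-envelope sketch. It assumes a task's time is simply read time plus compute time, with no overlap between the two, and every throughput and size in it is a made-up illustrative value rather than a measurement from any real cluster. The function names are hypothetical too.

```python
def read_seconds(block_mb, throughput_mb_s):
    """Time to read one input block at a given sustained throughput."""
    return block_mb / throughput_mb_s


def locality_speedup(block_mb, compute_s, local_mb_s, remote_mb_s):
    """Ratio of remote-read task time to local-read task time.

    Assumes task time = read time + compute time (no overlap), which is a
    simplification. A value near 1.0 means locality barely matters.
    """
    local_task = read_seconds(block_mb, local_mb_s) + compute_s
    remote_task = read_seconds(block_mb, remote_mb_s) + compute_s
    return remote_task / local_task


# Hypothetical numbers: 128 MB block, local disk 128 MB/s, remote store 64 MB/s.
io_bound = locality_speedup(128, compute_s=0, local_mb_s=128, remote_mb_s=64)
cpu_bound = locality_speedup(128, compute_s=100, local_mb_s=128, remote_mb_s=64)
print(io_bound, cpu_bound)
```

With these assumed values the purely I/O-bound task takes twice as long without locality, while the compute-heavy task slows down by about 1%. Which of those two regimes a job resembles is what decides how much locality matters.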
Given your point about S3, I'd guess your input files/blocks are not large enough for moving code to the data to beat moving the data itself to the code. When that balance tilts heavily, especially with large input files/blocks, data locality will improve performance significantly. The same holds when read throughput from a remote disk is much lower than from a local disk.

HTH
+Vinod

On Mar 21, 2014, at 7:06 PM, Mike Sam <mikesam...@gmail.com> wrote:

> How important is Data Locality to Hadoop? I mean, if we prefer to separate
> the HDFS cluster from the MR cluster, we will lose data locality but my
> question is how bad is this assuming we provide a reasonable network
> connection between the two clusters? EMR kills data locality when using S3
> as storage but we do not see a significant job time difference running the same
> job from the HDFS cluster of the same setup. So, I am wondering
> how important is Data Locality to Hadoop in practice?
>
> Thanks,
> Mike