Like you said, it depends on both the kind of network you have and the type 
of workload you run.

Given your point about S3, I'd guess your input files/blocks are not large 
enough for moving code to the data to beat moving the data to the code. When 
that balance tilts a lot, especially with large input files/blocks, 
data locality will improve performance significantly. The same holds when 
the read throughput from a remote disk << the read throughput from a local disk.
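
To make that trade-off concrete, here is a rough back-of-the-envelope sketch 
in Python. It is not taken from Hadoop; the function names, the fixed per-task 
overhead and all the throughput numbers are assumptions picked only for 
illustration. It estimates how much slower a task gets when it reads its 
block remotely instead of locally.

```python
def read_seconds(block_mb, throughput_mb_per_s):
    """Time to stream one block at a given sustained throughput."""
    return block_mb / throughput_mb_per_s


def remote_slowdown(block_mb, local_mb_per_s, remote_mb_per_s, task_overhead_s):
    """Ratio of remote-read task time to local-read task time.

    task_overhead_s is a hypothetical fixed cost per task (startup,
    scheduling) that is paid regardless of where the data lives.
    """
    local = task_overhead_s + read_seconds(block_mb, local_mb_per_s)
    remote = task_overhead_s + read_seconds(block_mb, remote_mb_per_s)
    return remote / local


if __name__ == "__main__":
    # Illustrative numbers only: 100 MB/s local, 25 MB/s remote, 5 s overhead.
    for block_mb in (1, 128, 1024):
        ratio = remote_slowdown(block_mb, 100.0, 25.0, 5.0)
        print(f"{block_mb:>5} MB block: remote is {ratio:.2f}x the local time")
```

With small blocks the fixed overhead dominates and the ratio stays close to 1, 
which would match what you see on EMR with S3. With large blocks, or a much 
slower remote read path, the ratio grows and locality starts to matter.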

HTH
+Vinod

On Mar 21, 2014, at 7:06 PM, Mike Sam <mikesam...@gmail.com> wrote:

> How important is Data Locality to Hadoop? I mean, if we prefer to separate
> the HDFS cluster from the MR cluster, we will lose data locality but my
> question is how bad is this assuming we provide a reasonable network
> connection between the two clusters? EMR kills data locality when using S3
> as storage but we do not see a significant job time difference running the
> same job from the HDFS cluster of the same setup. So, I am wondering
> how important is Data Locality to Hadoop in practice?
> 
> Thanks,
> Mike


