How important is data locality to Hadoop? If we separate the HDFS cluster from the MapReduce (MR) cluster, we lose data locality, but how bad is that, assuming we provide a reasonable network connection between the two clusters? EMR gives up data locality when it uses S3 as storage, yet we do not see a significant difference in job time compared with running the same job against an HDFS cluster in the same setup. So I am wondering how much data locality really matters to Hadoop in practice.
Thanks, Mike