If the data is on HDFS, neither framework reads it any faster or slower than the other. Both use the same logic to exploit locality, and both have to read and deserialize the data anyway. I don't think that is what anyone is claiming, though.
Spark can be faster in a multi-stage operation, which would require several MR jobs. Each MR job must hit disk again after the reducer, whereas Spark might not, possibly by persisting intermediate outputs in memory. A similar but larger speedup can be had for iterative computations that access the same data repeatedly: caching it means reading it from disk once, then re-reading it from memory only. For a single operation that really is one map and one reduce, starting and ending on HDFS, I would expect MR to be a bit faster, just because it is so optimized for this one pattern. Even that depends a lot on the specifics, and the difference wouldn't be significant.

On Sat, Apr 4, 2015 at 11:19 AM, SamyaMaiti <samya.maiti2...@gmail.com> wrote:
> How is spark faster than MR when data is in disk in both cases?
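To make the multi-stage point in the reply concrete, here is a toy model in plain Python. It is not Spark or Hadoop code: `CountingStore`, `chained_mr_style` and `in_memory_style` are hypothetical names invented for this sketch, and the "disk" is just a list that counts how often it is read or written. It only shows where the extra disk round trips come from when each stage runs as a separate job.

```python
class CountingStore:
    """Hypothetical stand-in for HDFS that counts reads and writes."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self._records)

    def write(self, records):
        self.writes += 1
        self._records = list(records)


def chained_mr_style(store, stages):
    """Run each stage as its own job: read input from disk, write output to disk."""
    data = []
    for stage in stages:
        data = [stage(x) for x in store.read()]
        store.write(data)
    return data


def in_memory_style(store, stages):
    """Read once, keep intermediate results in memory, write the final result once."""
    data = store.read()
    for stage in stages:
        data = [stage(x) for x in data]
    store.write(data)
    return data


if __name__ == "__main__":
    # Three arbitrary per-record stages, chosen only for illustration.
    stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

    mr_store = CountingStore(range(5))
    mem_store = CountingStore(range(5))

    chained_mr_style(mr_store, stages)
    in_memory_style(mem_store, stages)

    print("chained jobs: reads =", mr_store.reads, "writes =", mr_store.writes)
    print("in memory:    reads =", mem_store.reads, "writes =", mem_store.writes)
```

With three stages, the chained version touches the store three times for reading and three times for writing, while the in-memory version reads once and writes once; both produce the same result. With a single stage the two are identical, which matches the single map-and-reduce case above.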