If the data is on HDFS, neither framework reads it any more or less
quickly than the other. Both in fact use the same logic to exploit
data locality, and both still have to read and deserialize the data.
I don't think that is where anyone claims the speedup comes from,
though.

Spark can be faster in a multi-stage operation, which would require
several chained MapReduce jobs. Each MR job must write its output to
disk after the reducer, and the next job must read it back, whereas
Spark might not, possibly by persisting intermediate outputs in
memory. A similar but larger speedup can be had for iterative
computations that access the same data repeatedly; caching it means
reading it from disk once and then re-reading it from memory only.
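
As a rough illustration of that argument, here is a toy cost model in
Python. It is not Spark or Hadoop code: CountingStore, chained_jobs,
pipelined_job and the iterate_* helpers are all hypothetical names
made up for this sketch, and the only thing it measures is how many
times a simulated "disk" is touched. Real jobs also pay for shuffles,
serialization and scheduling, which this ignores.

```python
# Toy cost model only: counts simulated disk accesses. All names here are
# hypothetical and do not correspond to any Spark or Hadoop API.


class CountingStore:
    """Stands in for HDFS; counts every read and write."""

    def __init__(self, data):
        self.files = {"input": list(data)}
        self.reads = 0
        self.writes = 0

    def read(self, name):
        self.reads += 1
        return list(self.files[name])

    def write(self, name, records):
        self.writes += 1
        self.files[name] = list(records)


def chained_jobs(store, stages):
    """MR-style: every stage is its own job, reading its input from the
    store and writing its output back. Returns the final output name."""
    name = "input"
    for i, stage in enumerate(stages):
        records = store.read(name)
        name = f"stage{i}"
        store.write(name, [stage(r) for r in records])
    return name


def pipelined_job(store, stages):
    """Spark-style (assuming intermediates fit in memory): read once,
    apply all stages in memory, write once. Returns the output name."""
    records = store.read("input")
    for stage in stages:
        records = [stage(r) for r in records]
    store.write("output", records)
    return "output"


def iterate_uncached(store, n):
    """Iterative computation that re-reads the input on every pass."""
    total = 0
    for _ in range(n):
        total += sum(store.read("input"))
    return total


def iterate_cached(store, n):
    """Same computation, but the input is read once and kept in memory."""
    cached = store.read("input")
    total = 0
    for _ in range(n):
        total += sum(cached)
    return total
```

With three stages, chained_jobs touches the store for 3 reads and 3
writes, while pipelined_job does 1 read and 1 write. With 10
iterations, iterate_uncached does 10 reads and iterate_cached does 1.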

For a single operation that really is one map and one reduce, starting
and ending on HDFS, I would expect MR to be a bit faster just because
it is so optimized for this one pattern. Even that depends a lot on
the job, and the difference wouldn't be significant.


On Sat, Apr 4, 2015 at 11:19 AM, SamyaMaiti <samya.maiti2...@gmail.com> wrote:
> How is spark faster than MR when data is in disk in both cases?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Vs-MR-tp22373.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

