Re: Does Spark use more memory than MapReduce?

2015-10-16 Thread Gylfi
By default Spark will actually not keep the data at all; it only stores the lineage, i.e. "how" to recreate the data. 
The programmer can, however, choose to keep the data once instantiated by
calling .persist() or .cache() on the RDD. 
.cache() is shorthand for .persist(StorageLevel.MEMORY_ONLY): the data is kept in memory only, and partitions that do not fit are simply not cached and get recomputed when needed. 
.persist(StorageLevel.MEMORY_AND_DISK) uses memory but spills to local disk if needed. 
.persist(StorageLevel.DISK_ONLY) writes it all to disk (no in-memory
overhead). 

See:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
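
For concreteness, here is a minimal sketch of the options above, assuming a spark-shell session (so sc already exists) and a made-up input path:

import org.apache.spark.storage.StorageLevel

// Placeholder path; sc comes from the spark-shell.
val lines = sc.textFile("hdfs:///data/input.txt")

// Pick ONE level per RDD (re-persisting at a different level throws
// an UnsupportedOperationException):
lines.cache()                                  // = persist(StorageLevel.MEMORY_ONLY)
// lines.persist(StorageLevel.MEMORY_AND_DISK) // keep what fits, spill the rest
// lines.persist(StorageLevel.DISK_ONLY)       // no in-memory copy at all

// Nothing is materialized until an action runs:
println(lines.count())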

In addition, you can define your own StorageLevel, so if you have both
magnetic and SSD disks you can choose how and where to persist the data
(depending on how "hot" you consider it). 
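
A rough sketch of what such a custom level looks like with the StorageLevel factory in the 1.x RDD API, if I have the flag order right (useDisk, useMemory, useOffHeap, deserialized, replication):

import org.apache.spark.storage.StorageLevel

// On disk and in memory, stored serialized, replicated twice.
val diskAndSerializedMemory2x = StorageLevel(
  true,   // useDisk
  true,   // useMemory
  false,  // useOffHeap
  false,  // deserialized (false = keep the blocks serialized)
  2       // replication
)

val someRdd = sc.parallelize(1 to 1000000)  // stand-in for a real dataset
someRdd.persist(diskAndSerializedMemory2x)

If I recall correctly, where the on-disk blocks actually land is governed by spark.local.dir on each worker, so that is the knob for pointing Spark at the SSDs.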

Essentially, you have full freedom to do what you will with the data in
Spark :)  

Hope this helps. 






Re: Does Spark use more memory than MapReduce?

2015-10-12 Thread Jean-Baptiste Onofré

Hi,

I think it depends on the storage level you use (MEMORY_ONLY, DISK_ONLY, or 
MEMORY_AND_DISK).
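
For example (a rough spark-shell sketch, the path is made up), the serialized variants of those levels trade CPU for a smaller memory footprint, which goes straight to the memory question:

import org.apache.spark.storage.StorageLevel

// Placeholder path; sc comes from the spark-shell.
val events = sc.textFile("hdfs:///data/events.log")

// MEMORY_ONLY keeps deserialized Java objects: fastest to access but the
// largest footprint. MEMORY_ONLY_SER keeps one serialized buffer per
// partition: much more compact, at the cost of deserializing on access.
events.persist(StorageLevel.MEMORY_ONLY_SER)

events.count()  // materializes the cache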


By default, Spark's in-memory processing requires more memory but is 
much faster: when you chain MapReduce jobs, each job has to write its 
output to HDFS before the next one can read it, so HDFS is the backend 
of the data pipeline between tasks. In Spark, the disk flush is not 
always performed: it tries to keep data in memory as much as possible. 
So there is a balance to find between fast processing and memory 
consumption.
In some cases, using the disk is faster anyway (for instance, a 
MapReduce shuffle can be faster than a Spark shuffle, but you have an 
option to run a ShuffleMapReduceTask from Spark).
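
To make the pipeline point concrete, a hedged sketch (spark-shell assumed, paths made up) of a two-step job: classic MapReduce would run it as two chained jobs with an HDFS write and re-read in between, while Spark only materializes the intermediate result if you ask it to:

import org.apache.spark.storage.StorageLevel

// Step 1: word counts (MapReduce job #1 would have to write this to HDFS).
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Reused twice below, so keep it around without an HDFS round trip;
// MEMORY_AND_DISK spills to local disk when memory gets tight.
counts.persist(StorageLevel.MEMORY_AND_DISK)

// Step 2: further work (MapReduce job #2 would re-read job #1's output).
println(counts.filter { case (_, n) => n > 100 }.count())
println(counts.map(_._2).reduce(_ + _))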


I'm speaking under cover of the experts ;)

Regards
JB

On 10/12/2015 06:52 PM, YaoPau wrote:

I had this question come up and I'm not sure how to answer it.  A user said
that, for a big job, he thought it would be better to use MapReduce since it
writes to disk between iterations instead of keeping the data in memory the
entire time like Spark generally does.

I mentioned that Spark can cache to disk as well, but I'm not sure about the
overarching question (which I realize is vague): for a typical job, would
Spark use more memory than a MapReduce job?  Are there any memory usage
inefficiencies from either?






--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com




Does Spark use more memory than MapReduce?

2015-10-12 Thread YaoPau
I had this question come up and I'm not sure how to answer it.  A user said
that, for a big job, he thought it would be better to use MapReduce since it
writes to disk between iterations instead of keeping the data in memory the
entire time like Spark generally does.

I mentioned that Spark can cache to disk as well, but I'm not sure about the
overarching question (which I realize is vague): for a typical job, would
Spark use more memory than a MapReduce job?  Are there any memory usage
inefficiencies from either?


