Re: Does Spark use more memory than MapReduce?
By default Spark will actually not keep the data at all; it only stores the lineage, i.e. "how" to recreate the data. The programmer can however choose to keep the data once instantiated by calling .persist() or .cache() on the RDD. .cache() (and .persist() with no argument) stores the data in memory only; partitions that do not fit are simply not cached and are recomputed on the fly when needed. .persist(StorageLevel) lets you pick other levels, e.g. MEMORY_AND_DISK to spill to disk when memory runs out, or DISK_ONLY to write it all to disk (no in-memory overhead).

See: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

In addition, you can define your own StorageLevel, so if you have magnetic and SSD disks you can choose to persist the data to the storage tier you want (depending on how "hot" you consider the data). Essentially, you have full freedom to do what you will with the data in Spark :)

Hope this helps.
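For concreteness, here is a minimal sketch of those options in Scala; the input path, RDD names, and local master are made up for illustration, and an RDD can only be assigned one storage level, hence the separate RDDs:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(
      new SparkConf().setAppName("persist-example").setMaster("local[*]"))

    // Hypothetical input path, purely for illustration.
    val lines = sc.textFile("hdfs:///data/input.txt")

    // MEMORY_ONLY (same as .cache()): partitions that don't fit in memory
    // are not cached and are recomputed from the lineage when needed.
    val inMemory = lines.map(_.toUpperCase).persist(StorageLevel.MEMORY_ONLY)

    // MEMORY_AND_DISK: keep in memory, spill what doesn't fit to local disk.
    val spillable = lines.filter(_.nonEmpty).persist(StorageLevel.MEMORY_AND_DISK)

    // DISK_ONLY: everything goes to disk, no in-memory overhead.
    val onDisk = lines.map(_.length).persist(StorageLevel.DISK_ONLY)

    // A custom level: disk + memory, stored serialized, replicated on two nodes.
    val custom = StorageLevel(useDisk = true, useMemory = true,
      useOffHeap = false, deserialized = false, replication = 2)
    lines.persist(custom)

Serialized levels such as MEMORY_ONLY_SER trade some CPU for a much smaller in-memory footprint, which is often the first knob to turn when memory is tight.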
Re: Does Spark use more memory than MapReduce?
Hi,

I think it depends on the storage level you use (MEMORY_ONLY, DISK_ONLY, or MEMORY_AND_DISK). By default, the in-memory processing model of Spark requires more memory but is much faster: with MapReduce, each map and reduce task has to use HDFS as the backend of the data pipeline between tasks. In Spark, a disk flush is not always performed: it tries to keep data in memory as much as possible. So it's a balance to find between fast processing and memory consumption. In some cases, using the disk is faster anyway (for instance, a MapReduce shuffle can be faster than a Spark shuffle, but you have an option to run a ShuffleMapReduceTask from Spark).

I'm speaking under cover of the experts ;)

Regards
JB

On 10/12/2015 06:52 PM, YaoPau wrote:
> I had this question come up and I'm not sure how to answer it. A user said
> that, for a big job, he thought it would be better to use MapReduce since
> it writes to disk between iterations instead of keeping the data in memory
> the entire time like Spark generally does.
>
> I mentioned that Spark can cache to disk as well, but I'm not sure about
> the overarching question (which I realize is vague): for a typical job,
> would Spark use more memory than a MapReduce job? Are there any memory
> usage inefficiencies from either?

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
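To make that trade-off concrete for an iterative job, here is a small hedged sketch (the input path, threshold logic, and iteration count are invented): with MEMORY_AND_DISK, data reused on every pass stays in memory when it fits and spills to local disk otherwise, instead of being written back to and re-read from HDFS between jobs as in a MapReduce chain.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-example").setMaster("local[*]"))

    // Hypothetical input: one numeric value per line.
    val values = sc.textFile("hdfs:///data/values.txt")
      .map(_.toDouble)
      .persist(StorageLevel.MEMORY_AND_DISK)  // reused on every iteration

    var threshold = 0.0
    for (i <- 1 to 5) {
      val t = threshold
      // Each pass reuses the persisted RDD (memory, or local disk if spilled)
      // rather than recomputing it or re-reading it from HDFS.
      val above = values.filter(_ > t).count()
      println(s"iteration $i: $above values above $t")
      threshold += 10.0
    }

If the working set is nowhere near fitting in memory, DISK_ONLY (or simply not persisting at all) keeps Spark's memory footprint closer to that of an equivalent MapReduce job.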
Does Spark use more memory than MapReduce?
I had this question come up and I'm not sure how to answer it. A user said that, for a big job, he thought it would be better to use MapReduce since it writes to disk between iterations instead of keeping the data in memory the entire time like Spark generally does. I mentioned that Spark can cache to disk as well, but I'm not sure about the overarching question (which I realize is vague): for a typical job, would Spark use more memory than a MapReduce job? Are there any memory usage inefficiencies from either?