By default Spark will not keep the data at all; it only stores "how" to recreate it (the lineage). The programmer can, however, choose to keep the data once instantiated by calling .persist() or .cache() on the RDD. .cache() stores the data in memory only (it is shorthand for .persist(StorageLevel.MEMORY_ONLY)); partitions that don't fit are simply not cached and are recomputed on the fly when needed. Passing a different StorageLevel to .persist() changes that: MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk, and DISK_ONLY writes everything to disk (no in-memory overhead).
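For concreteness, here is a minimal Scala sketch of those options, assuming a local master and a hypothetical input.txt:

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext("local[*]", "persistence-demo")
    val lineLengths = sc.textFile("input.txt").map(_.length)

    // MEMORY_ONLY -- what .cache() gives you; partitions that don't
    // fit in memory are not stored and get recomputed on access
    lineLengths.persist(StorageLevel.MEMORY_ONLY)
    lineLengths.count()   // first action materializes the cache

    // Spark won't change a storage level in place, so drop it first
    lineLengths.unpersist()

    // MEMORY_AND_DISK -- keep what fits in memory, spill the rest
    lineLengths.persist(StorageLevel.MEMORY_AND_DISK)

    // DISK_ONLY -- no in-memory copy at all:
    // lineLengths.persist(StorageLevel.DISK_ONLY)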
See: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

In addition, you can build your own StorageLevel from its flags (disk, memory, off-heap, serialized form, replication), and if you have both magnetic and SSD disks you can point Spark's local directories (spark.local.dir) at whichever device matches how "hot" you consider the data. Essentially, you have full freedom to do what you will with the data in Spark :)
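As a sketch of that last point, continuing the example above (the flag values here are illustrative, not a recommendation):

    import org.apache.spark.storage.StorageLevel

    // Build a StorageLevel from its flags:
    // (useDisk, useMemory, useOffHeap, deserialized, replication).
    // This combination is equivalent to the built-in MEMORY_AND_DISK_2.
    val hotLevel = StorageLevel(
      useDisk = true,
      useMemory = true,
      useOffHeap = false,
      deserialized = true,
      replication = 2)

    lineLengths.persist(hotLevel)

Hope this helps.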