Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-10 Thread Josh Rosen
Based on Ben's helpful error description, I managed to reproduce this bug and found the root cause. There's a bug in MemoryStore's PartiallySerializedBlock class: it doesn't close a serialization stream before attempting to deserialize its serialized values, causing it to miss any data stored in
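
[Editor's note] Below is a minimal Python sketch, not Spark's Scala MemoryStore code, of the general pitfall Josh describes: if a buffered serialization stream is read back before it has been closed or flushed, whatever is still sitting in the buffer is silently lost. All names and data here are made up for illustration.

    import io
    import pickle

    raw = io.BytesIO()
    stream = io.BufferedWriter(raw, buffer_size=1 << 20)  # 1 MiB write buffer

    records = [{"id": i, "payload": "x" * 100} for i in range(1000)]
    for record in records:
        pickle.dump(record, stream)   # bytes accumulate in the writer's buffer

    # BUG: reading the underlying bytes before flushing/closing the stream
    # misses everything still held in the buffer.
    too_early = raw.getvalue()        # b"" here, since nothing was flushed yet

    stream.flush()                    # FIX: flush (or close) the stream first
    complete = raw.getvalue()

    print(len(too_early), len(complete))  # 0 vs. the full serialized size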

Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Josh Rosen
cache() / persist() is definitely *not* supposed to affect the result of a program, so the behavior that you're seeing is unexpected. I'll try to reproduce this myself by caching in PySpark under heavy memory pressure, but in the meantime the following questions will help me to debug: - Does

pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Ben Leslie
Hi, I'm trying to understand if there is any difference in correctness between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK). I can see that there may be differences in performance, but my expectation was that using either would result in the
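
[Editor's note] For reference, a minimal PySpark sketch of the comparison Ben describes; the data, app name, and master are invented. The two storage levels should only change where cached partitions live (dropped and recomputed vs. spilled to disk), never the computed result.

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[*]", "persist-correctness-check")
    rdd = sc.parallelize(range(1000000)).map(lambda x: (x % 100, x))

    rdd.persist(StorageLevel.MEMORY_ONLY)      # evicted partitions are recomputed
    count_memory_only = rdd.count()
    rdd.unpersist()

    rdd.persist(StorageLevel.MEMORY_AND_DISK)  # evicted partitions spill to disk
    count_memory_and_disk = rdd.count()

    # cache()/persist() should never change results, so these must agree.
    assert count_memory_only == count_memory_and_disk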

Re: MEMORY_ONLY vs MEMORY_AND_DISK

2015-03-18 Thread Prannoy
It depends. If the data on which the calculation is to be done is very large, then caching it with MEMORY_AND_DISK is useful. Even in that case, MEMORY_AND_DISK pays off mainly when the computation on the RDD is expensive. If the computation is very cheap, then even for large data sets MEMORY_ONLY can be
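
[Editor's note] A rough PySpark illustration of that trade-off, using a hypothetical job with an artificially expensive per-record function; everything here is invented for the example.

    import hashlib
    from pyspark import SparkContext, StorageLevel

    def expensive(x):
        # Stand-in for a costly per-record computation.
        h = str(x).encode()
        for _ in range(1000):
            h = hashlib.sha256(h).digest()
        return (x % 10, h)

    sc = SparkContext("local[*]", "storage-level-tradeoff")
    expensive_rdd = sc.parallelize(range(100000)).map(expensive)

    # Expensive to recompute: spilling evicted partitions to disk is cheaper
    # than redoing the work, so MEMORY_AND_DISK is the safer choice here.
    expensive_rdd.persist(StorageLevel.MEMORY_AND_DISK)

    # A cheap transformation over the same data could instead use MEMORY_ONLY,
    # since recomputing a dropped partition costs little.
    print(expensive_rdd.count())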

MEMORY_ONLY vs MEMORY_AND_DISK

2015-03-18 Thread sergunok
Which persistence level is better if the RDD to be cached is expensive to recompute? Am I right that it is MEMORY_AND_DISK? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MEMORY-ONLY-vs-MEMORY-AND-DISK-tp22130.html Sent from the Apache Spark User List mailing