Hi,

I'm trying to understand if there is any difference in correctness
between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and
rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).

I can see that there may be differences in performance, but my
expectation was that using either would result in the same behaviour.
However, that is not what I'm seeing in practice.

Specifically I have some code like:

    import json
    import pyspark

    text_lines = sc.textFile(input_files)   # sc is the SparkContext
    records = text_lines.map(json.loads)
    records.persist(pyspark.StorageLevel.MEMORY_ONLY)
    count = records.count()
    records.unpersist()

When I do not use persist at all, the 'count' variable contains the
correct value.
When I use persist with pyspark.StorageLevel.MEMORY_AND_DISK, I also
get the correct, expected value.
However, if I use persist with no argument (or with
pyspark.StorageLevel.MEMORY_ONLY), then the value of 'count' is too
small.
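
For comparison, the variant that does give me the expected count
differs only in the storage level:

    records.persist(pyspark.StorageLevel.MEMORY_AND_DISK)  # partitions that don't fit in memory go to disk rather than being dropped
    count = records.count()   # matches the count I get with no persist at all
    records.unpersist()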

In all cases the script completes without errors (or warnings, as far
as I can tell).

I'm using Spark 2.0.0 on an AWS EMR cluster.

It appears that the executors may not have enough memory to cache all
of the RDD partitions in memory, but I thought that in that case Spark
would fall back to recomputing the missing partitions from the parent
RDD, rather than returning the wrong answer.
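
To put that expectation another way, I assumed that something like
this would print the same number twice, even if some partitions never
fit in memory:

    records.persist(pyspark.StorageLevel.MEMORY_ONLY)
    print(records.count())   # some partitions may be evicted or never cached
    print(records.count())   # I expected missing partitions to be recomputed from the source files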

Is this the expected behaviour? It seems a little difficult to work
with in practice.

Cheers,

Ben
