Hi,

I'm trying to understand if there is any difference in correctness
between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and
rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).

I can see that there may be differences in performance, but my
expectation was that using either would result in the same behaviour.
However, that is not what I'm seeing in practice.

Specifically I have some code like:

    import json
    import pyspark

    text_lines = sc.textFile(input_files)   # sc is the SparkContext
    records = text_lines.map(json.loads)
    records.persist(pyspark.StorageLevel.MEMORY_ONLY)
    count = records.count()
    records.unpersist()

When I do not use persist at all, the 'count' variable contains the
correct value.
When I use persist with pyspark.StorageLevel.MEMORY_AND_DISK, I also
get the correct, expected value.
However, if I use persist with no argument (or with
pyspark.StorageLevel.MEMORY_ONLY), then the value of 'count' is too
small.
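
For comparison, the variant that does give me the expected count
differs only in the storage level:

    records.persist(pyspark.StorageLevel.MEMORY_AND_DISK)  # partitions that don't fit in memory go to disk rather than being dropped
    count = records.count()   # matches the count I get with no persist at all
    records.unpersist()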

In all cases the script completes without errors (or warnings, as far
as I can tell).

I'm using Spark 2.0.0 on an AWS EMR cluster.

It appears that the executors may not have enough memory to cache all
of the RDD partitions in memory, but I thought that in that case Spark
would fall back to recomputing the missing partitions from the parent
RDD, rather than returning the wrong answer.
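
To put that expectation another way, I assumed that something like
this would print the same number twice, even if some partitions never
fit in memory:

    records.persist(pyspark.StorageLevel.MEMORY_ONLY)
    print(records.count())   # some partitions may be evicted or never cached
    print(records.count())   # I expected missing partitions to be recomputed from the source files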

Is this the expected behaviour? It seems a little difficult to work
with in practice.

Cheers,

Ben
