Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-10 Thread Josh Rosen
Based on Ben's helpful error description, I managed to reproduce this bug
and found the root cause:

There's a bug in MemoryStore's PartiallySerializedBlock class: it doesn't
close its serialization stream before attempting to deserialize the
serialized values, so it misses any data still sitting in the serializer's
internal buffers. This can happen with KryoSerializer, which was
automatically being used here to serialize RDDs of byte arrays. I've
reported this as https://issues.apache.org/jira/browse/SPARK-17491 and have
submitted https://github.com/apache/spark/pull/15043 to fix it (I'm still
planning to add more tests to that patch).
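
To illustrate the failure mode outside of Spark (a plain-Python analogy
using io/pickle, not the actual MemoryStore/Kryo code): if the backing
buffer is read before the serialization stream has been flushed or closed,
whatever is still sitting in the stream's internal buffer is silently lost.

import io
import pickle

raw = io.BytesIO()
stream = io.BufferedWriter(raw, buffer_size=1 << 16)

for record in (b"x" * 100 for _ in range(10)):
    pickle.dump(record, stream)   # bytes may still be in the writer's buffer

premature = raw.getvalue()        # read before flush/close: data is missing
stream.flush()                    # what the fix effectively guarantees
complete = raw.getvalue()

print(len(premature), len(complete))   # premature is shorter (0 here)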

On Fri, Sep 9, 2016 at 10:37 AM Josh Rosen  wrote:

> cache() / persist() is definitely *not* supposed to affect the result of
> a program, so the behavior that you're seeing is unexpected.
>
> I'll try to reproduce this myself by caching in PySpark under heavy memory
> pressure, but in the meantime the following questions will help me to debug:
>
>- Does this only happen in Spark 2.0? Have you successfully run the
>same workload with correct behavior on an earlier Spark version, such as
>1.6.x?
>- How accurately does your example code model the structure of your
>real code? Are you calling cache()/persist() on an RDD which has been
>transformed in Python or are you calling it on an untransformed input RDD
>(such as the RDD returned from sc.textFile() / sc.hadoopFile())?
>
>
> On Fri, Sep 9, 2016 at 5:01 AM Ben Leslie  wrote:
>
>> Hi,
>>
>> I'm trying to understand if there is any difference in correctness
>> between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and
>> rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
>>
>> I can see that there may be differences in performance, but my
>> expectation was that using either would result in the same behaviour.
>> However, that is not what I'm seeing in practice.
>>
>> Specifically I have some code like:
>>
>> text_lines = sc.textFile(input_files)
>> records = text_lines.map(json.loads)
>> records.persist(pyspark.StorageLevel.MEMORY_ONLY)
>> count = records.count()
>> records.unpersist()
>>
>> When I do not use persist at all, the 'count' variable contains the
>> correct value.
>> When I use persist with pyspark.StorageLevel.MEMORY_AND_DISK, I also
>> get the correct, expected value.
>> However, if I use persist with no argument (or
>> pyspark.StorageLevel.MEMORY_ONLY) then the value of 'count' is too
>> small.
>>
>> In all cases the script completes without errors (or warnings, as far as
>> I can tell).
>>
>> I'm using Spark 2.0.0 on an AWS EMR cluster.
>>
>> It appears that the executors may not have enough memory to store all
>> the RDD partitions in memory only; however, I thought that in this case
>> Spark would fall back to recomputing the dropped partitions from the
>> parent RDD rather than returning the wrong answer.
>>
>> Is this the expected behaviour? It seems a little difficult to work
>> with in practice.
>>
>> Cheers,
>>
>> Ben
>>


Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Josh Rosen
cache() / persist() is definitely *not* supposed to affect the result of a
program, so the behavior that you're seeing is unexpected.

I'll try to reproduce this myself by caching in PySpark under heavy memory
pressure, but in the meantime the following questions will help me to debug:

   - Does this only happen in Spark 2.0? Have you successfully run the same
   workload with correct behavior on an earlier Spark version, such as 1.6.x?
   - How accurately does your example code model the structure of your real
   code? Are you calling cache()/persist() on an RDD which has been
   transformed in Python or are you calling it on an untransformed input RDD
   (such as the RDD returned from sc.textFile() / sc.hadoopFile())?
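
For concreteness, the two cases in that last question would look roughly
like this (a hypothetical PySpark sketch, assuming an existing SparkContext
sc as in your snippet; the paths are made up):

import json
import pyspark

# Case 1: persist an untransformed input RDD straight from sc.textFile()
raw_lines = sc.textFile("s3://some-bucket/input/*.json")    # hypothetical path
raw_lines.persist(pyspark.StorageLevel.MEMORY_ONLY)

# Case 2: persist an RDD that has already been transformed in Python
parsed = sc.textFile("s3://some-bucket/input/*.json").map(json.loads)
parsed.persist(pyspark.StorageLevel.MEMORY_ONLY)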


On Fri, Sep 9, 2016 at 5:01 AM Ben Leslie  wrote:

> Hi,
>
> I'm trying to understand if there is any difference in correctness
> between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and
> rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
>
> I can see that there may be differences in performance, but my
> expectation was that using either would result in the same behaviour.
> However, that is not what I'm seeing in practice.
>
> Specifically I have some code like:
>
> text_lines = sc.textFile(input_files)
> records = text_lines.map(json.loads)
> records.persist(pyspark.StorageLevel.MEMORY_ONLY)
> count = records.count()
> records.unpersist()
>
> When I do not use persist at all, the 'count' variable contains the
> correct value.
> When I use persist with pyspark.StorageLevel.MEMORY_AND_DISK, I also
> get the correct, expected value.
> However, if I use persist with no argument (or
> pyspark.StorageLevel.MEMORY_ONLY) then the value of 'count' is too
> small.
>
> In all cases the script completes without errors (or warnings, as far as
> I can tell).
>
> I'm using Spark 2.0.0 on an AWS EMR cluster.
>
> It appears that the executors may not have enough memory to store all
> the RDD partitions in memory only; however, I thought that in this case
> Spark would fall back to recomputing the dropped partitions from the
> parent RDD rather than returning the wrong answer.
>
> Is this the expected behaviour? It seems a little difficult to work
> with in practice.
>
> Cheers,
>
> Ben
>


pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK

2016-09-09 Thread Ben Leslie
Hi,

I'm trying to understand if there is any difference in correctness
between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and
rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).

I can see that there may be differences in performance, but my
expectation was that using either would result in the same behaviour.
However, that is not what I'm seeing in practice.

Specifically I have some code like:

text_lines = sc.textFile(input_files)
records = text_lines.map(json.loads)
records.persist(pyspark.StorageLevel.MEMORY_ONLY)
count = records.count()
records.unpersist()

When I do not use persist at all, the 'count' variable contains the
correct value.
When I use persist with pyspark.StorageLevel.MEMORY_AND_DISK, I also
get the correct, expected value.
However, if I use persist with no argument (or
pyspark.StorageLevel.MEMORY_ONLY) then the value of 'count' is too
small.

In all cases the script completes without errors (or warnings, as far as
I can tell).

I'm using Spark 2.0.0 on an AWS EMR cluster.

It appears that the executors may not have enough memory to store all
the RDD partitions in memory only; however, I thought that in this case
Spark would fall back to recomputing the dropped partitions from the
parent RDD rather than returning the wrong answer.
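
For reference, a minimal way to see the discrepancy side by side (a sketch
along the lines of the snippet above, assuming the same sc and input_files)
is:

import json
import pyspark

text_lines = sc.textFile(input_files)
records = text_lines.map(json.loads)

uncached_count = records.count()         # correct without any persist

records.persist(pyspark.StorageLevel.MEMORY_ONLY)
cached_count = records.count()           # comes back too small for me
records.unpersist()

print(uncached_count, cached_count)      # I would expect these to match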

Is this the expected behaviour? It seems a little difficult to work
with in practice.

Cheers,

Ben




Re: MEMORY_ONLY vs MEMORY_AND_DISK

2015-03-18 Thread Prannoy
It depends. If the data on which the calculation is to be done is very
large, then caching it with MEMORY_AND_DISK is useful, especially when the
computation behind the RDD is expensive. If the computation is cheap, then
MEMORY_ONLY can be used even for large data sets. But if the data is small,
then MEMORY_ONLY is obviously the best option.
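
A rough PySpark sketch of the two choices (a generic example, assuming an
existing SparkContext sc; the path is made up):

import pyspark

# Small data, or data that is cheap to recompute: keep it purely in memory.
small_rdd = sc.parallelize(range(1000))
small_rdd.persist(pyspark.StorageLevel.MEMORY_ONLY)

# Large data behind an expensive computation: spill partitions that do not
# fit in memory to disk instead of recomputing them.
big_rdd = sc.textFile("hdfs:///some/big/dataset")      # hypothetical path
expensive = big_rdd.map(lambda line: line.split(","))
expensive.persist(pyspark.StorageLevel.MEMORY_AND_DISK)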

On Thu, Mar 19, 2015 at 2:35 AM, sergunok [via Apache Spark User List] 
ml-node+s1001560n22130...@n3.nabble.com wrote:

 What persistence level is better if the RDD to be cached is expensive to
 recalculate?
 Am I right that it is MEMORY_AND_DISK?







MEMORY_ONLY vs MEMORY_AND_DISK

2015-03-18 Thread sergunok
What persistence level is better if the RDD to be cached is expensive to
recalculate?
Am I right that it is MEMORY_AND_DISK?



