Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK
Based on Ben's helpful error description, I managed to reproduce this bug and found the root cause: there's a bug in MemoryStore's PartiallySerializedBlock class. It doesn't close its serialization stream before attempting to deserialize the serialized values, causing it to miss any data stored in the serializer's internal buffers (which can happen with KryoSerializer, which was automatically being used to serialize RDDs of byte arrays).

I've reported this as https://issues.apache.org/jira/browse/SPARK-17491 and have submitted https://github.com/apache/spark/pull/15043 to fix it (I'm still planning to add more tests to that patch).

On Fri, Sep 9, 2016 at 10:37 AM Josh Rosen wrote:

> cache() / persist() is definitely *not* supposed to affect the result of
> a program, so the behavior that you're seeing is unexpected.
>
> I'll try to reproduce this myself by caching in PySpark under heavy memory
> pressure, but in the meantime the following questions will help me to debug:
>
> - Does this only happen in Spark 2.0? Have you successfully run the
>   same workload with correct behavior on an earlier Spark version, such as
>   1.6.x?
> - How accurately does your example code model the structure of your
>   real code? Are you calling cache()/persist() on an RDD which has been
>   transformed in Python, or are you calling it on an untransformed input RDD
>   (such as the RDD returned from sc.textFile() / sc.hadoopFile())?
>
> On Fri, Sep 9, 2016 at 5:01 AM Ben Leslie wrote:
>
>> Hi,
>>
>> I'm trying to understand if there is any difference in correctness
>> between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and
>> rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
>>
>> I can see that there may be differences in performance, but my
>> expectation was that using either would result in the same behaviour.
>> However, that is not what I'm seeing in practice.
>>
>> Specifically I have some code like:
>>
>>     text_lines = sc.textFile(input_files)
>>     records = text_lines.map(json.loads)
>>     records.persist(pyspark.StorageLevel.MEMORY_ONLY)
>>     count = records.count()
>>     records.unpersist()
>>
>> When I do not use persist at all, the 'count' variable contains the
>> correct value.
>> When I use persist with pyspark.StorageLevel.MEMORY_AND_DISK, I also
>> get the correct, expected value.
>> However, if I use persist with no argument (or
>> pyspark.StorageLevel.MEMORY_ONLY), then the value of 'count' is too
>> small.
>>
>> In all cases the script completes without errors (or warnings, as far
>> as I can tell).
>>
>> I'm using Spark 2.0.0 on an AWS EMR cluster.
>>
>> It appears that the executors may not have enough memory to store all
>> the RDD partitions in memory only; however, I thought in this case it
>> would fall back to regenerating them from the parent RDD, rather than
>> providing the wrong answer.
>>
>> Is this the expected behaviour? It seems a little difficult to work
>> with in practice.
>>
>> Cheers,
>>
>> Ben
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
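The root cause above (data stranded in a serializer's internal buffers because the stream was never closed) can be illustrated with plain Python buffered I/O. This is a standalone sketch of the failure mode only, not Spark's actual MemoryStore/PartiallySerializedBlock code:

```python
import io

# Write a few small records through a buffered writer, then read the
# backing store *before* the writer has been flushed/closed. The records
# are still sitting in the writer's internal buffer, so the backing store
# appears empty -- analogous to deserializing a partially serialized
# block before closing the serialization stream.
backing = io.BytesIO()
writer = io.BufferedWriter(backing, buffer_size=1 << 16)

records = [b"record-%d\n" % i for i in range(3)]
for r in records:
    writer.write(r)

premature = backing.getvalue()   # read before flush: buffered data is missing
writer.flush()
complete = backing.getvalue()    # read after flush: all records present

print(len(premature), len(complete))
```

Here `premature` is empty even though three records were "written", which mirrors how the unclosed stream caused records to be silently dropped from the count.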
Re: pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK
cache() / persist() is definitely *not* supposed to affect the result of a program, so the behavior that you're seeing is unexpected.

I'll try to reproduce this myself by caching in PySpark under heavy memory pressure, but in the meantime the following questions will help me to debug:

- Does this only happen in Spark 2.0? Have you successfully run the same workload with correct behavior on an earlier Spark version, such as 1.6.x?
- How accurately does your example code model the structure of your real code? Are you calling cache()/persist() on an RDD which has been transformed in Python, or are you calling it on an untransformed input RDD (such as the RDD returned from sc.textFile() / sc.hadoopFile())?

On Fri, Sep 9, 2016 at 5:01 AM Ben Leslie wrote:

> Hi,
>
> I'm trying to understand if there is any difference in correctness
> between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and
> rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).
>
> I can see that there may be differences in performance, but my
> expectation was that using either would result in the same behaviour.
> However, that is not what I'm seeing in practice.
>
> Specifically I have some code like:
>
>     text_lines = sc.textFile(input_files)
>     records = text_lines.map(json.loads)
>     records.persist(pyspark.StorageLevel.MEMORY_ONLY)
>     count = records.count()
>     records.unpersist()
>
> When I do not use persist at all, the 'count' variable contains the
> correct value.
> When I use persist with pyspark.StorageLevel.MEMORY_AND_DISK, I also
> get the correct, expected value.
> However, if I use persist with no argument (or
> pyspark.StorageLevel.MEMORY_ONLY), then the value of 'count' is too
> small.
>
> In all cases the script completes without errors (or warnings, as far
> as I can tell).
>
> I'm using Spark 2.0.0 on an AWS EMR cluster.
>
> It appears that the executors may not have enough memory to store all
> the RDD partitions in memory only; however, I thought in this case it
> would fall back to regenerating them from the parent RDD, rather than
> providing the wrong answer.
>
> Is this the expected behaviour? It seems a little difficult to work
> with in practice.
>
> Cheers,
>
> Ben
pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK
Hi,

I'm trying to understand if there is any difference in correctness between rdd.persist(pyspark.StorageLevel.MEMORY_ONLY) and rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK).

I can see that there may be differences in performance, but my expectation was that using either would result in the same behaviour. However, that is not what I'm seeing in practice.

Specifically I have some code like:

    text_lines = sc.textFile(input_files)
    records = text_lines.map(json.loads)
    records.persist(pyspark.StorageLevel.MEMORY_ONLY)
    count = records.count()
    records.unpersist()

When I do not use persist at all, the 'count' variable contains the correct value.
When I use persist with pyspark.StorageLevel.MEMORY_AND_DISK, I also get the correct, expected value.
However, if I use persist with no argument (or pyspark.StorageLevel.MEMORY_ONLY), then the value of 'count' is too small.

In all cases the script completes without errors (or warnings, as far as I can tell).

I'm using Spark 2.0.0 on an AWS EMR cluster.

It appears that the executors may not have enough memory to store all the RDD partitions in memory only; however, I thought in this case it would fall back to regenerating them from the parent RDD, rather than providing the wrong answer.

Is this the expected behaviour? It seems a little difficult to work with in practice.

Cheers,

Ben
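The fallback behaviour Ben expects can be sketched with a toy model in plain Python (not Spark internals): under MEMORY_ONLY semantics, a partition that doesn't fit in the cache is simply not kept, and a correct count recomputes it from the parent data instead of silently dropping it. The function and parameter names below are illustrative, not any Spark API:

```python
def count_with_memory_only_cache(partitions, compute, capacity):
    """Count records, caching at most `capacity` computed partitions.

    Partitions that don't fit are recomputed from `compute` on demand,
    so the total is exact regardless of cache capacity.
    """
    cache = {}
    total = 0
    for pid in partitions:
        if pid in cache:
            records = cache[pid]
        else:
            records = compute(pid)        # recompute on cache miss
            if len(cache) < capacity:     # MEMORY_ONLY: keep only what fits
                cache[pid] = records
        total += len(records)
    return total

parts = range(4)
compute = lambda pid: ["rec-%d-%d" % (pid, i) for i in range(10)]

# Even with room for only 2 of the 4 partitions, the count must be exact.
print(count_with_memory_only_cache(parts, compute, capacity=2))  # 40
```

The bug in the thread amounts to this invariant being violated: an evicted-or-misstored partition contributed fewer records than `compute` would regenerate.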
Re: MEMORY_ONLY vs MEMORY_AND_DISK
It depends. If the data on which the calculation is to be done is very large, then caching it with MEMORY_AND_DISK is useful; even in that case, MEMORY_AND_DISK pays off mainly when the computation on the RDD is expensive. If the computation is very cheap, then even for large data sets MEMORY_ONLY can be used. And if the data size is small, then MEMORY_ONLY is obviously the best option.

On Thu, Mar 19, 2015 at 2:35 AM, sergunok [via Apache Spark User List] wrote:

> What persistence level is better if the RDD to be cached is expensive to
> recalculate? Am I right that it is MEMORY_AND_DISK?
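The rule of thumb above can be written out as a tiny decision helper. This is purely illustrative (the function name, inputs, and returned level names are mine, not a Spark API); in real code you would pass the chosen level to rdd.persist():

```python
def choose_storage_level(data_fits_in_memory, recompute_is_expensive):
    """Encode the heuristic: spill to disk only when the data is too big
    to cache fully AND recomputing evicted partitions would be costly."""
    if data_fits_in_memory:
        # Everything fits: no spilling needed, memory is fastest.
        return "MEMORY_ONLY"
    if recompute_is_expensive:
        # Too big to fit and costly to rebuild: pay the disk I/O.
        return "MEMORY_AND_DISK"
    # Too big to fit but cheap to rebuild: spilling buys little, so let
    # evicted partitions be recomputed from the parent RDD instead.
    return "MEMORY_ONLY"

print(choose_storage_level(data_fits_in_memory=False,
                           recompute_is_expensive=True))  # MEMORY_AND_DISK
```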
MEMORY_ONLY vs MEMORY_AND_DISK
What persistence level is better if the RDD to be cached is expensive to recalculate? Am I right that it is MEMORY_AND_DISK?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org