Hi Datta,

Thanks for the reply.

If I haven't cached any RDD and the data loaded into memory after some operations exceeds the available memory, how does Spark handle it? Are previously loaded RDDs removed from memory to free it for subsequent steps in the DAG?

I am running into an issue where my DAG is very long, all the data does not fit into memory, and at some point all my executors get lost.

On Friday 23 September 2016 12:02 PM, Datta Khot wrote:
Hi Aditya,

If you cache the RDDs - like textFile.cache(), textFile1.cache() - then Spark will not load the data again from the file system.

Once you are done with the related operations, it is recommended to uncache (unpersist) the RDDs to manage memory efficiently and avoid exhausting it.

Note that cache() keeps data in main memory (MEMORY_ONLY), while persist() lets you choose a storage level, including spilling to disk (e.g. MEMORY_AND_DISK).
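
For example, a rough sketch of that pattern on the two files from the example below. The map step that keys each line on its first comma-separated field is only an assumption so that the join compiles, since join works on key-value RDDs:

val emp  = sc.textFile("/user/emp.txt").map(line => (line.split(",")(0), line)).cache()
val emp1 = sc.textFile("/user/emp1.txt").map(line => (line.split(",")(0), line)).cache()

val joined = emp.join(emp1)             // inputs are read from HDFS only once
joined.saveAsTextFile("/home/output")   // first action
val count = joined.count()              // second action; inputs come from the cache

// release storage memory once the RDDs are no longer needed
emp.unpersist()
emp1.unpersist()

Note that only the cached inputs are reused here; the join itself may still be recomputed for the second action unless you also cache the joined RDD.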

Datta
https://in.linkedin.com/in/datta-khot-240b544
http://www.datasherpa.io/

On Fri, Sep 23, 2016 at 10:23 AM, Aditya <aditya.calangut...@augmentiq.co.in> wrote:

    Thanks for the reply.

    One more question.
    How does Spark handle data if it does not fit in memory? The answer
    I got is that it spills the data to disk and thus handles the
    memory issue.
    Also, in the example below:
    val textFile = sc.textFile("/user/emp.txt")
    val textFile1 = sc.textFile("/user/emp1.txt")
    val join = textFile.join(textFile1)
    join.saveAsTextFile("/home/output")
    val count = join.count()

    When the first action is performed, it loads textFile and
    textFile1 into memory, performs the join and saves the result.
    But when the second action (count) is called, does it again load
    textFile and textFile1 into memory and perform the join again?
    If it does load them again, what is the correct way to prevent it
    from reloading the same data?
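
    (For reference, a rough sketch of the caching approach Datta
    describes above, applied to this example; the keying step is only
    assumed so that the join compiles, since join needs key-value RDDs.
    Caching the join result lets the second action reuse it instead of
    reloading and re-joining the inputs.)

    val emp  = sc.textFile("/user/emp.txt").map(line => (line.split(",")(0), line))
    val emp1 = sc.textFile("/user/emp1.txt").map(line => (line.split(",")(0), line))
    val join = emp.join(emp1).cache()     // computed once, kept in memory
    join.saveAsTextFile("/home/output")   // triggers the read and the join
    val count = join.count()              // reuses the cached join result
    join.unpersist()                      // free the storage memory when done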


    On Thursday 22 September 2016 11:12 PM, Mich Talebzadeh wrote:
    Hi,

    unpersist works on storage memory, not execution memory. So I do
    not think you can flush it out of memory if you have not cached it
    first, using cache() or something like the snippet below.

    s.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)  // equivalent to s.cache()

    s.unpersist()  // releases the storage memory held by s

    I believe recent versions of Spark deploy a Least Recently
    Used (LRU) mechanism to flush unused data out of memory, much
    like RDBMS cache management. I know LLDAP does that.

    HTH



    Dr Mich Talebzadeh

    LinkedIn
    https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

    http://talebzadehmich.wordpress.com


    *Disclaimer:* Use it at your own risk. Any and all responsibility
    for any loss, damage or destruction of data or any other
    property which may arise from relying on this
    email's technical content is explicitly disclaimed. The author
    will in no case be liable for any monetary damages arising from
    such loss, damage or destruction.


    On 22 September 2016 at 18:09, Hanumath Rao Maduri
    <hanu....@gmail.com> wrote:

        Hello Aditya,

        After an intermediate action has been applied, you might want
        to call rdd.unpersist() to let Spark know that this RDD is
        no longer required.

        Thanks,
        -Hanu

        On Thu, Sep 22, 2016 at 7:54 AM, Aditya
        <aditya.calangut...@augmentiq.co.in> wrote:

            Hi,

            Suppose I have two RDDs
            val textFile = sc.textFile("/user/emp.txt")
            val textFile1 = sc.textFile("/user/emp1.txt")

            Later I perform a join operation on above two RDDs
            val join = textFile.join(textFile1)

            And there are subsequent transformations that do not use
            textFile and textFile1 any further, followed by an action
            to start the execution.

            When the action is called, textFile and textFile1 will be
            loaded into memory first. Later the join will be performed
            and kept in memory.
            My question is: once the join is in memory and is used for
            subsequent execution, what happens to the textFile and
            textFile1 RDDs? Are they still kept in memory until the
            full lineage graph is completed, or are they destroyed once
            their use is over? If they are kept in memory, is there any
            way I can explicitly remove them from memory to free it?





            