GitHub user nikunjb opened a pull request:

    https://github.com/apache/spark/pull/22424

    [SPARK-25303][STREAMING] For checkpointed DStreams, remove the parent…

    …RememberDuration - this will be recalculated to a minimum value
    
    ## What changes were proposed in this pull request?
    
            When a DStream is checkpointed, there is no need to remember the
            parent RDDs (unless other DStreams derived from the same parent
            require them). This change sets parentRememberDuration to null for
            checkpointed DStreams. Note that the parent's remember duration is
            still recalculated during validation to its minimum value, as for
            any DStream.
            Before this change, even after the fix for SPARK-25302 that cuts the
            lineage to the parent, the parent DStreams and all of their
            ancestors were being remembered for long durations. This was
            unnecessary and resulted in input RDDs being persisted for much
            longer than needed.
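
The effect described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's implementation: `ToyStream`, its fields, and `input_remember_seconds` are hypothetical names, durations are plain integers in seconds, and the "minimum value" applied during validation is assumed here to be the stream's own slide interval.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyStream:
    """Hypothetical stand-in for a DStream; durations are whole seconds."""
    name: str
    slide: int
    checkpointed: bool = False
    parents: List["ToyStream"] = field(default_factory=list)
    remember: Optional[int] = None

    def parent_remember(self, drop_when_checkpointed: bool) -> Optional[int]:
        # Modeled change: a checkpointed stream asks nothing of its parents.
        if drop_when_checkpointed and self.checkpointed:
            return None
        return self.remember

    def request(self, duration: Optional[int], drop_when_checkpointed: bool) -> None:
        # Keep the largest duration requested so far, then pass the request upstream.
        if duration is not None and (self.remember is None or duration > self.remember):
            self.remember = duration
        for parent in self.parents:
            parent.request(self.parent_remember(drop_when_checkpointed),
                           drop_when_checkpointed)

    def validate(self) -> None:
        # Assumed minimum: every stream remembers at least one slide interval.
        if self.remember is None or self.remember < self.slide:
            self.remember = self.slide
        for parent in self.parents:
            parent.validate()


def input_remember_seconds(drop_when_checkpointed: bool) -> int:
    # One input stream feeding one checkpointed stream that must remember 120 s.
    source = ToyStream("input", slide=10)
    reduced = ToyStream("reduced", slide=10, checkpointed=True, parents=[source])
    reduced.request(120, drop_when_checkpointed)
    reduced.validate()
    return source.remember


print(input_remember_seconds(False))  # 120: the parent inherits the long duration
print(input_remember_seconds(True))   # 10: the parent falls back to its minimum
```

In this model the input stream's remember duration drops from 120 seconds to its 10-second minimum once the checkpointed stream stops forwarding its own duration. If a second, non-checkpointed child requested a longer duration from the same parent, the parent would still honor it.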
    
            Please see the post below for the original issue and a reply to it
            about the resolution:

            http://apache-spark-user-list.1001560.n3.nabble.com/DStream-reduceByKeyAndWindow-not-using-checkpointed-data-for-inverse-reducing-old-data-td33332.html

            A separate patch is being created for the issue in SPARK-25302.
    
    ## How was this patch tested?
    
            Ran the existing unit tests.
    
    ## What improvement does this patch make?
    
            When combined with the fix in SPARK-25302, this patch unpersists
            the intermediate RDDs and input DStreams that were being remembered
            for too long, resulting in much lower memory usage on the executors.
            This was observed from differences in the DAG chart and from data
            in the Storage tab of the driver UI.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nikunjb/spark spark-25303

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22424.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22424
    
----
commit 578c4c6beec4f9e26d9f0fed4f0083336ac498a7
Author: Nikunj Bansal <nikunj.bansal@...>
Date:   2018-09-14T22:34:47Z

    [SPARK-25303][STREAMING] For checkpointed DStreams, remove the parentRememberDuration - this will be recalculated to a minimum value
    

----


