GitHub user nikunjb opened a pull request: https://github.com/apache/spark/pull/22424
[SPARK-25303][STREAMING] For checkpointed DStreams, remove the parent⦠â¦RemeberDuration - this will be recalculated to a minimum value ## How was this patch tested? ## What changes were proposed in this pull request? When a DStream gets checkpointed, there is no need to remember parentRDDs (unless indicated by other DStreams from that parent). This change sets the parentRememberDuration to null for checkpointed DStreams. Please note that this does get recalculated during validation to a minimum value in the parent as expected for any DStream as usual. Before this change, even after the fix for SPARK-25302 that cut the lineage to the parent, the parentDStreams and all the ones before that were being remembered for long durations. This was unnecessary and resulted in input RDDs being persisted for much longer than needed. Please see post below for original issue and a reply to it about resolution http://apache-spark-user-list.1001560.n3.nabble.com/DStream-reduceByKeyAndWindow-not-using-checkpointed-data-for-inverse-reducing-old-data-td33332.html A separate patch is being created for the issue in SPARK-25302 ## How was this patch tested? Running the existing Unit tests. ## What improvement does this patch make? When combined with the fix in SPARK-25302, unpersits the intermediate RDDs and Input DStreams that were being remembered for too long, resulting in much lower memory usage on the Executors. Observed from the DAG chart differences and data from the Storage tab on the Driver UI. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nikunjb/spark spark-25303 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22424.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22424 ---- commit 578c4c6beec4f9e26d9f0fed4f0083336ac498a7 Author: Nikunj Bansal <nikunj.bansal@...> Date: 2018-09-14T22:34:47Z [SPARK-25303][STREAMING] For checkpointed DStreams, remove the parentRemeberDuration - this will be recalculated to a minimum value ## What changes were proposed in this pull request? When a DStream gets checkpointed, there is no need to remember parentRDDs (unless indicated by other DStreams from that parent). This change sets the parentRememberDuration to null for checkpointed DStreams. Please note that this does get recalculated during validation to a minimum value in the parent as expected for any DStream as usual. Before this change, even after the fix for SPARK-25302 that cut the lineage to the parent, the parentDStreams and all the ones before that were being remembered for long durations. This was unnecessary and resulted in input RDDs being persisted for much longer than needed. Please see post below for original issue and a reply to it about resolution: http://apache-spark-user-list.1001560.n3.nabble.com/DStream-reduceByKeyAndWindow-not-using-checkpointed-data-for-inverse-reducing-old-data-td33332.html A separate patch is being created for the issue in SPARK-25302 ## How was this patch tested? Running the existing Unit tests. ## What improvement does this patch make? When combined with the fix in SPARK-25302, unpersits the intermediate RDDs and Input DStreams that were being remembered for too long, resulting in much lower memory usage on the Executors. Observed from the DAG chart differences and data from the Storage tab on the Driver UI. ---- --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org