GitHub user nikunjb opened a pull request: https://github.com/apache/spark/pull/22423
[SPARK-25302][STREAMING] Checkpoint the reducedStream in ReducedWindowDStream so as to cut the lineage completely to parent

## What changes were proposed in this pull request?

DStream.reduceByKeyAndWindow() with an inverse reduce function eventually creates a ReducedWindowDStream but does not checkpoint it. Combined with the issue described in SPARK-25303, this results in the problem described in the post here: http://apache-spark-user-list.1001560.n3.nabble.com/DStream-reduceByKeyAndWindow-not-using-checkpointed-data-for-inverse-reducing-old-data-td33332.html

This change checkpoints the internal reducedStream to cut the lineage. A separate patch is being created for the issue in SPARK-25303.

## How was this patch tested?

By running the existing unit tests. A couple of existing unit test classes were written assuming they would never be serialized; SparkContext had to be declared transient in them, as in other tests, because the new checkpointing causes serialization to occur.

## What improvement does this patch make?

It cuts the lineage to the parent DStream. Combined with the fix in SPARK-25303, it unpersists the intermediate RDDs and input DStreams that were being remembered for too long, resulting in much lower memory usage on the executors. This was observed from the differences in the DAG charts and from the data on the Storage tab of the Driver UI.
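The incremental pattern behind reduceByKeyAndWindow() with an inverse reduce function, and why its lineage grows, can be illustrated outside Spark. Below is a minimal plain-Scala sketch (no Spark dependency; the object and method names are illustrative, not from the patch): each new window aggregate is derived from the previous one by reducing in the newly arrived batch and inverse-reducing out the batch that slid past the window. Each window value therefore depends on the previous one, and in the real ReducedWindowDStream that dependency chain is the RDD lineage that this patch periodically cuts with a checkpoint.

```scala
object SlidingReduce {
  // Compute per-window sums over `batches` using a window of `windowLen`
  // batches that slides by one batch per step.
  def slidingSums(batches: Seq[Int], windowLen: Int): Seq[Int] = {
    val reduce    = (agg: Int, b: Int) => agg + b // the reduce function
    val invReduce = (agg: Int, b: Int) => agg - b // its inverse

    // First window: reduce over the first `windowLen` batches directly.
    val first = batches.take(windowLen).foldLeft(0)(reduce)

    // Every later window = previous window + incoming batch - expired batch.
    // Each result depends on the previous one, so without checkpointing the
    // dependency chain grows with every slide.
    batches.drop(windowLen).zip(batches).foldLeft(Vector(first)) {
      case (acc, (incoming, expired)) =>
        acc :+ invReduce(reduce(acc.last, incoming), expired)
    }
  }

  def main(args: Array[String]): Unit = {
    // Batches 1..5, window of 3 batches -> sums 1+2+3, 2+3+4, 3+4+5
    println(slidingSums(Seq(1, 2, 3, 4, 5), 3).mkString(","))
  }
}
```

The inverse function makes each slide O(batch) instead of O(window), which is exactly why the incremental chain (and hence the need to checkpoint it) exists.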
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nikunjb/spark spark-25302

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22423.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22423

----

commit 36929ad2424861e1a66bb8c867b240e1aec8febc
Author: Nikunj Bansal <nikunj.bansal@...>
Date: 2018-09-14T22:37:36Z

    [SPARK-25302][STREAMING] Checkpoint the reducedStream in ReducedWindowDStream so as to cut the lineage completely to parent