GitHub user HeartSaVioR opened a pull request: https://github.com/apache/spark/pull/21700
SPARK-24717 Split out min retain version of state for memory in HDFSBackedStateStoreProvider

## What changes were proposed in this pull request?

This patch proposes splitting the configuration for how many batches of state to retain into two pieces: files and in memory (cache). While this patch reuses the existing configuration for files, it introduces a new configuration, "spark.sql.streaming.maxBatchesToRetainInMemory", to configure the maximum number of batches to retain in memory.

This patch also introduces BoundedSortedMap, which retains at most the first N elements (sorted by key) and can be leveraged for loadedMaps in HDFSBackedStateStoreProvider.

## How was this patch tested?

Applied this patch on top of SPARK-24441 (https://github.com/apache/spark/pull/21469) and manually tested to ensure the overall size of state is around 2x or less, instead of 10x ~ 80x, across various workloads.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HeartSaVioR/spark SPARK-24717

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21700.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21700

----
commit 22f0e220f661b5457584ef83b1ecddc18212fa73
Author: Jungtaek Lim <kabhwan@...>
Date:   2018-07-02T22:04:49Z

    SPARK-24717 Split out min retain version of state for memory in HDFSBackedStateStoreProvider

    * introduce BoundedSortedMap which implements bounded size of sorted map
    * only first N elements will be retained
    * replace loadedMaps to BoundedSortedMap to retain only N versions of states
    * no need to cleanup in maintenance phase
    * introduce new configuration: spark.sql.streaming.minBatchesToRetainInMemory

----
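
For readers unfamiliar with the idea, the bounded sorted map described above can be sketched roughly as follows. This is a minimal illustration only, not the code from the patch: the constructor signature, the eviction strategy, and the choice of a reverse-order comparator (so that the "first N" keys are the newest N versions) are all assumptions made for the example.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical sketch: a sorted map that keeps at most `limit` entries,
 * namely the first ones in comparator order. Not the patch's actual class.
 */
final class BoundedSortedMap<K, V> extends TreeMap<K, V> {
    private final int limit;

    BoundedSortedMap(Comparator<? super K> comparator, int limit) {
        super(comparator);
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
    }

    @Override
    public V put(K key, V value) {
        V previous = super.put(key, value);
        // Evict from the tail so only the first `limit` keys survive.
        while (size() > limit) {
            pollLastEntry();
        }
        return previous;
    }

    @Override
    public void putAll(Map<? extends K, ? extends V> entries) {
        // Route every entry through put() so the bound is always enforced.
        // Other mutators (putIfAbsent, merge, ...) are not bounded in this sketch.
        for (Map.Entry<? extends K, ? extends V> e : entries.entrySet()) {
            put(e.getKey(), e.getValue());
        }
    }
}
```

With a reverse-order comparator over version numbers and a limit of 2, inserting versions 1, 2 and 3 leaves only versions 3 and 2 in the map, and inserting an older version afterwards is dropped immediately. Because eviction happens on insert, a map like this would need no separate cleanup pass, which matches the "no need to cleanup in maintenance phase" note in the commit message.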