Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21500 I agree that current cache approach may consume excessive memory unnecessarily, and that's also same to my finding in #21469. The issue is not that simple however, because in micro-batch mode, each batch should read previous version of state, otherwise it should read from file system, in worst case seeking and reading multiple files in remote file system. So previous version of state is encouraged to be available in memory. There're three cases here (please add if I'm missing here): 1. fail before commit 2. committed but batch failed afterwards 3. committed and batch succeeds. It might be better to think about all the cases.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org