Hi devs, In Spark structured streaming, we need state store for state management for stateful operators such streaming aggregates, joins, etc. We have one and only one state store implementation now. It is in-memory hashmap which was backed up in HDFS complaint file system at the end of every micro-batch.
As it basically uses in-memory map to store states, memory consumption is a serious issue and state store size is limited by the size of the executor memory. Moreover, state store using more memory means it may impact the performance of task execution that requires memory too. Internally we see more streaming applications that requires large state in stateful operations. For such requirements, we need a StateStore not rely on memory to store states. This seems to be also true externally as several other major streaming frameworks already use RocksDB for state management. RocksDB is an embedded DB and streaming engines can use it to store state instead of memory storage. So seems to me, it is proven to be good choice for large state usage. But Spark SS still lacks of a built-in state store for the requirement. Previously there was one attempt SPARK-28120 to add RocksDB StateStore into Spark SS. IIUC, it was pushed back due to two concerns: extra code maintenance cost and it introduces RocksDB dependency. For the first concern, as more users require to use the feature, it should be highly used code in SS and more developers will look at it. For second one, we propose (SPARK-34198) to add it as an external module to relieve the dependency concern. Because it was pushed back previously, I'm going to raise this discussion to know what people think about it now, in advance of submitting any code. I think there might be some possible opinions: 1. okay to add RocksDB StateStore into sql core module 2. not okay for 1, but okay to add RocksDB StateStore as external module 3. either 1 or 2 is okay 4. not okay to add RocksDB StateStore, no matter into sql core or as external module Please let us know if you have some thoughts. Thank you. Liang-Chi Hsieh -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ --------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org