Hi, I'm "okay to add RocksDB StateStore as external module". See no reason not to.
Pozdrawiam, Jacek Laskowski ---- https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh <vii...@gmail.com> wrote: > Hi devs, > > In Spark structured streaming, we need state store for state management for > stateful operators such streaming aggregates, joins, etc. We have one and > only one state store implementation now. It is in-memory hashmap which was > backed up in HDFS complaint file system at the end of every micro-batch. > > As it basically uses in-memory map to store states, memory consumption is a > serious issue and state store size is limited by the size of the executor > memory. Moreover, state store using more memory means it may impact the > performance of task execution that requires memory too. > > Internally we see more streaming applications that requires large state in > stateful operations. For such requirements, we need a StateStore not rely > on > memory to store states. > > This seems to be also true externally as several other major streaming > frameworks already use RocksDB for state management. RocksDB is an embedded > DB and streaming engines can use it to store state instead of memory > storage. > > So seems to me, it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > > Previously there was one attempt SPARK-28120 to add RocksDB StateStore into > Spark SS. IIUC, it was pushed back due to two concerns: extra code > maintenance cost and it introduces RocksDB dependency. > > For the first concern, as more users require to use the feature, it should > be highly used code in SS and more developers will look at it. For second > one, we propose (SPARK-34198) to add it as an external module to relieve > the > dependency concern. > > Because it was pushed back previously, I'm going to raise this discussion > to > know what people think about it now, in advance of submitting any code. > > I think there might be some possible opinions: > > 1. okay to add RocksDB StateStore into sql core module > 2. not okay for 1, but okay to add RocksDB StateStore as external module > 3. either 1 or 2 is okay > 4. not okay to add RocksDB StateStore, no matter into sql core or as > external module > > Please let us know if you have some thoughts. > > Thank you. > > Liang-Chi Hsieh > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > --------------------------------------------------------------------- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >