+1 to adding it, whether under sql-core or as an external module. My rationale:
* The discussion thread and the voices here show strong demand for a RocksDB state store out of the box.
* There is no out-of-the-box workaround for the huge state store problem. Direct competitors among streaming frameworks have provided one for years.
* Maintenance cost is the major concern when evaluating whether to add something, but it doesn't apply here, as contributors/committers from various companies are willing to contribute.
* The Apache Bahir project is no longer actively maintained: the last release was in September 2019, based on Spark 2.4.0. We can no longer easily say "let's add it to Bahir instead".

On Tue, Feb 9, 2021 at 3:22 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> +1 for adding it either way.
>
> On Mon, 8 Feb 2021, 21:54 Holden Karau, <hol...@pigscanfly.ca> wrote:
>
>> +1 for an external module.
>>
>> On Mon, Feb 8, 2021 at 11:51 AM Cheng Su <chen...@fb.com.invalid> wrote:
>>
>>> +1 for (2) adding to external module.
>>>
>>> I think this feature is useful and popular in practice, and option 2
>>> does not conflict with the previous concern about the dependency.
>>>
>>> Thanks,
>>> Cheng Su
>>>
>>> *From: *Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> *Date: *Monday, February 8, 2021 at 10:39 AM
>>> *To: *Jacek Laskowski <ja...@japila.pl>
>>> *Cc: *Liang-Chi Hsieh <vii...@gmail.com>, dev <dev@spark.apache.org>
>>> *Subject: *Re: [DISCUSS] Add RocksDB StateStore
>>>
>>> Thank you, Liang-chi and all.
>>>
>>> +1 for (2) external module design because it can deliver the new feature
>>> in a safe way.
>>>
>>> Bests,
>>> Dongjoon
>>>
>>> On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski <ja...@japila.pl> wrote:
>>>
>>> Hi,
>>>
>>> I'm "okay to add RocksDB StateStore as external module". See no reason
>>> not to.
>>>
>>> Regards,
>>> Jacek Laskowski
>>> ----
>>> https://about.me/JacekLaskowski
>>> "The Internals Of" Online Books <https://books.japila.pl/>
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>> On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh <vii...@gmail.com> wrote:
>>>
>>> Hi devs,
>>>
>>> In Spark structured streaming, we need a state store for state management
>>> for stateful operators such as streaming aggregates, joins, etc. We have
>>> one and only one state store implementation now. It is an in-memory
>>> hashmap that is backed up to an HDFS-compliant file system at the end of
>>> every micro-batch.
>>>
>>> As it basically uses an in-memory map to store states, memory consumption
>>> is a serious issue and state store size is limited by the size of the
>>> executor memory. Moreover, a state store using more memory may impact the
>>> performance of task execution, which requires memory too.
>>>
>>> Internally we see more streaming applications that require large state in
>>> stateful operations. For such requirements, we need a StateStore that
>>> does not rely on memory to store states.
>>>
>>> This seems to be also true externally, as several other major streaming
>>> frameworks already use RocksDB for state management. RocksDB is an
>>> embedded DB and streaming engines can use it to store state instead of
>>> memory storage.
>>>
>>> So it seems to me that it is proven to be a good choice for large state
>>> usage. But Spark SS still lacks a built-in state store for the
>>> requirement.
>>>
>>> Previously there was one attempt, SPARK-28120, to add RocksDB StateStore
>>> into Spark SS. IIUC, it was pushed back due to two concerns: extra code
>>> maintenance cost and the RocksDB dependency it introduces.
>>>
>>> For the first concern, as more users need the feature, it should become
>>> heavily used code in SS, and more developers will look at it.
>>> For the second one, we propose (SPARK-34198) to add it as an external
>>> module to relieve the dependency concern.
>>>
>>> Because it was pushed back previously, I'd like to raise this discussion
>>> to learn what people think about it now, in advance of submitting any
>>> code.
>>>
>>> I think there might be some possible opinions:
>>>
>>> 1. okay to add RocksDB StateStore into sql core module
>>> 2. not okay for 1, but okay to add RocksDB StateStore as external module
>>> 3. either 1 or 2 is okay
>>> 4. not okay to add RocksDB StateStore, no matter into sql core or as
>>> external module
>>>
>>> Please let us know if you have some thoughts.
>>>
>>> Thank you.
>>>
>>> Liang-Chi Hsieh
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau