Michael, I have listed use cases above. Should we proceed with a design doc?
Best,
Stavros

On Mon, Mar 18, 2019 at 12:21 PM Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:

> Not really. If we agree that we want this, I can put together a design
> document and take it from there. There was also a discussion in another
> thread about adding RocksDB as a memory storage that is related to this task.
>
> Best,
> Stavros
>
> On Sun, Mar 17, 2019 at 4:42 AM kant kodali <kanth...@gmail.com> wrote:
>
>> Any update on this?
>>
>> On Wed, Oct 24, 2018 at 4:26 PM Arun Mahadevan <ar...@apache.org> wrote:
>>
>>> I don't think a separate API or RPCs etc. are necessary for queryable
>>> state if the state can be exposed as just another data source. Then SQL
>>> queries can be issued against it just like SQL queries against any
>>> other data source.
>>>
>>> For now I think the "memory" sink could be used as a sink, with queries
>>> run against it, but I agree it does not scale for large states.
>>>
>>> On Sun, 21 Oct 2018 at 21:24, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>
>>>> It doesn't seem Spark has workarounds other than storing output in
>>>> external storage, so +1 on having this.
>>>>
>>>> My major concern about implementing queryable state in Structured
>>>> Streaming is "Are all states available on executors at any time while the
>>>> query is running?" Querying state shouldn't affect the running query. Given
>>>> that state is huge and the default state provider loads state in memory, we
>>>> may not want to load one more redundant snapshot of state: we want to always
>>>> load the "current state" which the query is also using. (For sure, queryable
>>>> state should be read-only.)
>>>>
>>>> Regarding improvement of local state, I guess it is ideal to leverage
>>>> an embedded DB, like Kafka and Flink are doing. The difference will be not
>>>> only reading state from non-heap, but also how to take a snapshot and
>>>> store the delta.
>>>> We may want to check that snapshotting works well with a small batch
>>>> interval, and find an alternative approach when it doesn't. It sounds like
>>>> a huge item that can be handled separately.
>>>>
>>>> - Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Sat, Dec 9, 2017 at 10:51 PM, Stavros Kontopoulos <st.kontopou...@gmail.com> wrote:
>>>>
>>>>> Nice, I was looking for a JIRA. I agree we should justify why we are
>>>>> building something. In that direction, here is what I have seen from my
>>>>> experience.
>>>>> People quite often use state within their streaming app and may have
>>>>> large states (TBs). Shortening the pipeline by not having to copy data (to
>>>>> Cassandra, for example, for serving) is an advantage, at least in terms of
>>>>> latency and complexity.
>>>>> This can be true if we take advantage of state checkpointing (locally could
>>>>> be RocksDB, or in general HDFS; the latter is currently supported) along
>>>>> with an API to efficiently query data.
>>>>> Some use cases I see:
>>>>>
>>>>> - real-time dashboards and real-time reporting, the faster the better
>>>>> - monitoring of state for operational reasons, app health, etc.
>>>>> - integrating with external services via an API, e.g. making
>>>>>   aggregations over time windows accessible to some third-party service
>>>>>   within your system
>>>>>
>>>>> Regarding requirements, here are some of them:
>>>>> - support of an API to expose state (could be done at the Spark
>>>>>   driver), like REST
>>>>> - supporting dynamic allocation (not sure how it affects state
>>>>>   management)
>>>>> - an efficient way to talk to executors to get the state (RPC?)
>>>>> - making local state more efficient and more easily accessible with an
>>>>>   embedded DB (I don't think this is supported from what I see, but I
>>>>>   may be wrong)
>>>>> Some people are already working with such tech and some of it could
>>>>> be re-used: https://issues.apache.org/jira/browse/SPARK-20641
>>>>>
>>>>> Best,
>>>>> Stavros
>>>>>
>>>>> On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-16738
>>>>>>
>>>>>> I don't believe anyone is working on it yet. I think the most useful
>>>>>> thing is to start enumerating requirements and use cases and then we can
>>>>>> talk about how to build it.
>>>>>>
>>>>>> On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <st.kontopou...@gmail.com> wrote:
>>>>>>
>>>>>>> Cool. Burak, do you have a pointer? Should I take the initiative for a
>>>>>>> first design document, or is Databricks working on it?
>>>>>>>
>>>>>>> Best,
>>>>>>> Stavros
>>>>>>>
>>>>>>> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Stavros,
>>>>>>>>
>>>>>>>> Queryable state is definitely on the roadmap! We will revamp the
>>>>>>>> StateStore API a bit, and a queryable StateStore is definitely one of
>>>>>>>> the things we are thinking about during that revamp.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Burak
>>>>>>>>
>>>>>>>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" <st.kontopou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Just to re-phrase my question: Would queryable state make a
>>>>>>>>> viable SPIP?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Stavros
>>>>>>>>>
>>>>>>>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <st.kontopou...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Maybe this has been discussed before. Given the fact that many
>>>>>>>>>> streaming apps out there use state extensively, could it be a good
>>>>>>>>>> idea to make Spark expose streaming state with an external API like
>>>>>>>>>> other systems do (Kafka Streams, Flink, etc.), in order to
>>>>>>>>>> facilitate interactive queries?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Stavros
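
For reference, Jungtaek's point in the thread (queryable state should be read-only and should reuse the "current state" the running query already holds, not load a second snapshot) can be illustrated with a small sketch. This is plain Python, not Spark code: ToyStateStore, commit and read_only_view are hypothetical names made up for illustration and do not correspond to Spark's StateStore API.

```python
from types import MappingProxyType


class ToyStateStore:
    """Hypothetical in-memory state store (illustration only, not Spark's StateStore)."""

    def __init__(self):
        self._state = {}   # committed state, owned by the running query
        self._version = 0  # bumped once per committed micro-batch

    def commit(self, updates):
        """Writer side: apply one micro-batch of key -> value updates."""
        self._state.update(updates)
        self._version += 1

    def read_only_view(self):
        """Reader side: return (version at call time, read-only view).

        The view wraps the live dict without copying it, so no redundant
        snapshot of the state is created for readers.
        """
        return self._version, MappingProxyType(self._state)


store = ToyStateStore()
store.commit({"user-1": 3, "user-2": 5})

version, view = store.read_only_view()
print(version, view["user-1"])  # prints: 1 3

store.commit({"user-1": 4})
print(view["user-1"])  # prints: 4 (the view tracks the current state)

try:
    view["user-1"] = 0  # readers cannot modify the state
except TypeError as err:
    print("write rejected:", err)
```

The sketch ignores everything that makes the real problem hard: state partitioned across executors, reads that are consistent with respect to a batch boundary, and state too large for memory. It only shows the read-only, no-extra-copy contract discussed above.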