Any update on this? On Wed, Oct 24, 2018 at 4:26 PM Arun Mahadevan <ar...@apache.org> wrote:
> I don't think separate API or RPCs etc might be necessary for queryable > state if the state can be exposed as just another datasource. Then the sql > queries can be issued against it just like executing sql queries against > any other data source. > > For now I think the "memory" sink could be used as a sink and run queries > against it but I agree it does not scale for large states. > > On Sun, 21 Oct 2018 at 21:24, Jungtaek Lim <kabh...@gmail.com> wrote: > >> It doesn't seem Spark has workarounds other than storing output into >> external storages, so +1 on having this. >> >> My major concern on implementing queryable state in structured streaming >> is "Are all states available on executors at any time while query is >> running?" Querying state shouldn't affect the running query. Given that >> state is huge and default state provider is loading state in memory, we may >> not want to load one more redundant snapshot of state: we want to always >> load "current state" which query is also using. (For sure, Queryable state >> should be read-only.) >> >> Regarding improvement of local state, I guess it is ideal to leverage >> embedded db, like Kafka and Flink are doing. The difference will not be >> only reading state from non-heap, but also how to take a snapshot and store >> delta. We may want to check snapshotting works well with small batch >> interval, and find alternative approach when it doesn't. Sounds like it is >> a huge item and can be handled individually. >> >> - Jungtaek Lim (HeartSaVioR) >> >> 2017년 12월 9일 (토) 오후 10:51, Stavros Kontopoulos <st.kontopou...@gmail.com>님이 >> 작성: >> >>> Nice I was looking for a jira. So I agree we should justify why we are >>> building something. Now to that direction here is what I have seen from my >>> experience. >>> People quite often use state within their streaming app and may have >>> large states (TBs). Shortening the pipeline by not having to copy data (to >>> Cassandra for example for serving) is an advantage, in terms of at least >>> latency and complexity. >>> This can be true if we advantage of state checkpointing (locally could >>> be RocksDB or in general HDFS the latter is currently supported) along >>> with an API to efficiently query data. >>> Some use cases I see: >>> >>> - real-time dashboards and real-time reporting, the faster the better >>> - monitoring of state for operational reasons, app health etc... >>> - integrating with external services via an API eg. making accessible >>> aggregations over time windows to some third party service within your >>> system >>> >>> Regarding requirements here are some of them: >>> - support of an API to expose state (could be done at the spark driver), >>> like rest. >>> - supporting dynamic allocation (not sure how it affects state >>> management) >>> - an efficient way to talk to executors to get the state (rpc?) >>> - making local state more efficient and easier accessible with an >>> embedded db (I dont think this is supported from what I see, maybe wrong)? >>> Some people are already working with such techs and some stuff could be >>> re-used: https://issues.apache.org/jira/browse/SPARK-20641 >>> >>> Best, >>> Stavros >>> >>> >>> On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust < >>> mich...@databricks.com> wrote: >>> >>>> https://issues.apache.org/jira/browse/SPARK-16738 >>>> >>>> I don't believe anyone is working on it yet. I think the most useful >>>> thing is to start enumerating requirements and use cases and then we can >>>> talk about how to build it. >>>> >>>> On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos < >>>> st.kontopou...@gmail.com> wrote: >>>> >>>>> Cool Burak do you have a pointer, should I take the initiative for a >>>>> first design document or Databricks is working on it? >>>>> >>>>> Best, >>>>> Stavros >>>>> >>>>> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz <brk...@gmail.com> wrote: >>>>> >>>>>> Hi Stavros, >>>>>> >>>>>> Queryable state is definitely on the roadmap! We will revamp the >>>>>> StateStore API a bit, and a queryable StateStore is definitely one of the >>>>>> things we are thinking about during that revamp. >>>>>> >>>>>> Best, >>>>>> Burak >>>>>> >>>>>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" < >>>>>> st.kontopou...@gmail.com> wrote: >>>>>> >>>>>>> Just to re-phrase my question: Would query-able state make a viable >>>>>>> SPIP? >>>>>>> >>>>>>> Regards, >>>>>>> Stavros >>>>>>> >>>>>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos < >>>>>>> st.kontopou...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Maybe this has been discussed before. Given the fact that many >>>>>>>> streaming apps out there use state extensively, could be a good idea to >>>>>>>> make Spark expose streaming state with an external API like other >>>>>>>> systems do (Kafka streams, Flink etc), in order to facilitate >>>>>>>> interactive queries? >>>>>>>> >>>>>>>> Regards, >>>>>>>> Stavros >>>>>>>> >>>>>>> >>>>>>> >>>>> >>>> >>>