Re: queryable state & streaming

kant kodali Sat, 16 Mar 2019 19:42:57 -0700

Any update on this?

On Wed, Oct 24, 2018 at 4:26 PM Arun Mahadevan <[email protected]> wrote:


> I don't think a separate API or RPCs would be necessary for queryable
> state if the state can be exposed as just another data source. SQL
> queries can then be issued against it just like against any other data
> source.
>
> For now, I think the "memory" sink could be used and queries run against
> it, but I agree it does not scale for large states.
>
> On Sun, 21 Oct 2018 at 21:24, Jungtaek Lim <[email protected]> wrote:
>
>> Spark doesn't seem to have workarounds other than storing output in
>> external storage, so +1 on having this.
>>
>> My major concern with implementing queryable state in Structured Streaming
>> is: "Are all states available on executors at any time while the query is
>> running?" Querying state shouldn't affect the running query. Given that
>> state may be huge and the default state provider loads state into memory,
>> we may not want to load one more redundant snapshot of state: we want to
>> always read the "current state" that the query is also using. (For sure,
>> queryable state should be read-only.)
>>
>> Regarding improvement of local state, I guess it would be ideal to
>> leverage an embedded DB, as Kafka Streams and Flink do. The difference
>> will be not only reading state from off-heap storage, but also how to take
>> a snapshot and store deltas. We may want to check that snapshotting works
>> well with a small batch interval, and find an alternative approach when it
>> doesn't. This sounds like a huge item that can be handled separately.
>>
>> - Jungtaek Lim (HeartSaVioR)
>>
>> On Sat, Dec 9, 2017 at 10:51 PM, Stavros Kontopoulos <
>> [email protected]> wrote:
>>
>>> Nice, I was looking for a JIRA. I agree we should justify why we are
>>> building something. In that direction, here is what I have seen in my
>>> experience.
>>> People quite often use state within their streaming app and may have
>>> large states (TBs). Shortening the pipeline by not having to copy data (to
>>> Cassandra, for example, for serving) is an advantage, at least in terms of
>>> latency and complexity.
>>> This holds if we take advantage of state checkpointing (locally this could
>>> be RocksDB, or in general HDFS; the latter is currently supported) along
>>> with an API to query data efficiently.
>>> Some use cases I see:
>>>
>>> - real-time dashboards and real-time reporting, the faster the better
>>> - monitoring of state for operational reasons, app health, etc.
>>> - integrating with external services via an API, e.g. making
>>> aggregations over time windows accessible to some third-party service
>>> within your system
>>>
>>> Regarding requirements, here are some of them:
>>> - support for an API to expose state (could be done at the Spark driver),
>>> such as REST.
>>> - support for dynamic allocation (not sure how it affects state
>>> management)
>>> - an efficient way to talk to executors to get the state (RPC?)
>>> - making local state more efficient and more easily accessible with an
>>> embedded DB (I don't think this is supported from what I see, but I may
>>> be wrong).
>>> Some people are already working with such technologies and some of that
>>> work could be reused: https://issues.apache.org/jira/browse/SPARK-20641
>>>
>>> Best,
>>> Stavros
>>>
>>>
>>> On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust <
>>> [email protected]> wrote:
>>>
>>>> https://issues.apache.org/jira/browse/SPARK-16738
>>>>
>>>> I don't believe anyone is working on it yet.  I think the most useful
>>>> thing is to start enumerating requirements and use cases and then we can
>>>> talk about how to build it.
>>>>
>>>> On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
>>>> [email protected]> wrote:
>>>>
>>>>> Cool, Burak. Do you have a pointer? Should I take the initiative on a
>>>>> first design document, or is Databricks working on it?
>>>>>
>>>>> Best,
>>>>> Stavros
>>>>>
>>>>> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz <[email protected]> wrote:
>>>>>
>>>>>> Hi Stavros,
>>>>>>
>>>>>> Queryable state is definitely on the roadmap! We will revamp the
>>>>>> StateStore API a bit, and a queryable StateStore is definitely one of the
>>>>>> things we are thinking about during that revamp.
>>>>>>
>>>>>> Best,
>>>>>> Burak
>>>>>>
>>>>>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Just to re-phrase my question: Would query-able state make a viable
>>>>>>> SPIP?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Stavros
>>>>>>>
>>>>>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Maybe this has been discussed before. Given that many streaming
>>>>>>>> apps out there use state extensively, would it be a good idea to
>>>>>>>> make Spark expose streaming state through an external API, as other
>>>>>>>> systems do (Kafka Streams, Flink, etc.), in order to facilitate
>>>>>>>> interactive queries?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Stavros
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
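
One requirement raised in the thread is that queryable state should be
read-only and should expose the "current state" the running query uses,
rather than a second copy. A minimal sketch of that idea in plain Python
follows. The class InMemoryStateStore and its methods are hypothetical names
invented for illustration; this is not Spark's StateStore API, and it ignores
distribution across executors, concurrency, and checkpointing.

```python
from types import MappingProxyType


class InMemoryStateStore:
    """Hypothetical stand-in for a per-partition streaming state store."""

    def __init__(self):
        self._state = {}

    def update(self, key, value):
        # Assumed to be called only by the running query as it processes data.
        self._state[key] = value

    def read_only_view(self):
        # MappingProxyType wraps the live dict without copying it, so readers
        # see the current values but cannot modify them.
        return MappingProxyType(self._state)


store = InMemoryStateStore()
view = store.read_only_view()

store.update("user-1", 3)
print(view["user-1"])  # the view reflects the update made after it was created
```

The view is created once and keeps tracking later updates, so no redundant
snapshot is loaded; an attempt to assign through the view raises TypeError.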
