Re: queryable state & streaming

Stavros Kontopoulos Wed, 24 Apr 2019 11:25:32 -0700

Michael,
I have listed use cases above. Should we proceed with a design doc?


Best,
Stavros

On Mon, Mar 18, 2019, 12:21 PM Stavros Kontopoulos <
[email protected]> wrote:

> Not really. If we agree that we want this, I can put together a design
> document and take it from there. There was also a discussion in another
> thread about adding RocksDB as a memory storage, which is related to this task.
>
> Best,
> Stavros
>
> On Sun, Mar 17, 2019 at 4:42 AM kant kodali <[email protected]> wrote:
>
>> Any update on this?
>>
>> On Wed, Oct 24, 2018 at 4:26 PM Arun Mahadevan <[email protected]> wrote:
>>
>>> I don't think a separate API or RPCs would be necessary for queryable
>>> state if the state can be exposed as just another data source. Then SQL
>>> queries can be issued against it just like against any other data
>>> source.
>>>
>>> For now I think the "memory" sink could be used, with queries run
>>> against it, but I agree it does not scale for large states.
>>>
>>> On Sun, 21 Oct 2018 at 21:24, Jungtaek Lim <[email protected]> wrote:
>>>
>>>> Spark doesn't seem to have workarounds other than storing output in
>>>> external storage, so +1 on having this.
>>>>
>>>> My major concern about implementing queryable state in Structured
>>>> Streaming is: "Are all states available on executors at any time while the
>>>> query is running?" Querying state shouldn't affect the running query. Given
>>>> that state is huge and the default state provider loads state in memory, we
>>>> may not want to load one more redundant snapshot of state: we want to always
>>>> read the "current state" that the query is also using. (For sure, queryable
>>>> state should be read-only.)
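
To make the "read-only view of the current state, without a second copy" idea concrete, here is a minimal sketch in plain Python. It is not Spark code: `StateStore`, `update` and `read_only_view` are hypothetical names invented for illustration, and a real state provider would also have to deal with concurrency and versioning, which this ignores.

```python
from types import MappingProxyType


class StateStore:
    """Hypothetical in-memory state store (illustration only, not Spark's API)."""

    def __init__(self):
        self._state = {}

    def update(self, key, value):
        # Only the running query is supposed to call this.
        self._state[key] = value

    def read_only_view(self):
        # MappingProxyType wraps the live dict without copying it,
        # and rejects item assignment and deletion.
        return MappingProxyType(self._state)


store = StateStore()
store.update("user-1", 3)
view = store.read_only_view()

# Updates made after the view was created are still visible through it,
# because the view shares the underlying dict instead of snapshotting it.
store.update("user-2", 5)

try:
    view["user-3"] = 1
    write_rejected = False
except TypeError:
    write_rejected = True
```

The point of the sketch is only that a reader can observe the same state the query is using, with writes rejected, without loading a redundant snapshot.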
>>>>
>>>> Regarding improvement of local state, I guess it is ideal to leverage
>>>> an embedded DB, like Kafka and Flink are doing. The difference will be not
>>>> only reading state from off-heap, but also how to take a snapshot and store
>>>> a delta. We may want to check that snapshotting works well with a small
>>>> batch interval, and find an alternative approach when it doesn't. It sounds
>>>> like a huge item that can be handled separately.
>>>>
>>>> - Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Sat, Dec 9, 2017 at 10:51 PM, Stavros Kontopoulos <
>>>> [email protected]> wrote:
>>>>
>>>>> Nice, I was looking for a JIRA. I agree we should justify why we are
>>>>> building something. In that direction, here is what I have seen in my
>>>>> experience.
>>>>> People quite often use state within their streaming apps and may have
>>>>> large states (TBs). Shortening the pipeline by not having to copy data (to
>>>>> Cassandra, for example, for serving) is an advantage, at least in terms of
>>>>> latency and complexity.
>>>>> This can be true if we take advantage of state checkpointing (locally this
>>>>> could be RocksDB, or in general HDFS; the latter is currently supported)
>>>>> along with an API to efficiently query data.
>>>>> Some use cases I see:
>>>>>
>>>>> - real-time dashboards and real-time reporting, the faster the better
>>>>> - monitoring of state for operational reasons, app health, etc.
>>>>> - integrating with external services via an API, e.g. making
>>>>> aggregations over time windows accessible to some third-party service
>>>>> within your system
>>>>>
>>>>> Regarding requirements, here are some of them:
>>>>> - support for an API to expose state (could be done at the Spark
>>>>> driver), such as REST.
>>>>> - supporting dynamic allocation (not sure how it affects state
>>>>> management)
>>>>> - an efficient way to talk to executors to get the state (RPC?)
>>>>> - making local state more efficient and more easily accessible with an
>>>>> embedded DB (I don't think this is supported from what I see, but I may
>>>>> be wrong)
>>>>> Some people are already working with such technologies and some of that
>>>>> work could be re-used: https://issues.apache.org/jira/browse/SPARK-20641
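
As a rough illustration of the "driver exposes an API, then asks the right executor for the state" requirement, the sketch below routes a key lookup to the partition that would hold it. Everything here is an assumption made for illustration: `partition_for`, `DriverLookup` and the CRC32-based hashing are hypothetical, the per-partition dicts stand in for executor-local state, and a real design would replace the direct dict access with an RPC and use Spark's own partitioning.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic hash so the same key always maps to the same partition.
    # (CRC32 is a stand-in; it is not what Spark uses for partitioning.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class DriverLookup:
    """Hypothetical driver-side entry point for point lookups on state."""

    def __init__(self, num_partitions: int):
        # One dict per partition, standing in for state held on executors.
        self.partitions = [dict() for _ in range(num_partitions)]

    def put(self, key: str, value):
        # In a real system only the streaming query would write state.
        self.partitions[partition_for(key, len(self.partitions))][key] = value

    def get(self, key: str):
        # The driver only needs to contact the single partition owning the key.
        return self.partitions[partition_for(key, len(self.partitions))].get(key)


driver = DriverLookup(num_partitions=4)
driver.put("window-2017-12-09", 42)
result = driver.get("window-2017-12-09")
missing = driver.get("no-such-key")
```

A REST endpoint at the driver could wrap `get` directly; how the lookup behaves when executors come and go under dynamic allocation is exactly the open question in the list above and is not addressed here.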
>>>>>
>>>>> Best,
>>>>> Stavros
>>>>>
>>>>>
>>>>> On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-16738
>>>>>>
>>>>>> I don't believe anyone is working on it yet.  I think the most useful
>>>>>> thing is to start enumerating requirements and use cases and then we can
>>>>>> talk about how to build it.
>>>>>>
>>>>>> On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Cool. Burak, do you have a pointer? Should I take the initiative for a
>>>>>>> first design document, or is Databricks working on it?
>>>>>>>
>>>>>>> Best,
>>>>>>> Stavros
>>>>>>>
>>>>>>> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Stavros,
>>>>>>>>
>>>>>>>> Queryable state is definitely on the roadmap! We will revamp the
>>>>>>>> StateStore API a bit, and a queryable StateStore is definitely one of
>>>>>>>> the things we are thinking about during that revamp.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Burak
>>>>>>>>
>>>>>>>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Just to re-phrase my question: Would query-able state make a
>>>>>>>>> viable SPIP?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Stavros
>>>>>>>>>
>>>>>>>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Maybe this has been discussed before. Given that many streaming
>>>>>>>>>> apps out there use state extensively, it could be a good idea to
>>>>>>>>>> make Spark expose streaming state through an external API, as
>>>>>>>>>> other systems do (Kafka Streams, Flink, etc.), in order to
>>>>>>>>>> facilitate interactive queries?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Stavros
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>
>
