Re: [DISCUSS] SPIP: State Data Source - Reader

Jungtaek Lim Mon, 23 Oct 2023 15:47:34 -0700

FYI: The VOTE thread is open. Please check the link
https://lists.apache.org/thread/7ohctj1gmqbhds56bntf4s2zst5qpll1
(committer+ can log in to reply) or search for
"[VOTE] SPIP: State Data Source - Reader" in your inbox.
Every vote would be really appreciated!


On Mon, Oct 23, 2023 at 1:06 PM Jungtaek Lim <[email protected]>
wrote:

> I don't see major comments as of now. Given that the thread was initiated
> more than 10 days ago and I see multiple supporters, I'm going to initiate
> a VOTE thread.
>
> Please participate in the VOTE thread as well. Thanks!
>
> On Thu, Oct 19, 2023 at 11:39 AM Jungtaek Lim <
> [email protected]> wrote:
>
>> Also, I want to replicate the comment Liang-Chi put into the SPIP doc, as
>> it is a rather general and common question for every new data source.
>> Hence I want to sort it out for everyone.
>>
>>> As far as I know, the author implemented a third-party tool for querying
>>> the state store as a data source a long time ago. I've suggested that some
>>> users use the tool before. It is a useful tool for special cases because
>>> there is no other tool/feature for the purpose.
>>> I think for such an effort to add a new data source, one usual question is
>>> why it has to be in the Spark repo instead of being a third-party tool,
>>> especially since this is not a frequently used one. Even for Structured
>>> Streaming users, it is only rarely necessary to look into state store
>>> content.
>>
>>
>> I don't think we expect the data source to be used rarely. We see two
>> major use cases: 1) unit tests against stateful queries, and 2) looking
>> into the state during an incident to get full context. 2) is probably not
>> something users encounter frequently, so it is valid to say the new
>> feature may not be used frequently for that purpose. But 1) is definitely
>> tied to daily work.
>>
>> Also, even for 2), it looks to be an essential feature that has to be
>> provided out of the box. Let's say this feature does not exist and a user
>> encounters an incident in production with a stateful query. During RCA,
>> they realize that the state is a black box and their only option is
>> deducing the value of the state indirectly, most likely requiring them to
>> modify the query heavily and feed it artificial inputs. If I were such a
>> user, I would consider this gap a fundamental issue of SS. It has been
>> available out of the box in Flink for years (State Processor), so it also
>> makes sense from a competitive standpoint.
>>
>> We see this effort as a stepping stone. As the comments in the SPIP doc
>> and the previous replies show, people also see the proposal as groundwork
>> for the writer part, which would give us a chance to break the
>> long-standing assumption of a fixed number of shuffle partitions. I'd argue
>> that this is a rather fundamental limitation of SS, and I have seen many
>> complaints about it. I don't feel it is right to delegate solving such a
>> fundamental issue to a 3rd party. This is probably a stronger argument
>> than the reader part itself.
>>
>> Here's another aspect: during the work, we found gaps in checkpointing,
>> e.g. the information about prefix scan does not exist in the checkpoint,
>> which makes a big difference when restoring the state from the state
>> file. When it comes to state repartitioning, the repartitioning is based
>> on the grouping keys in the operator (not the state key), hence we will
>> need additional information for that as well. If this feature lives in a
>> 3rd party project, it will be very painful to make the changes on both
>> sides together. It also brings another headache: a versioning and
>> compatibility matrix.
>>
>> I hope this helps persuade people to add this to the Spark repo rather
>> than leaving it to live on its own.
>>
>>
>> On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim <
>> [email protected]> wrote:
>>
>>> Thanks Raghu for your support!
>>>
>>> Btw, I'd like to relay the support from the JIRA ticket itself: I see
>>> support from Chaoqin and Praveen. Thanks both!
>>>
>>>
>>>
>>> On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi <
>>> [email protected]> wrote:
>>>
>>>> +1 overall and a big +1 to keeping offline state-rebalancing as a
>>>> primary use case.
>>>>
>>>> Raghu.
>>>>
>>>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>>>> [email protected]> wrote:
>>>>
>>>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>>>
>>>>> +1 for me. It seems like a prerequisite for further ops-related
>>>>> improvements to state store management. I mean especially the state
>>>>> rebalancing that could rely on this read+write state store API. I
>>>>> don't mean dynamic state rebalancing, which could probably be
>>>>> implemented with lower latency directly in the stateful API. Instead,
>>>>> I'm thinking more of an offline job that rebalances the state, after
>>>>> which the stateful pipeline is restarted with the changed number of
>>>>> shuffle partitions.
>>>>>
>>>>> Best,
>>>>> Bartosz.
>>>>>
>>>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> bump for better reach
>>>>>>
>>>>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Sorry, please use this link instead for SPIP doc:
>>>>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi dev,
>>>>>>>>
>>>>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>>>>
>>>>>>>> This proposal aims to introduce a new data source "statestore"
>>>>>>>> which enables reading the state rows from an existing checkpoint
>>>>>>>> via an offline (batch) query. This will enable users to 1) create
>>>>>>>> unit tests against stateful queries that verify the state value
>>>>>>>> (especially flatMapGroupsWithState), and 2) gather more context on
>>>>>>>> the status when an incident occurs, especially for incorrect output.
>>>>>>>>
>>>>>>>> *SPIP*:
>>>>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>>>>
>>>>>>>> Looking forward to your feedback!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>
>>>>>>>> ps. The scope of the project is narrowed to the reader in this
>>>>>>>> SPIP, since the writer requires us to consider more cases. We are
>>>>>>>> planning on it.
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Bartosz Konieczny
>>>>> freelance data engineer
>>>>> https://www.waitingforcode.com
>>>>> https://github.com/bartosz25/
>>>>> https://twitter.com/waitingforcode
>>>>>
>>>>>
