FYI: the VOTE thread is open. Please check https://lists.apache.org/thread/7ohctj1gmqbhds56bntf4s2zst5qpll1 (committers and above can log in to reply), or search for "[VOTE] SPIP: State Data Source - Reader" in your inbox. Every vote would be really appreciated!
On Mon, Oct 23, 2023 at 1:06 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> I don't see major comments as of now. Given that the thread was initiated
> more than 10 days ago and I see multiple supporters, I'm going to initiate
> a VOTE thread.
>
> Please participate in the VOTE thread as well. Thanks!
>
> On Thu, Oct 19, 2023 at 11:39 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Also, I want to replicate the comment Liang-Chi put into the SPIP doc, as it
>> is a rather general and common question for every new data source
>> addition. Hence I want to sort it out for everyone.
>>
>>> As I know, the author implemented a third-party tool for querying the state
>>> store as a data source a long time ago. I've suggested that some users use
>>> the tool before. It is a useful tool for special cases because there is no
>>> other tool/feature for the purpose.
>>> I think for such an effort to add a new data source, one usual question is
>>> why it has to be in the Spark repo instead of a third-party tool, especially
>>> since this is not a frequently used one. Even for Structured Streaming users,
>>> it is only rarely necessary to look into state store content.
>>
>>
>> I think we do not expect the data source to be used rarely. We see two
>> different major use cases: 1) unit tests against stateful queries, and 2)
>> looking into the state during an incident to get full context. 2) is probably
>> not something users encounter frequently, hence it is valid to say the
>> new feature may not be used frequently. But 1) is definitely something we
>> can say is tied to daily work.
>>
>> Also, even for 2), it looks to be an essential feature that has to be
>> provided out of the box. Let's say this feature does not exist and a user
>> encounters an incident in production with a stateful query.
>> During RCA, they realize that state is a black box and their only option is
>> deducing the value of the state indirectly, most likely requiring them to
>> modify the query heavily and feed artificial inputs. If I were such a user, I
>> would consider this lack a fundamental issue of SS. It has been available out
>> of the box in Flink for years (State Processor), so it also makes sense from
>> a competitive standpoint.
>>
>> We are seeing this effort as a stepping stone. As we see in comments in the
>> SPIP doc and also in previous replies, people also see the proposal as prior
>> work for the writer part, which would give us a chance to break the strong
>> preconception of a fixed number of shuffle partitions. I'd argue that this
>> is a rather fundamental limitation of SS and I have seen so many complaints
>> about this. I don't feel it is right to delegate to a 3rd party to
>> solve the fundamental issue. This is probably stronger evidence than the
>> reader part.
>>
>> Here's another aspect: during the work, we observed gaps in
>> checkpointing, e.g. the information about prefix scan does not exist in the
>> checkpoint, which makes a big difference when restoring the state from the
>> state file. When we come to state repartitioning, the repartition is
>> based on the grouping keys in the operator (not the state key), hence we
>> will also need additional information for that. If this feature goes into
>> a 3rd party, it will be very painful to make the changes on both sides
>> together. It also brings up another headache: versioning and a compatibility
>> matrix.
>>
>> I hope this helps persuade people to add this to the Spark repo
>> rather than have it live on its own.
>>
>>
>> On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Thanks Raghu for your support!
>>>
>>> Btw, I'd like to replicate the support from the JIRA ticket itself; I see
>>> support from Chaoqin and Praveen. Thanks both!
>>>
>>>
>>> On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi <
>>> raghu.ang...@databricks.com> wrote:
>>>
>>>> +1 overall and a big +1 to keeping offline state-rebalancing as a
>>>> primary use case.
>>>>
>>>> Raghu.
>>>>
>>>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>>>> bartkoniec...@gmail.com> wrote:
>>>>
>>>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>>>
>>>>> +1 for me. It seems like a prerequisite for further ops-related
>>>>> improvements for the state store management. I mean especially here the
>>>>> state rebalancing that could rely on this read+write state store API. I
>>>>> don't mean here the dynamic state rebalancing that could probably be
>>>>> implemented with a lower latency directly in the stateful API. Instead I'm
>>>>> thinking more of an offline job to rebalance the state and later restart
>>>>> the stateful pipeline with the changed number of shuffle partitions.
>>>>>
>>>>> Best,
>>>>> Bartosz.
>>>>>
>>>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> bump for better reach
>>>>>>
>>>>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry, please use this link instead for SPIP doc:
>>>>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi dev,
>>>>>>>>
>>>>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>>>>
>>>>>>>> This proposal aims to introduce a new data source "statestore"
>>>>>>>> which enables reading the state rows from existing checkpoint via
>>>>>>>> offline (batch) query.
>>>>>>>> This will enable users to 1) create unit tests against
>>>>>>>> stateful query verifying the state value (especially
>>>>>>>> flatMapGroupsWithState), 2) gather more context on the status when an
>>>>>>>> incident occurs, especially for incorrect output.
>>>>>>>>
>>>>>>>> *SPIP*:
>>>>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>>>>
>>>>>>>> Looking forward to your feedback!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>
>>>>>>>> ps. The scope of the project is narrowed to the reader in this
>>>>>>>> SPIP, since the writer requires us to consider more cases. We are
>>>>>>>> planning on it.
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Bartosz Konieczny
>>>>> freelance data engineer
>>>>> https://www.waitingforcode.com
>>>>> https://github.com/bartosz25/
>>>>> https://twitter.com/waitingforcode
>>>>>
>>>>>