+1

Sent from my iPhone


------------------ Original ------------------
From: Jungtaek Lim <kabhwan.opensou...@gmail.com>
Date: Tue, Oct 24, 2023 6:47 AM
To: dev <dev@spark.apache.org>
Subject: Re: [DISCUSS] SPIP: State Data Source - Reader



FYI: the VOTE thread is open. Please check the link
https://lists.apache.org/thread/7ohctj1gmqbhds56bntf4s2zst5qpll1 (committer+ can
log in to reply) or search for "[VOTE] SPIP: State Data Source - Reader" in
your inbox. Every vote would be really appreciated!

On Mon, Oct 23, 2023 at 1:06 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

I don't see major comments as of now. Given that the thread was initiated more
than 10 days ago and I see multiple supporters, I'm going to initiate a VOTE
thread.

Please participate in the VOTE thread as well. Thanks!


On Thu, Oct 19, 2023 at 11:39 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

Also, I want to repeat here the comment Liang-Chi left in the SPIP doc, as it
is a general question that comes up for every new data source. I'd like to
address it for everyone.


As far as I know, the author implemented a third-party tool for querying the
state store as a data source a long time ago. I've suggested that tool to some
users before. It is a useful tool for special cases because there is no other
tool/feature for the purpose.
I think for such an effort to add a new data source, one usual question is why
it has to be in the Spark repo instead of being a third-party tool, especially
since this is not a frequently used one. Even for Structured Streaming users,
it is only rarely necessary to look into the state store content.

I think we do not expect the data source to be used rarely. We see two major
use cases: 1) unit tests against stateful queries, and 2) looking into the
state during an incident to get full context. 2) is probably not something
users encounter frequently, so it is valid to say that this part of the feature
may not be used often. But 1) is definitely tied to daily work.


Also, even for 2), it looks to be an essential feature that has to be provided
out of the box. Let's say this feature does not exist and a user encounters an
incident in production with a stateful query. During RCA, they realize that
state is a black box and their only option is to deduce the value of the state
indirectly, most likely requiring them to modify the query heavily and feed in
artificial inputs. If I were such a user, I would consider this gap a
fundamental issue of SS. It has been available out of the box in Flink for
years (State Processor), so it also makes sense from a competitive standpoint.


We see this effort as a stepping stone. As the comments in the SPIP doc and the
previous replies show, people also see the proposal as groundwork for the
writer part, which would give us a chance to break the strong assumption of a
fixed number of shuffle partitions. I'd argue that this is a rather fundamental
limitation of SS, and I have seen so many complaints about it. I don't feel it
is right to delegate solving a fundamental issue to a 3rd party. This is
probably a stronger argument than the reader part itself.


Here's another aspect: during the work, we observed gaps in checkpointing, e.g.
the information about prefix scan does not exist in the checkpoint, which makes
a big difference when restoring the state from the state file. When we come to
state repartitioning, the repartitioning is based on the grouping keys in the
operator (not the state key), hence we will need additional information for
that as well. If this feature lives in a 3rd party project, it will be very
painful to make changes on both sides together. It also brings up another
headache: versioning and a compatibility matrix.


I hope this helps persuade people to add this to the Spark repo rather than
leaving it to live on its own.



On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

Thanks Raghu for your support!

Btw, I'd also like to relay the support from the JIRA ticket itself: I see
support from Chaoqin and Praveen. Thanks both!






On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi <raghu.ang...@databricks.com>
wrote:

+1 overall and a big +1 to keeping offline state-rebalancing as a primary
use case.


Raghu.


On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <bartkoniec...@gmail.com>
wrote:

Thank you, Jungtaek, for your answers! It's clear now.



+1 for me. It seems like a prerequisite for further ops-related improvements
to state store management. I'm thinking especially of state rebalancing, which
could rely on this read+write state store API. I don't mean dynamic state
rebalancing, which could probably be implemented with lower latency directly
in the stateful API. Instead, I'm thinking more of an offline job that
rebalances the state, after which the stateful pipeline is restarted with the
changed number of shuffle partitions.


Best,

Bartosz.



On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

bump for better reach

On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

Sorry, please use this link instead for the SPIP doc:
https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing




On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
wrote:

Hi dev,

I'd like to start a discussion on "State Data Source - Reader".



This proposal aims to introduce a new data source, "statestore", which enables
reading the state rows from an existing checkpoint via an offline (batch)
query. This will enable users to 1) create unit tests against stateful queries
that verify the state value (especially for flatMapGroupsWithState), and
2) gather more context on the status when an incident occurs, especially for
incorrect output.


SPIP: https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
JIRA: https://issues.apache.org/jira/browse/SPARK-45511



Looking forward to your feedback!


Thanks,
Jungtaek Lim (HeartSaVioR)



P.S. The scope of the project is narrowed to the reader in this SPIP, since the
writer requires us to consider more cases. We are planning to work on it.

 
 
 


-- 
Bartosz Konieczny
freelance data engineer
https://www.waitingforcode.com
https://github.com/bartosz25/
https://twitter.com/waitingforcode
