Hi Bill, all,

Thank you. I've updated the KIP to reflect pausing standby tasks as well. I think all the outstanding points have been addressed and I'm going to start the vote thread!

Cheers,

Jim

On Tue, May 10, 2022 at 2:43 PM Bill Bejeck <bbej...@gmail.com> wrote:

> Hi Jim,
>
> After reading the comments on the KIP, I agree that it makes sense to pause all activities and any changes can be made later on.
>
> Thanks,
> Bill
>
> On Tue, May 10, 2022 at 4:03 AM Bruno Cadonna <cado...@apache.org> wrote:
>
>> Hi Jim,
>>
>> Thanks for the KIP!
>>
>> I am fine with the KIP in general.
>>
>> However, I am with Sophie and John to also pause the standbys for the reasons they brought up. Is there a specific reason you want to keep standbys going? It feels like premature optimization to me. We can still add keeping standby running in a follow up if needed.
>>
>> Best,
>> Bruno
>>
>> On 10.05.22 05:15, Sophie Blee-Goldman wrote:
>>> Thanks Jim, just one note/question on the standby tasks:
>>>
>>>> At the minute, my moderately held position is that standby tasks ought to continue reading and remain caught up. If standby tasks would run out of space, there are probably bigger problems.
>>>
>>> For a single node application, or when the #pause API is invoked on all instances, then there won't be any further active processing and thus nothing to keep up with, right? So for that case, it's just a matter of whether any standbys that are lagging will have the chance to catch up to the (paused) active task state before they stop as well, in which case having them continue feels fine to me. However this is a relatively trivial benefit and I would only consider it as a deciding factor when all things are equal otherwise.
>>>
>>> My concern is the more interesting case: when this feature is used to pause only one node, or some subset of the overall application. In this case, yes, the standby tasks will indeed fall out of sync.
>>> But the only reason I can imagine someone using the pause feature in such a way is because there is something going wrong, or about to go wrong, on that particular node. For example as mentioned above, if the user wants to cut down on costs without stopping everything, or if the node is about to run out of disk or needs to be debugged or so on. And in this case, continuing to process the standby tasks while other instances continue to run would pretty much defeat the purpose of pausing it entirely, and might have unpleasant consequences for the unsuspecting developer.
>>>
>>> All that said, I don't want to block this KIP so if you have strong feelings about the standby behavior I'm happy to back down. I'm only pushing back now because it felt like there wasn't any particular motivation for the standbys to continue processing or not, and I figured I'd try to fill in this gap with my thoughts on the matter :)
>>> Either way we should just make sure that this behavior is documented clearly, since it may be surprising if we decide to only pause active processing (another option is to rename the method something like #pauseProcessing or #pauseActiveProcessing so that it's hard to miss).
>>>
>>> Thanks! Sorry for the lengthy response, but hopefully we won't need to debate this any further. Beyond this I'm satisfied with the latest proposal
>>>
>>> On Mon, May 9, 2022 at 5:16 PM John Roesler <vvcep...@apache.org> wrote:
>>>
>>>> Thanks for the updates, Jim!
>>>>
>>>> After this discussion and your updates, this KIP looks good to me.
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>> On Mon, May 9, 2022, at 17:52, Jim Hughes wrote:
>>>>> Hi Sophie, all,
>>>>>
>>>>> I've updated the KIP with feedback from the discussion so far:
>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
>>>>>
>>>>> As a terse summary of my current position:
>>>>> Pausing will only stop processing and punctuation (respecting modular topologies).
>>>>> Paused topologies will still a) consume from input topics, b) call the usual commit pathways (commits will happen basically as they would have), and c) standBy tasks will still be processed.
>>>>>
>>>>> Shout if the KIP or those details still need some TLC. Responding to Sophie inline below.
>>>>>
>>>>> On Mon, May 9, 2022 at 6:06 PM Sophie Blee-Goldman <sop...@confluent.io.invalid> wrote:
>>>>>
>>>>>> Don't worry, I'm going to be adding the APIs for topology-level pausing as part of the modular topologies KIP, so we don't need to worry about that for now. That said, I don't think we should brush it off entirely and design this feature in a way that's going to be incompatible or hugely raise the LOE on bringing the (mostly already implemented) modular topologies feature into the public API, just because it "won the race to write a KIP" :)
>>>>>
>>>>> Yes, I'm hoping that this is all compatible with modular topologies. I haven't seen anything so far which seems to be a problem; this KIP is just in a weird state to discuss details of acting on modular topologies. :)
>>>>>
>>>>>> I may be biased (ok, I definitely am), but I'm not in favor of adding this as a state regardless of the modular topologies.
>>>>>> First of all any change to the KafkaStreams state machine is a breaking change, no? So we would have to wait until the next major release which seems like an unnecessary thing to block on. (Whether to add this as a state to the StreamThread's FSM is an implementation detail).
>>>>>
>>>>> +1. I am sold on skipping out on new states. I had that as a rejected alternative in the KIP and have added a few more words to that bit.
>>>>>
>>>>>> Also, the semantics of using an `isPaused` method to distinguish a paused instance (or topology) make more sense to me -- this is a user-specified status, whereas the KafkaStreams state is intended to relay the status of the system itself. For example, if we are going to continue to poll during pause, then shouldn't the client transition to REBALANCING? I believe it makes sense to still allow distinguishing these states while a client is paused, whereas making PAUSED its own state means you can't tell when the client is rebalancing vs running, or whether it is paused or dead: presumably the NOT_RUNNING/ERROR state would trump the PAUSED state, which means you would not be able to rely on checking the state to see if you had called PAUSED on that instance. Obviously you can work around this by just maintaining a flag in the usercode, but all this feels very unnatural to me vs just checking the `#isPaused` API.
>>>>>>
>>>>>> On that note, I had one question -- at what point would the `#isPaused` check return true? Would it do so immediately after pausing the instance, or only once it has finished committing offsets and stopped returning records?
>>>>>
>>>>> Immediately, `#isPaused` tells you about metadata.
>>>>>
>>>>>> Finally, on the note of punctuators I think it would make most sense to either pause these as well or else add this as an explicit option for the user. If this feature is used to, for example, help save on processing costs while an app is not in use, then it would probably be surprising and perhaps alarming to see certain kinds of processing still continue.
>>>>>
>>>>> From other parts of the discussion, I'm sold on pausing punctuation.
>>>>>
>>>>>> The question of whether to continue fetching for standby tasks is maybe a bit more debatable, as it would certainly be nice to find your clients all caught up when you go to resume the instance again, but I would still strongly suggest pausing these as well. To use a similar example, imagine if you paused an app because it was about to run out of disk. If the standbys kept processing and filled up the remaining space, you'd probably feel a bit betrayed by this API.
>>>>>>
>>>>>> WDYT?
>>>>>
>>>>> At the minute, my moderately held position is that standby tasks ought to continue reading and remain caught up. If standby tasks would run out of space, there are probably bigger problems.
>>>>>
>>>>> If later it is desirable to manage punctuation or standby tasks, then it should be easy for future folks to modify things.
>>>>>
>>>>> Overall, I'd frame this KIP as "pause processing resulting in outputs".
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Jim
>>>>>
>>>>>> On Mon, May 9, 2022 at 10:33 AM Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>
>>>>>>> I think for named topology we can leave the scope of this KIP as "all or nothing", i.e.
>>>>>>> when you pause an instance you pause all of its topologies. I raised this question in my previous email just trying to clarify if this is what you have in mind. We can leave the question of finer controlled pausing behavior for later when we have named topology being exposed via another KIP.
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>> On Mon, May 9, 2022 at 7:50 AM John Roesler <vvcep...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi Jim,
>>>>>>>>
>>>>>>>> Thanks for the replies. This all sounds good to me. Just two further comments:
>>>>>>>>
>>>>>>>> 3. It seems like you should aim for the simplest semantics. If the intent is to “pause” the instance, then you’d better pause the whole instance. If you leave punctuations and standbys running, I expect we’d see bug reports come in that the instance isn’t really paused.
>>>>>>>>
>>>>>>>> 5. Since you won the race to write a KIP, I don’t think it makes too much sense to worry too much about modular topologies. When they propose their KIP, they will have to specify a lot of state management behavior, and pause/resume will have to be part of it. If they have some concern about your KIP, they’ll chime in. It doesn’t make sense for you to try and guess what that proposal will look like.
>>>>>>>>
>>>>>>>> To be honest, you’re proposing a KafkaStreams runtime-level pause/resume function, not a topology-level one anyway, so it seems pretty clear that it would pause the whole runtime (of a single instance) regardless of any modular topologies.
>>>>>>>> If the intent is to pause individual topologies in the future, you’d need a different API anyway.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> -John
>>>>>>>>
>>>>>>>> On Mon, May 9, 2022, at 08:10, Jim Hughes wrote:
>>>>>>>>> Hi John,
>>>>>>>>>
>>>>>>>>> Long emails are great; responding inline!
>>>>>>>>>
>>>>>>>>> On Sat, May 7, 2022 at 4:54 PM John Roesler <vvcep...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the KIP, Jim!
>>>>>>>>>>
>>>>>>>>>> This conversation seems to highlight that the KIP needs to specify some of its behavior as well as its APIs, where the behavior is observable and significant to users.
>>>>>>>>>>
>>>>>>>>>> For example:
>>>>>>>>>>
>>>>>>>>>> 1. Do you plan to have a guarantee that immediately after calling KafkaStreams.pause(), users should observe that the instance stops processing new records? Or should they expect that the threads will continue to process some records and pause asynchronously (you already answered this in the thread earlier)?
>>>>>>>>>
>>>>>>>>> I'm happy to build up to a guarantee of sorts. My current idea is that pause() does not do anything "exceptional" to get control back from a running topology. A currently running topology would get to complete its loop.
>>>>>>>>>
>>>>>>>>> Separately, I'm still piecing together how commits work. By some mechanism, after a pause, I do agree that the topology needs to commit its work in some manner.
>>>>>>>>>
>>>>>>>>>> 2. Will the threads continue to poll new records until they naturally fill up the task buffers, or will they immediately pause their Consumers as well?
>>>>>>>>>
>>>>>>>>> Presently, I'm suggesting that consumers would fill up their buffers.
>>>>>>>>>
>>>>>>>>>> 3. Will threads continue to call (system time) punctuators, or would punctuations also be paused?
>>>>>>>>>
>>>>>>>>> In my first pass at thinking through this, I left the punctuators running. To be honest, I'm not sure what they do, so my approach is either lucky and correct or it could be Very Clearly Wrong. ;)
>>>>>>>>>
>>>>>>>>>> I realize that some of those questions simply may not have occurred to you, so this is not a criticism for leaving them off; I'm just pointing out that although we don't tend to mention implementation details in KIPs, we also can't be too high level, since there are a lot of operational details that users rely on to achieve various behaviors in Streams.
>>>>>>>>>
>>>>>>>>> Ayup, I will add some details as we iron out the guarantees, implementation details that are at the API level. This one is tough since internal features like NamedTopologies are part of the discussion.
>>>>>>>>>
>>>>>>>>>> A couple more comments:
>>>>>>>>>>
>>>>>>>>>> 4. +1 to what Guozhang said. It seems like we should also do a commit before entering the paused state. That way, any open transactions would be closed and not have to worry about timing out. Even under ALOS, it seems best to go ahead and complete the processing of in-flight records by committing. That way, if anything happens to die while it's paused, existing work won't have to be repeated.
>>>>>>>>>> Plus, if there are any processors with side effects, users won't have to tolerate weird edge cases where a pause occurs after a processor sees a record, but before the result is sent to its outputs.
>>>>>>>>>>
>>>>>>>>>> 5. I noticed that you proposed not to add a PAUSED state, but I didn't follow the rationale. Adding a state seems beneficial for a number of reasons: StreamThreads already use the thread state to determine whether to process or not, so avoiding a new State would just mean adding a separate flag to track and then checking your new flag in addition to the State in the thread. Also, operating Streams applications is a non-trivial task, and users rely on the State (and transitions) to understand Streams's behavior. Adding a PAUSED state is an elegant way to communicate to operators what is happening with the application. Note that the person digging through logs and metrics, trying to understand why the application isn't doing anything is probably not going to be the same person who is calling pause() and resume(). Also, if you add a state, you don't need `isPaused()`.
>>>>>>>>>>
>>>>>>>>>> 5b. If you buy the arguments to go ahead and commit as well as the argument to add a State, then I'd also suggest to follow the existing patterns for the shutdown states by also adding PAUSING.
>>>>>>>>>> That way, you'll also expose a way to understand that Streams received the signal to pause, and that it's still processing and committing some records in preparation to enter a PAUSED state. I'm not sure if a RESUMING state would also make sense.
>>>>>>>>>
>>>>>>>>> I hit a tricky bit when thinking through having a PAUSED state... If one is using Named Topologies, and some of them are paused, what state is the Streams instance in? If we can agree on that, things may become clear.... I can see two quick ideas:
>>>>>>>>>
>>>>>>>>> 1. The state is RUNNING and NamedTopologies have some other way to indicate state.
>>>>>>>>>
>>>>>>>>> 2. The state is something messy like PARTIALLY_PAUSED to reflect that the instance has something interesting going on.
>>>>>>>>>
>>>>>>>>> When I poked at things initially, I did try out having different states, and I readily agree that a PAUSING state may make sense. (Especially if there's a need to run commits before transitioning all the way to PAUSED.)
>>>>>>>>>
>>>>>>>>>> And that's all I have to say about that. I hope you don't find my long message offputting. I'm fundamentally in favor of your KIP, and I think with a little more explanation in the KIP, and a few small tweaks to the proposal, we'll be able to provide good ergonomics to our users.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Jim
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>> On Sat, May 7, 2022, at 00:06, Guozhang Wang wrote:
>>>>>>>>>>> I'm in favor of the "just pausing the instance itself" option as well. As for EOS, the point is that when the processing is paused, we would not trigger any `producer.send` during the time, and the transaction timeout is sort of relying on that behavior, so my point was that it's probably better to also commit the processing before we pause it.
>>>>>>>>>>>
>>>>>>>>>>> Guozhang
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 6, 2022 at 6:12 PM Jim Hughes <jhug...@confluent.io.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Matthias,
>>>>>>>>>>>>
>>>>>>>>>>>> Since the only thing which will be paused is processing the topology, I think we can let commits happen naturally.
>>>>>>>>>>>>
>>>>>>>>>>>> Good point about getting the paused state to new members; it is seeming like the "building block" approach is a good one to keep things simple at first.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> Jim
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 6, 2022 at 8:31 PM Matthias J. Sax <mj...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's tricky to propagate a pauseAll() via the rebalance protocol. New members joining the group would need to get paused, too?
>>>>>>>>>>>>> Could there be weird race conditions with overlapping pauseAll() and resumeAll() calls on different instances while there could be errors / network partitions or similar?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would argue that similar to IQ, we provide the basic building blocks, and leave it to the users to implement cross instance management for a pauseAll() scenario. -- Also, if there is really demand, we can always add pauseAll()/resumeAll() as follow up work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> About named topologies: I agree with Jim to not include them in this KIP as they are not a public feature yet. If we make named topologies public, the corresponding KIP should extend the pause/resume feature (ie, APIs) accordingly. Of course, the code can (and should) already be setup to support it to be future proof.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Good call out about commit and EOS -- to simplify it, I think it might be good to commit also for the at-least-once case?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 5/6/22 1:05 PM, Jim Hughes wrote:
>>>>>>>>>>>>>> Hi Bill,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Great questions; I'll do my best to reply inline:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, May 6, 2022 at 3:21 PM Bill Bejeck <bbej...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Jim,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the KIP.
>>>>>>>>>>>>>>> I have a couple of meta-questions as well:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) Regarding pausing only a subset of running instances, I'm thinking there may be a use case for pausing all of them. Would it make sense to also allow for pausing all instances by adding a method `pauseAll()` or something similar?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Honestly, I'm indifferent on this point. Presently, I think what I have proposed is the minimal change to get the ability to pause and resume processing. If adding a 'pauseAll()' is required, I'd be happy to do that!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From Guozhang's email, it sounds like this would require using the rebalance protocol to trigger the coordination. Would there be enough room in that approach to indicate that a named topology is to be paused across all nodes?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) Would pausing affect standby tasks? For example, imagine there are 3 instances A, B, and C. A user elects to pause instance C only but it hosts the standby tasks for A. Would the standby tasks on the paused application continue to read from the changelog topic?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, standby tasks would continue reading from the changelog topic.
>>>>>>>>>>>>>> All consumers would continue reading to avoid getting dropped from their consumer groups.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jim
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, May 6, 2022 at 2:44 PM Jim Hughes <jhug...@confluent.io.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Guozhang,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the feedback; responses inline below:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, May 6, 2022 at 1:09 PM Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello Jim,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for the proposed KIP. I have some meta questions about it:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) Would an instance always pause/resume all of its current owned topologies (i.e. the named topologies), or are there any scenarios where we only want to pause/resume a subset of them?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> An instance may wish to pause some of its named topologies. I was unsure what to say about named topologies in the KIP since they seem to be an internal detail at the moment.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I intend to add to KafkaStreamsNamedTopologyWrapper methods like:
>>>>>>>>>>>>>>>>     public void pauseNamedTopology(final String topologyToPause)
>>>>>>>>>>>>>>>>     public boolean isNamedTopologyPaused(final String topology)
>>>>>>>>>>>>>>>>     public void resumeNamedTopology(final String topologyToResume)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2) From a user's perspective, do we want to always issue a `pause/resume` to all the instances or not? For example, we can define the semantics of the function as "you only need to call this function on any of the application's instances, and all instances would then pause (via the rebalance error codes)", or as "you would call this function for all the instances of an application". Which one are you referring to?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My initial intent is that one would call this function on any instances of the application that one wishes to pause. This should allow more control (in case one wanted to pause a portion of the instances). On the other hand, this approach would put more work on the implementer to coordinate calling pause or resume across instances.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If the other option is more suitable, happy to do that instead.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 3) With EOS, there's a transaction timeout which would determine how long a transaction can stay idle before it's force-aborted on the broker side. I think when a pause is issued, that means we'd need to immediately commit the current transaction for EOS since we do not know how long we could pause for. Is that right? If yes could you please clarify that in the doc as well.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Good point. My intent is for pause() to wait for the next iteration through `runOnce()` and then only skip over the processing for paused tasks in `taskManager.process(numIterations, time)`.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do commits live inside that call or do they live across/outside of it? In the former case, I think there shouldn't be any issues with EOS. Otherwise, we may need to work through some details to get EOS right.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Once we figure that out, I can update the KIP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jim
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, May 4, 2022 at 10:51 AM Jim Hughes <jhug...@confluent.io.invalid> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have written up a KIP for adding the ability to pause and resume the processing of a topology in AK Streams. The KIP is here:
>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks in advance for your feedback!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Jim
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> -- Guozhang
>>>>>>>
>>>>>>> --
>>>>>>> -- Guozhang
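
For anyone skimming the thread, the flag-based approach discussed above (a pause flag checked inside the processing loop instead of a new PAUSED state, `isPaused()` answering immediately, records still buffering while paused, and commits continuing to run) can be sketched roughly as below. This is only a toy model and not Kafka Streams code: `PausableLoop`, `enqueue`, the counters, and this simplified `runOnce` are hypothetical names made up for illustration, and the real StreamThread loop is far more involved.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of a flag-based pause. Hypothetical names; not the Kafka Streams implementation. */
class PausableLoop {
    // User-specified status, kept separate from any lifecycle state machine.
    private final AtomicBoolean paused = new AtomicBoolean(false);
    private final Queue<String> buffer = new ArrayDeque<>();
    private final List<String> processed = new ArrayList<>();
    private int committed = 0;

    public void pause() { paused.set(true); }
    public void resume() { paused.set(false); }

    /** Reflects the request immediately; it does not wait for the loop to notice. */
    public boolean isPaused() { return paused.get(); }

    /** Stands in for polling: records keep arriving in the buffer even while paused. */
    public void enqueue(String record) { buffer.add(record); }

    /** One loop iteration: process buffered records unless paused, then commit. */
    public void runOnce() {
        if (!paused.get()) {
            String record;
            while ((record = buffer.poll()) != null) {
                processed.add(record);
            }
        }
        // The commit step runs whether or not we are paused, so finished work is never left uncommitted.
        committed = processed.size();
    }

    public int processedCount() { return processed.size(); }
    public int committedCount() { return committed; }
    public int bufferedCount() { return buffer.size(); }

    public static void main(String[] args) {
        PausableLoop loop = new PausableLoop();
        loop.enqueue("a");
        loop.runOnce();
        loop.pause();
        loop.enqueue("b");
        loop.runOnce();
        System.out.println("paused=" + loop.isPaused()
                + " processed=" + loop.processedCount()
                + " buffered=" + loop.bufferedCount()
                + " committed=" + loop.committedCount());
        loop.resume();
        loop.runOnce();
        System.out.println("processed after resume=" + loop.processedCount());
    }
}
```

The point of the sketch is only that a flag leaves the lifecycle state untouched, so a caller could still tell RUNNING from REBALANCING while paused; whether standbys and punctuation are skipped too is a separate decision in the loop body.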