Hi Jim, After reading the comments on the KIP, I agree that it makes sense to pause all activities and any changes can be made later on.
Thanks, Bill On Tue, May 10, 2022 at 4:03 AM Bruno Cadonna <cado...@apache.org> wrote: > Hi Jim, > > Thanks for the KIP! > > I am fine with the KIP in general. > > However, I am with Sophie and John to also pause the standbys for the > reasons they brought up. Is there a specific reason you want to keep > standbys going? It feels like premature optimization to me. We can still > add keeping standby running in a follow up if needed. > > Best, > Bruno > > On 10.05.22 05:15, Sophie Blee-Goldman wrote: > > Thanks Jim, just one note/question on the standby tasks: > > > > At the minute, my moderately held position is that standby tasks ought to > >> continue reading and remain caught up. If standby tasks would run out > of > >> space, there are probably bigger problems. > > > > > > For a single node application, or when the #pause API is invoked on all > > instances, > > then there won't be any further active processing and thus nothing to > keep > > up with, > > right? So for that case, it's just a matter of whether any standbys that > > are lagging > > will have the chance to catch up to the (paused) active task state before > > they stop > > as well, in which case having them continue feels fine to me. However > this > > is a > > relatively trivial benefit and I would only consider it as a deciding > > factor when all > > things are equal otherwise. > > > > My concern is the more interesting case: when this feature is used to > pause > > only > > one nodes, or some subset of the overall application. In this case, yes, > > the standby > > tasks will indeed fall out of sync. But the only reason I can imagine > > someone using > > the pause feature in such a way is because there is something going > wrong, > > or about > > to go wrong, on that particular node. For example as mentioned above, if > > the user > > wants to cut down on costs without stopping everything, or if the node is > > about to > > run out of disk or needs to be debugged or so on. And in this case, > > continuing to > > process the standby tasks while other instances continue to run would > > pretty much > > defeat the purpose of pausing it entirely, and might have unpleasant > > consequences > > for the unsuspecting developer. > > > > All that said, I don't want to block this KIP so if you have strong > > feelings about the > > standby behavior I'm happy to back down. I'm only pushing back now > because > > it > > felt like there wasn't any particular motivation for the standbys to > > continue processing > > or not, and I figured I'd try to fill in this gap with my thoughts on the > > matter :) > > Either way we should just make sure that this behavior is documented > > clearly, > > since it may be surprising if we decide to only pause active processing > > (another option > > is to rename the method something like #pauseProcessing or > > #pauseActiveProcessing > > so that it's hard to miss). > > > > Thanks! Sorry for the lengthy response, but hopefully we won't need to > > debate this any > > further. Beyond this I'm satisfied with the latest proposal > > > > On Mon, May 9, 2022 at 5:16 PM John Roesler <vvcep...@apache.org> wrote: > > > >> Thanks for the updates, Jim! > >> > >> After this discussion and your updates, this KIP looks good to me. > >> > >> Thanks, > >> John > >> > >> On Mon, May 9, 2022, at 17:52, Jim Hughes wrote: > >>> Hi Sophie, all, > >>> > >>> I've updated the KIP with feedback from the discussion so far: > >>> > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832 > >>> > >>> As a terse summary of my current position: > >>> Pausing will only stop processing and punctuation (respecting modular > >>> topologies). > >>> Paused topologies will still a) consume from input topics, b) call the > >>> usual commit pathways (commits will happen basically as they would > have), > >>> and c) standBy tasks will still be processed. > >>> > >>> Shout if the KIP or those details still need some TLC. Responding to > >>> Sophie inline below. > >>> > >>> > >>> On Mon, May 9, 2022 at 6:06 PM Sophie Blee-Goldman > >>> <sop...@confluent.io.invalid> wrote: > >>> > >>>> Don't worry, I'm going to be adding the APIs for topology-level > pausing > >> as > >>>> part of the modular topologies KIP, > >>>> so we don't need to worry about that for now. That said, I don't think > >> we > >>>> should brush it off entirely and design > >>>> this feature in a way that's going to be incompatible or hugely raise > >> the > >>>> LOE on bringing the (mostly already > >>>> implemented) modular topologies feature into the public API, just > >>>> because it "won the race to write a KIP" :) > >>>> > >>> > >>> Yes, I'm hoping that this is all compatible with modular topologies. I > >>> haven't seen anything so far which seems to be a problem; this KIP is > >> just > >>> in a weird state to discuss details of acting on modular topologies.:) > >>> > >>> > >>>> I may be biased (ok, I definitely am), but I'm not in favor of adding > >> this > >>>> as a state regardless of the modular topologies. > >>>> First of all any change to the KafkaStreams state machine is a > breaking > >>>> change, no? So we would have to wait until > >>>> the next major release which seems like an unnecessary thing to block > >> on. > >>>> (Whether to add this as a state to the > >>>> StreamThread's FSM is an implementation detail). > >>>> > >>> > >>> +1. I am sold on skipping out on new states. I had that as a rejected > >>> alternative in the KIP and have added a few more words to that bit. > >>> > >>> > >>>> Also, the semantics of using an `isPaused` method to distinguish a > >> paused > >>>> instance (or topology) make more sense > >>>> to me -- this is a user-specified status, whereas the KafkaStreams > >> state is > >>>> intended to relay the status of the system > >>>> itself. For example, if we are going to continue to poll during pause, > >> then > >>>> shouldn't the client transition to REBALANCING? > >>>> I believe it makes sense to still allow distinguishing these states > >> while a > >>>> client is paused, whereas making PAUSED its > >>>> own state means you can't tell when the client is rebalancing vs > >> running, > >>>> or whether it is paused or dead: presumably > >>>> the NOT_RUNNING/ERROR state would trump the PAUSED state, which means > >> you > >>>> would not be able to rely on > >>>> checking the state to see if you had called PAUSED on that instance. > >>>> Obviously you can work around this by just > >>>> maintaining a flag in the usercode, but all this feels very unnatural > >> to me > >>>> vs just checking the `#isPaused` API. > >>>> > >>>> On that note, I had one question -- at what point would the > `#isPaused` > >>>> check return true? Would it do so immediately > >>>> after pausing the instance, or only once it has finished committing > >> offsets > >>>> and stopped returning records? > >>>> > >>> > >>> Immediately, `#isPaused` tells you about metadata. > >>> > >>> > >>>> Finally, on the note of punctuators I think it would make most sense > to > >>>> either pause these as well or else add this an > >>>> an explicit option for the user. If this feature is used to, for > >> example, > >>>> help save on processing costs while an app is > >>>> not in use, then it would probably be surprising and perhaps alarming > to > >>>> see certain kinds of processing still continue. > >>>> > >>> > >>> From other parts of the discussion, I'm sold on pausing punctuation. > >>> > >>> > >>>> The question of whether to continue fetching for standby tasks is > maybe > >> a > >>>> bit more debatable, as it would certainly be > >>>> nice to find your clients all caught up when you go to resume the > >> instance > >>>> again, but I would still strongly suggest > >>>> pausing these as well. To use a similar example, imagine if you paused > >> an > >>>> app because it was about to run out of > >>>> disk. If the standbys kept processing and filled up the remaining > space, > >>>> you'd probably feel a bit betrayed by this API. > >>>> > >>>> WDYT? > >>>> > >>> > >>> At the minute, my moderately held position is that standby tasks ought > to > >>> continue reading and remain caught up. If standby tasks would run out > of > >>> space, there are probably bigger problems. > >>> > >>> If later it is desirable to manage punctuation or standby tasks, then > it > >>> should be easy for future folks to modify things. > >>> > >>> Overall, I'd frame this KIP as "pause processing resulting in outputs". > >>> > >>> Cheers, > >>> > >>> Jim > >>> > >>> > >>> > >>>> On Mon, May 9, 2022 at 10:33 AM Guozhang Wang <wangg...@gmail.com> > >> wrote: > >>>> > >>>>> I think for named topology we can leave the scope of this KIP as "all > >> or > >>>>> nothing", i.e. when you pause an instance you pause all of its > >>>> topologies. > >>>>> I raised this question in my previous email just trying to clarify if > >>>> this > >>>>> is what you have in mind. We can leave the question of finer > >> controlled > >>>>> pausing behavior for later when we have named topology being exposed > >> via > >>>>> another KIP. > >>>>> > >>>>> > >>>>> Guozhang > >>>>> > >>>>> On Mon, May 9, 2022 at 7:50 AM John Roesler <vvcep...@apache.org> > >> wrote: > >>>>> > >>>>>> Hi Jim, > >>>>>> > >>>>>> Thanks for the replies. This all sounds good to me. Just two further > >>>>>> comments: > >>>>>> > >>>>>> 3. It seems like you should aim for the simplest semantics. If the > >>>> intent > >>>>>> is to “pause” the instance, then you’d better pause the whole > >> instance. > >>>>> If > >>>>>> you leave punctuations and standbys running, I expect we’d see bug > >>>>> reports > >>>>>> come in that the instance isn’t really paused. > >>>>>> > >>>>>> 5. Since you won the race to write a KIP, I don’t think it makes too > >>>> much > >>>>>> sense to worry too much about modular topologies. When they propose > >>>> their > >>>>>> KIP, they will have to specify a lot of state management behavior, > >> and > >>>>>> pause/resume will have to be part of it. If they have some concern > >>>> about > >>>>>> your KIP, they’ll chime in. It doesn’t make sense for you to try and > >>>>> guess > >>>>>> what that proposal will look like. > >>>>>> > >>>>>> To be honest, you’re proposing a KafkaStreams runtime-level > >>>> pause/resume > >>>>>> function, not a topology-level one anyway, so it seems pretty clear > >>>> that > >>>>> it > >>>>>> would pause the whole runtime (of a single instance) regardless of > >> any > >>>>>> modular topologies. If the intent is to pause individual topologies > >> in > >>>>> the > >>>>>> future, you’d need a different API anyway. > >>>>>> > >>>>>> Thanks! > >>>>>> -John > >>>>>> > >>>>>> On Mon, May 9, 2022, at 08:10, Jim Hughes wrote: > >>>>>>> Hi John, > >>>>>>> > >>>>>>> Long emails are great; responding inline! > >>>>>>> > >>>>>>> On Sat, May 7, 2022 at 4:54 PM John Roesler <vvcep...@apache.org> > >>>>> wrote: > >>>>>>> > >>>>>>>> Thanks for the KIP, Jim! > >>>>>>>> > >>>>>>>> This conversation seems to highlight that the KIP needs to > >> specify > >>>>>>>> some of its behavior as well as its APIs, where the behavior is > >>>>>>>> observable and significant to users. > >>>>>>>> > >>>>>>>> For example: > >>>>>>>> > >>>>>>>> 1. Do you plan to have a guarantee that immediately after > >>>>>>>> calling KafkaStreams.pause(), users should observe that the > >> instance > >>>>>>>> stops processing new records? Or should they expect that the > >> threads > >>>>>>>> will continue to process some records and pause asynchronously > >>>>>>>> (you already answered this in the thread earlier)? > >>>>>>>> > >>>>>>> > >>>>>>> I'm happy to build up to a guarantee of sorts. My current idea is > >>>> that > >>>>>>> pause() does not do anything "exceptional" to get control back > >> from a > >>>>>>> running topology. A currently running topology would get to > >> complete > >>>>> its > >>>>>>> loop. > >>>>>>> > >>>>>>> Separately, I'm still piecing together how commits work. By some > >>>>>>> mechanism, after a pause, I do agree that the topology needs to > >>>> commit > >>>>>> its > >>>>>>> work in some manner. > >>>>>>> > >>>>>>> > >>>>>>>> 2. Will the threads continue to poll new records until they > >>>> naturally > >>>>>> fill > >>>>>>>> up the task buffers, or will they immediately pause their > >> Consumers > >>>>>>>> as well? > >>>>>>>> > >>>>>>> > >>>>>>> Presently, I'm suggesting that consumers would fill up their > >> buffers. > >>>>>>> > >>>>>>> > >>>>>>>> 3. Will threads continue to call (system time) punctuators, or > >> would > >>>>>>>> punctuations also be paused? > >>>>>>>> > >>>>>>> > >>>>>>> In my first pass at thinking through this, I left the punctuators > >>>>>> running. > >>>>>>> To be honest, I'm not sure what they do, so my approach is either > >>>> lucky > >>>>>> and > >>>>>>> correct or it could be Very Clearly Wrong.;) > >>>>>>> > >>>>>>> > >>>>>>>> I realize that some of those questions simply may not have > >> occurred > >>>> to > >>>>>>>> you, so this is not a criticism for leaving them off; I'm just > >>>>> pointing > >>>>>> out > >>>>>>>> that although we don't tend to mention implementation details in > >>>> KIPs, > >>>>>>>> we also can't be too high level, since there are a lot of > >>>> operational > >>>>>>>> details that users rely on to achieve various behaviors in > >> Streams. > >>>>>>>> > >>>>>>> > >>>>>>> Ayup, I will add some details as we iron out the guarantees, > >>>>>> implementation > >>>>>>> details that are at the API level. This one is tough since > >> internal > >>>>>>> features like NamedTopologies are part of the discussion. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> A couple more comments: > >>>>>>>> > >>>>>>>> 4. +1 to what Guozhang said. It seems like we should we also do a > >>>>> commit > >>>>>>>> before entering the paused state. That way, any open transactions > >>>>> would > >>>>>>>> be closed and not have to worry about timing out. Even under > >> ALOS, > >>>> it > >>>>>>>> seems best to go ahead and complete the processing of in-flight > >>>>> records > >>>>>>>> by committing. That way, if anything happens to die while it's > >>>> paused, > >>>>>>>> existing > >>>>>>>> work won't have to be repeated. Plus, if there are any processors > >>>> with > >>>>>> side > >>>>>>>> effects, users won't have to tolerate weird edge cases where a > >> pause > >>>>>> occurs > >>>>>>>> after a processor sees a record, but before the result is sent to > >>>> its > >>>>>>>> outputs. > >>>>>>>> > >>>>>>>> 5. I noticed that you proposed not to add a PAUSED state, but I > >>>> didn't > >>>>>>>> follow > >>>>>>>> the rationale. Adding a state seems beneficial for a number of > >>>>> reasons: > >>>>>>>> StreamThreads already use the thread state to determine whether > >> to > >>>>>> process > >>>>>>>> or not, so avoiding a new State would just mean adding a separate > >>>> flag > >>>>>> to > >>>>>>>> track > >>>>>>>> and then checking your new flag in addition to the State in the > >>>>> thread. > >>>>>>>> Also, > >>>>>>>> operating Streams applications is a non-trivial task, and users > >> rely > >>>>> on > >>>>>>>> the State > >>>>>>>> (and transitions) to understand Streams's behavior. Adding a > >> PAUSED > >>>>>> state > >>>>>>>> is an elegant way to communicate to operators what is happening > >> with > >>>>> the > >>>>>>>> application. Note that the person digging though logs and > >> metrics, > >>>>>> trying > >>>>>>>> to understand why the application isn't doing anything is > >> probably > >>>> not > >>>>>>>> going > >>>>>>>> to be the same person who is calling pause() and resume(). Also, > >> if > >>>>> you > >>>>>> add > >>>>>>>> a state, you don't need `isPaused()`. > >>>>>>>> > >>>>>>>> 5b. If you buy the arguments to go ahead and commit as well as > >> the > >>>>>>>> argument to add a State, then I'd also suggest to follow the > >>>> existing > >>>>>>>> patterns > >>>>>>>> for the shutdown states by also adding PAUSING. That > >>>>>>>> way, you'll also expose a way to understand that Streams received > >>>> the > >>>>>>>> signal > >>>>>>>> to pause, and that it's still processing and committing some > >> records > >>>>> in > >>>>>>>> preparation to enter a PAUSED state. I'm not sure if a RESUMING > >>>> state > >>>>>> would > >>>>>>>> also make sense. > >>>>>>>> > >>>>>>> > >>>>>>> I hit a tricky bit when thinking through having a PAUSED > >> state... If > >>>>> one > >>>>>>> is using Named Topologies, and some of them are paused, what > >> state is > >>>>> the > >>>>>>> Streams instance in? If we can agree on that, things may become > >>>>>> clear.... > >>>>>>> I can see two quick ideas: > >>>>>>> > >>>>>>> 1. The state is RUNNING and NamedTopologies have some other way > >> to > >>>>>>> indicate state. > >>>>>>> > >>>>>>> 2. The state is something messy like PARTIALLY_PAUSED to reflect > >>>> that > >>>>>> the > >>>>>>> instance has something interesting going on. > >>>>>>> > >>>>>>> When I poked at things initially, I did try out having different > >>>>> states, > >>>>>>> and I readily agree that a PAUSING state may make sense. > >> (Especially > >>>>> if > >>>>>>> there's a need to run commits before transitioning all the way to > >>>>>> PAUSED.) > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> And that's all I have to say about that. I hope you don't find my > >>>>>>>> long message offputting. I'm fundamentally in favor of your KIP, > >>>>>>>> and I think with a little more explanation in the KIP, and a few > >>>>>>>> small tweaks to the proposal, we'll be able to provide good > >>>>>>>> ergonomics to our users. > >>>>>>>> > >>>>>>> > >>>>>>> Thanks! > >>>>>>> > >>>>>>> Jim > >>>>>>> > >>>>>>> > >>>>>>>> Thanks, > >>>>>>>> -John > >>>>>>>> > >>>>>>>> On Sat, May 7, 2022, at 00:06, Guozhang Wang wrote: > >>>>>>>>> I'm in favor of the "just pausing the instance itself“ option > >> as > >>>>>> well. As > >>>>>>>>> for EOS, the point is that when the processing is paused, we > >> would > >>>>> not > >>>>>>>>> trigger any `producer.send` during the time, and the > >> transaction > >>>>>> timeout > >>>>>>>> is > >>>>>>>>> sort of relying on that behavior, so my point was that it's > >>>> probably > >>>>>>>> better > >>>>>>>>> to also commit the processing before we pause it. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Guozhang > >>>>>>>>> > >>>>>>>>> On Fri, May 6, 2022 at 6:12 PM Jim Hughes > >>>>>> <jhug...@confluent.io.invalid> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Matthias, > >>>>>>>>>> > >>>>>>>>>> Since the only thing which will be paused is processing the > >>>>>> topology, I > >>>>>>>>>> think we can let commits happen naturally. > >>>>>>>>>> > >>>>>>>>>> Good point about getting the paused state to new members; it > >> is > >>>>>> seeming > >>>>>>>>>> like the "building block" approach is a good one to keep > >> things > >>>>>> simple > >>>>>>>> at > >>>>>>>>>> first. > >>>>>>>>>> > >>>>>>>>>> Cheers, > >>>>>>>>>> > >>>>>>>>>> Jim > >>>>>>>>>> > >>>>>>>>>> On Fri, May 6, 2022 at 8:31 PM Matthias J. Sax < > >> mj...@apache.org > >>>>> > >>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> I think it's tricky to propagate a pauseAll() via the > >> rebalance > >>>>>>>>>>> protocol. New members joining the group would need to get > >>>> paused, > >>>>>> too? > >>>>>>>>>>> Could there be weird race conditions with overlapping > >>>> pauseAll() > >>>>>> and > >>>>>>>>>>> resumeAll() calls on different instanced while there could > >> be a > >>>>>>>> errors / > >>>>>>>>>>> network partitions or similar? > >>>>>>>>>>> > >>>>>>>>>>> I would argue that similar to IQ, we provide the basic > >> building > >>>>>>>> blocks, > >>>>>>>>>>> and leave it the user users to implement cross instance > >>>>> management > >>>>>>>> for a > >>>>>>>>>>> pauseAll() scenario. -- Also, if there is really demand, we > >> can > >>>>>> always > >>>>>>>>>>> add pauseAll()/resumeAll() as follow up work. > >>>>>>>>>>> > >>>>>>>>>>> About named typologies: I agree to Jim to not include them > >> in > >>>>> this > >>>>>> KIP > >>>>>>>>>>> as they are not a public feature yet. If we make named > >>>> typologies > >>>>>>>>>>> public, the corresponding KIP should extend the pause/resume > >>>>>> feature > >>>>>>>>>>> (ie, APIs) accordingly. Of course, the code can (and should) > >>>>>> already > >>>>>>>> be > >>>>>>>>>>> setup to support it to be future proof. > >>>>>>>>>>> > >>>>>>>>>>> Good call out about commit and EOS -- to simplify it, I > >> think > >>>> it > >>>>>> might > >>>>>>>>>>> be good to commit also for the at-least-once case? > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> -Matthias > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On 5/6/22 1:05 PM, Jim Hughes wrote: > >>>>>>>>>>>> Hi Bill, > >>>>>>>>>>>> > >>>>>>>>>>>> Great questions; I'll do my best to reply inline: > >>>>>>>>>>>> > >>>>>>>>>>>> On Fri, May 6, 2022 at 3:21 PM Bill Bejeck < > >>>> bbej...@gmail.com> > >>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Jim, > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks for the KIP. I have a couple of meta-questions as > >>>>> well: > >>>>>>>>>>>>> > >>>>>>>>>>>>> 1) Regarding pausing only a subset of running instances, > >> I'm > >>>>>>>> thinking > >>>>>>>>>>> there > >>>>>>>>>>>>> may be a use case for pausing all of them. > >>>>>>>>>>>>> Would it make sense to also allow for pausing all > >>>>> instances > >>>>>> by > >>>>>>>>>>> adding a > >>>>>>>>>>>>> method `pauseAll()` or something similar? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Honestly, I'm indifferent on this point. Presently, I > >> think > >>>>>> what I > >>>>>>>>>> have > >>>>>>>>>>>> proposed is the minimal change to get the ability to pause > >>>> and > >>>>>>>> resume > >>>>>>>>>>>> processing. If adding a 'pauseAll()' is required, I'd be > >>>> happy > >>>>>> to > >>>>>>>> do > >>>>>>>>>>> that! > >>>>>>>>>>>> > >>>>>>>>>>>> From Guozhang's email, it sounds like this would require > >>>> using > >>>>>> the > >>>>>>>>>>>> rebalance protocol to trigger the coordination. Would > >> there > >>>> be > >>>>>>>> enough > >>>>>>>>>>> room > >>>>>>>>>>>> in that approach to indicate that a named topology is to > >> be > >>>>>> paused > >>>>>>>>>> across > >>>>>>>>>>>> all nodes? > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> 2) Would pausing affect standby tasks? For example, > >> imagine > >>>>>> there > >>>>>>>>>> are 3 > >>>>>>>>>>>>> instances A, B, and C. > >>>>>>>>>>>>> A user elects to pause instance C only but it hosts > >> the > >>>>>> standby > >>>>>>>>>>> tasks > >>>>>>>>>>>>> for A. > >>>>>>>>>>>>> Would the standby tasks on the paused application > >>>> continue > >>>>>> to > >>>>>>>> read > >>>>>>>>>>> from > >>>>>>>>>>>>> the changelog topic? > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> Yes, standby tasks would continue reading from the > >> changelog > >>>>>> topic. > >>>>>>>>>> All > >>>>>>>>>>>> consumers would continue reading to avoid getting dropped > >>>> from > >>>>>> their > >>>>>>>>>>>> consumer groups. > >>>>>>>>>>>> > >>>>>>>>>>>> Cheers, > >>>>>>>>>>>> > >>>>>>>>>>>> Jim > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> Thanks! > >>>>>>>>>>>>> Bill > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Fri, May 6, 2022 at 2:44 PM Jim Hughes > >>>>>>>>>> <jhug...@confluent.io.invalid > >>>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi Guozhang, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks for the feedback; responses inline below: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Fri, May 6, 2022 at 1:09 PM Guozhang Wang < > >>>>>> wangg...@gmail.com> > >>>>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hello Jim, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks for the proposed KIP. I have some meta questions > >>>>> about > >>>>>> it: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 1) Would an instance always pause/resume all of its > >>>> current > >>>>>> owned > >>>>>>>>>>>>>>> topologies (i.e. the named topologies), or are there > >> any > >>>>>>>> scenarios > >>>>>>>>>>>>> where > >>>>>>>>>>>>>> we > >>>>>>>>>>>>>>> only want to pause/resume a subset of them? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> An instance may wish to pause some of its named > >> topologies. > >>>>> I > >>>>>> was > >>>>>>>>>>> unsure > >>>>>>>>>>>>>> what to say about named topologies in the KIP since they > >>>> seem > >>>>>> to > >>>>>>>> be > >>>>>>>>>> an > >>>>>>>>>>>>>> internal detail at the moment. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I intend to add to KafkaStreamsNamedTopologyWrapper > >> methods > >>>>>> like: > >>>>>>>>>>>>>> public void pauseNamedTopology(final String > >>>>>> topologyToPause) > >>>>>>>>>>>>>> public boolean isNamedTopologyPaused(final String > >>>>>> topology) > >>>>>>>>>>>>>> public void resumeNamedTopology(final String > >>>>>>>> topologyToResume) > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 2) From a user's perspective, do we want to always > >> issue a > >>>>>>>>>>>>> `pause/resume` > >>>>>>>>>>>>>>> to all the instances or not? For example, we can define > >>>> the > >>>>>>>>>> semantics > >>>>>>>>>>>>> of > >>>>>>>>>>>>>>> the function as "you only need to call this function on > >>>> any > >>>>> of > >>>>>>>> the > >>>>>>>>>>>>>>> application's instances, and all instances would then > >>>> pause > >>>>>> (via > >>>>>>>> the > >>>>>>>>>>>>>>> rebalance error codes)", or as "you would call this > >>>> function > >>>>>> for > >>>>>>>> all > >>>>>>>>>>>>> the > >>>>>>>>>>>>>>> instances of an application". Which one are you > >> referring > >>>>> to? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> My initial intent is that one would call this function > >> on > >>>> any > >>>>>>>>>> instances > >>>>>>>>>>>>> of > >>>>>>>>>>>>>> the application that one wishes to pause. This should > >>>> allow > >>>>>> more > >>>>>>>>>>> control > >>>>>>>>>>>>>> (in case one wanted to pause a portion of the > >> instances). > >>>> On > >>>>>> the > >>>>>>>>>> other > >>>>>>>>>>>>>> hand, this approach would put more work on the > >> implementer > >>>> to > >>>>>>>>>>> coordinate > >>>>>>>>>>>>>> calling pause or resume across instances. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> If the other option is more suitable, happy to do that > >>>>> instead. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> 3) With EOS, there's a transaction timeout which would > >>>>>> determine > >>>>>>>> how > >>>>>>>>>>>>>> long a > >>>>>>>>>>>>>>> transaction can stay idle before it's force-aborted on > >> the > >>>>>> broker > >>>>>>>>>>>>> side. I > >>>>>>>>>>>>>>> think when a pause is issued, that means we'd need to > >>>>>> immediately > >>>>>>>>>>>>> commit > >>>>>>>>>>>>>>> the current transaction for EOS since we do not know > >> how > >>>>> long > >>>>>> we > >>>>>>>>>> could > >>>>>>>>>>>>>>> pause for. Is that right? If yes could you please > >> clarify > >>>>>> that in > >>>>>>>>>> the > >>>>>>>>>>>>> doc > >>>>>>>>>>>>>>> as well. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Good point. My intent is for pause() to wait for the > >> next > >>>>>>>> iteration > >>>>>>>>>>>>>> through `runOnce()` and then only skip over the > >> processing > >>>>> for > >>>>>>>> paused > >>>>>>>>>>>>> tasks > >>>>>>>>>>>>>> in `taskManager.process(numIterations, time)`. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Do commits live inside that call or do they live > >>>>>> across/outside of > >>>>>>>>>> it? > >>>>>>>>>>>>> In > >>>>>>>>>>>>>> the former case, I think there shouldn't be any issues > >> with > >>>>>> EOS. > >>>>>>>>>>>>>> Otherwise, we may need to work through some details to > >> get > >>>>> EOS > >>>>>>>> right. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Once we figure that out, I can update the KIP. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Jim > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Guozhang > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Wed, May 4, 2022 at 10:51 AM Jim Hughes > >>>>>>>>>>>>> <jhug...@confluent.io.invalid > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I have written up a KIP for adding the ability to > >> pause > >>>> and > >>>>>>>> resume > >>>>>>>>>>>>> the > >>>>>>>>>>>>>>>> processing of a topology in AK Streams. The KIP is > >> here: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>> > >>>>> > >>>> > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832 > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks in advance for your feedback! > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Cheers, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Jim > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> -- > >>>>>>>>>>>>>>> -- Guozhang > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> -- > >>>>>>>>> -- Guozhang > >>>>>>>> > >>>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> -- Guozhang > >>>>> > >>>> > >> > > >