Hi Jim,

After reading the comments on the KIP, I agree that it makes sense to pause
all activities and any changes can be made later on.

Thanks,
Bill

On Tue, May 10, 2022 at 4:03 AM Bruno Cadonna <cado...@apache.org> wrote:

> Hi Jim,
>
> Thanks for the KIP!
>
> I am fine with the KIP in general.
>
> However, I am with Sophie and John to also pause the standbys for the
> reasons they brought up. Is there a specific reason you want to keep
> standbys going? It feels like premature optimization to me. We can still
> add keeping standby running in a follow up if needed.
>
> Best,
> Bruno
>
> On 10.05.22 05:15, Sophie Blee-Goldman wrote:
> > Thanks Jim, just one note/question on the standby tasks:
> >
> > At the minute, my moderately held position is that standby tasks ought to
> >> continue reading and remain caught up.  If standby tasks would run out
> of
> >> space, there are probably bigger problems.
> >
> >
> > For a single node application, or when the #pause API is invoked on all
> > instances,
> > then there won't be any further active processing and thus nothing to
> keep
> > up with,
> > right? So for that case, it's just a matter of whether any standbys that
> > are lagging
> > will have the chance to catch up to the (paused) active task state before
> > they stop
> > as well, in which case having them continue feels fine to me. However
> this
> > is a
> > relatively trivial benefit and I would only consider it as a deciding
> > factor when all
> > things are equal otherwise.
> >
> > My concern is the more interesting case: when this feature is used to
> pause
> > only
> > one nodes, or some subset of the overall application. In this case, yes,
> > the standby
> > tasks will indeed fall out of sync. But the only reason I can imagine
> > someone using
> > the pause feature in such a way is because there is something going
> wrong,
> > or about
> > to go wrong, on that particular node. For example as mentioned above, if
> > the user
> > wants to cut down on costs without stopping everything, or if the node is
> > about to
> > run out of disk or needs to be debugged or so on. And in this case,
> > continuing to
> > process the standby tasks while other instances continue to run would
> > pretty much
> > defeat the purpose of pausing it entirely, and might have unpleasant
> > consequences
> > for the unsuspecting developer.
> >
> > All that said, I don't want to block this KIP so if you have strong
> > feelings about the
> > standby behavior I'm happy to back down. I'm only pushing back now
> because
> > it
> > felt like there wasn't any particular motivation for the standbys to
> > continue processing
> > or not, and I figured I'd try to fill in this gap with my thoughts on the
> > matter :)
> > Either way we should just make sure that this behavior is documented
> > clearly,
> > since it may be surprising if we decide to only pause active processing
> > (another option
> > is to rename the method something like #pauseProcessing or
> > #pauseActiveProcessing
> > so that it's hard to miss).
> >
> > Thanks! Sorry for the lengthy response, but hopefully we won't need to
> > debate this any
> > further. Beyond this I'm satisfied with the latest proposal
> >
> > On Mon, May 9, 2022 at 5:16 PM John Roesler <vvcep...@apache.org> wrote:
> >
> >> Thanks for the updates, Jim!
> >>
> >> After this discussion and your updates, this KIP looks good to me.
> >>
> >> Thanks,
> >> John
> >>
> >> On Mon, May 9, 2022, at 17:52, Jim Hughes wrote:
> >>> Hi Sophie, all,
> >>>
> >>> I've updated the KIP with feedback from the discussion so far:
> >>>
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
> >>>
> >>> As a terse summary of my current position:
> >>> Pausing will only stop processing and punctuation (respecting modular
> >>> topologies).
> >>> Paused topologies will still a) consume from input topics, b) call the
> >>> usual commit pathways (commits will happen basically as they would
> have),
> >>> and c) standBy tasks will still be processed.
> >>>
> >>> Shout if the KIP or those details still need some TLC.  Responding to
> >>> Sophie inline below.
> >>>
> >>>
> >>> On Mon, May 9, 2022 at 6:06 PM Sophie Blee-Goldman
> >>> <sop...@confluent.io.invalid> wrote:
> >>>
> >>>> Don't worry, I'm going to be adding the APIs for topology-level
> pausing
> >> as
> >>>> part of the modular topologies KIP,
> >>>> so we don't need to worry about that for now. That said, I don't think
> >> we
> >>>> should brush it off entirely and design
> >>>> this feature in a way that's going to be incompatible or hugely raise
> >> the
> >>>> LOE on bringing the (mostly already
> >>>> implemented) modular topologies feature into the public API, just
> >>>> because it "won the race to write a KIP" :)
> >>>>
> >>>
> >>> Yes, I'm hoping that this is all compatible with modular topologies.  I
> >>> haven't seen anything so far which seems to be a problem; this KIP is
> >> just
> >>> in a weird state to discuss details of acting on modular topologies.:)
> >>>
> >>>
> >>>> I may be biased (ok, I definitely am), but I'm not in favor of adding
> >> this
> >>>> as a state regardless of the modular topologies.
> >>>> First of all any change to the KafkaStreams state machine is a
> breaking
> >>>> change, no? So we would have to wait until
> >>>> the next major release which seems like an unnecessary thing to block
> >> on.
> >>>> (Whether to add this as a state to the
> >>>> StreamThread's FSM is an implementation detail).
> >>>>
> >>>
> >>> +1.  I am sold on skipping out on new states.  I had that as a rejected
> >>> alternative in the KIP and have added a few more words to that bit.
> >>>
> >>>
> >>>> Also, the semantics of using an `isPaused` method to distinguish a
> >> paused
> >>>> instance (or topology) make more sense
> >>>> to me -- this is a user-specified status, whereas the KafkaStreams
> >> state is
> >>>> intended to relay the status of the system
> >>>> itself. For example, if we are going to continue to poll during pause,
> >> then
> >>>> shouldn't the client transition to REBALANCING?
> >>>> I believe it makes sense to still allow distinguishing these states
> >> while a
> >>>> client is paused, whereas making PAUSED its
> >>>> own state means you can't tell when the client is rebalancing vs
> >> running,
> >>>> or whether it is paused or dead: presumably
> >>>> the NOT_RUNNING/ERROR state would trump the PAUSED state, which means
> >> you
> >>>> would not be able to rely on
> >>>> checking the state to see if you had called PAUSED on that instance.
> >>>> Obviously you can work around this by just
> >>>> maintaining a flag in the usercode, but all this feels very unnatural
> >> to me
> >>>> vs just checking the `#isPaused` API.
> >>>>
> >>>> On that note, I had one question -- at what point would the
> `#isPaused`
> >>>> check return true? Would it do so immediately
> >>>> after pausing the instance, or only once it has finished committing
> >> offsets
> >>>> and stopped returning records?
> >>>>
> >>>
> >>> Immediately, `#isPaused` tells you about metadata.
> >>>
> >>>
> >>>> Finally, on the note of punctuators I think it would make most sense
> to
> >>>> either pause these as well or else add this an
> >>>> an explicit option for the user. If this feature is used to, for
> >> example,
> >>>> help save on processing costs while an app is
> >>>> not in use, then it would probably be surprising and perhaps alarming
> to
> >>>> see certain kinds of processing still continue.
> >>>>
> >>>
> >>>  From other parts of the discussion, I'm sold on pausing punctuation.
> >>>
> >>>
> >>>> The question of whether to continue fetching for standby tasks is
> maybe
> >> a
> >>>> bit more debatable, as it would certainly be
> >>>> nice to find your clients all caught up when you go to resume the
> >> instance
> >>>> again, but I would still strongly suggest
> >>>> pausing these as well. To use a similar example, imagine if you paused
> >> an
> >>>> app because it was about to run out of
> >>>> disk. If the standbys kept processing and filled up the remaining
> space,
> >>>> you'd probably feel a bit betrayed by this API.
> >>>>
> >>>> WDYT?
> >>>>
> >>>
> >>> At the minute, my moderately held position is that standby tasks ought
> to
> >>> continue reading and remain caught up.  If standby tasks would run out
> of
> >>> space, there are probably bigger problems.
> >>>
> >>> If later it is desirable to manage punctuation or standby tasks, then
> it
> >>> should be easy for future folks to modify things.
> >>>
> >>> Overall, I'd frame this KIP as "pause processing resulting in outputs".
> >>>
> >>> Cheers,
> >>>
> >>> Jim
> >>>
> >>>
> >>>
> >>>> On Mon, May 9, 2022 at 10:33 AM Guozhang Wang <wangg...@gmail.com>
> >> wrote:
> >>>>
> >>>>> I think for named topology we can leave the scope of this KIP as "all
> >> or
> >>>>> nothing", i.e. when you pause an instance you pause all of its
> >>>> topologies.
> >>>>> I raised this question in my previous email just trying to clarify if
> >>>> this
> >>>>> is what you have in mind. We can leave the question of finer
> >> controlled
> >>>>> pausing behavior for later when we have named topology being exposed
> >> via
> >>>>> another KIP.
> >>>>>
> >>>>>
> >>>>> Guozhang
> >>>>>
> >>>>> On Mon, May 9, 2022 at 7:50 AM John Roesler <vvcep...@apache.org>
> >> wrote:
> >>>>>
> >>>>>> Hi Jim,
> >>>>>>
> >>>>>> Thanks for the replies. This all sounds good to me. Just two further
> >>>>>> comments:
> >>>>>>
> >>>>>> 3. It seems like you should aim for the simplest semantics. If the
> >>>> intent
> >>>>>> is to “pause” the instance, then you’d better pause the whole
> >> instance.
> >>>>> If
> >>>>>> you leave punctuations and standbys running, I expect we’d see bug
> >>>>> reports
> >>>>>> come in that the instance isn’t really paused.
> >>>>>>
> >>>>>> 5. Since you won the race to write a KIP, I don’t think it makes too
> >>>> much
> >>>>>> sense to worry too much about modular topologies. When they propose
> >>>> their
> >>>>>> KIP, they will have to specify a lot of state management behavior,
> >> and
> >>>>>> pause/resume will have to be part of it. If they have some concern
> >>>> about
> >>>>>> your KIP, they’ll chime in. It doesn’t make sense for you to try and
> >>>>> guess
> >>>>>> what that proposal will look like.
> >>>>>>
> >>>>>> To be honest, you’re proposing a KafkaStreams runtime-level
> >>>> pause/resume
> >>>>>> function, not a topology-level one anyway, so it seems pretty clear
> >>>> that
> >>>>> it
> >>>>>> would pause the whole runtime (of a single instance) regardless of
> >> any
> >>>>>> modular topologies. If the intent is to pause individual topologies
> >> in
> >>>>> the
> >>>>>> future, you’d need a different API anyway.
> >>>>>>
> >>>>>> Thanks!
> >>>>>> -John
> >>>>>>
> >>>>>> On Mon, May 9, 2022, at 08:10, Jim Hughes wrote:
> >>>>>>> Hi John,
> >>>>>>>
> >>>>>>> Long emails are great; responding inline!
> >>>>>>>
> >>>>>>> On Sat, May 7, 2022 at 4:54 PM John Roesler <vvcep...@apache.org>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> Thanks for the KIP, Jim!
> >>>>>>>>
> >>>>>>>> This conversation seems to highlight that the KIP needs to
> >> specify
> >>>>>>>> some of its behavior as well as its APIs, where the behavior is
> >>>>>>>> observable and significant to users.
> >>>>>>>>
> >>>>>>>> For example:
> >>>>>>>>
> >>>>>>>> 1. Do you plan to have a guarantee that immediately after
> >>>>>>>> calling KafkaStreams.pause(), users should observe that the
> >> instance
> >>>>>>>> stops processing new records? Or should they expect that the
> >> threads
> >>>>>>>> will continue to process some records and pause asynchronously
> >>>>>>>> (you already answered this in the thread earlier)?
> >>>>>>>>
> >>>>>>>
> >>>>>>> I'm happy to build up to a guarantee of sorts.  My current idea is
> >>>> that
> >>>>>>> pause() does not do anything "exceptional" to get control back
> >> from a
> >>>>>>> running topology.  A currently running topology would get to
> >> complete
> >>>>> its
> >>>>>>> loop.
> >>>>>>>
> >>>>>>> Separately, I'm still piecing together how commits work.  By some
> >>>>>>> mechanism, after a pause, I do agree that the topology needs to
> >>>> commit
> >>>>>> its
> >>>>>>> work in some manner.
> >>>>>>>
> >>>>>>>
> >>>>>>>> 2. Will the threads continue to poll new records until they
> >>>> naturally
> >>>>>> fill
> >>>>>>>> up the task buffers, or will they immediately pause their
> >> Consumers
> >>>>>>>> as well?
> >>>>>>>>
> >>>>>>>
> >>>>>>> Presently, I'm suggesting that consumers would fill up their
> >> buffers.
> >>>>>>>
> >>>>>>>
> >>>>>>>> 3. Will threads continue to call (system time) punctuators, or
> >> would
> >>>>>>>> punctuations also be paused?
> >>>>>>>>
> >>>>>>>
> >>>>>>> In my first pass at thinking through this, I left the punctuators
> >>>>>> running.
> >>>>>>> To be honest, I'm not sure what they do, so my approach is either
> >>>> lucky
> >>>>>> and
> >>>>>>> correct or it could be Very Clearly Wrong.;)
> >>>>>>>
> >>>>>>>
> >>>>>>>> I realize that some of those questions simply may not have
> >> occurred
> >>>> to
> >>>>>>>> you, so this is not a criticism for leaving them off; I'm just
> >>>>> pointing
> >>>>>> out
> >>>>>>>> that although we don't tend to mention implementation details in
> >>>> KIPs,
> >>>>>>>> we also can't be too high level, since there are a lot of
> >>>> operational
> >>>>>>>> details that users rely on to achieve various behaviors in
> >> Streams.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Ayup, I will add some details as we iron out the guarantees,
> >>>>>> implementation
> >>>>>>> details that are at the API level.  This one is tough since
> >> internal
> >>>>>>> features like NamedTopologies are part of the discussion.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> A couple more comments:
> >>>>>>>>
> >>>>>>>> 4. +1 to what Guozhang said. It seems like we should we also do a
> >>>>> commit
> >>>>>>>> before entering the paused state. That way, any open transactions
> >>>>> would
> >>>>>>>> be closed and not have to worry about timing out. Even under
> >> ALOS,
> >>>> it
> >>>>>>>> seems best to go ahead and complete the processing of in-flight
> >>>>> records
> >>>>>>>> by committing. That way, if anything happens to die while it's
> >>>> paused,
> >>>>>>>> existing
> >>>>>>>> work won't have to be repeated. Plus, if there are any processors
> >>>> with
> >>>>>> side
> >>>>>>>> effects, users won't have to tolerate weird edge cases where a
> >> pause
> >>>>>> occurs
> >>>>>>>> after a processor sees a record, but before the result is sent to
> >>>> its
> >>>>>>>> outputs.
> >>>>>>>>
> >>>>>>>> 5. I noticed that you proposed not to add a PAUSED state, but I
> >>>> didn't
> >>>>>>>> follow
> >>>>>>>> the rationale. Adding a state seems beneficial for a number of
> >>>>> reasons:
> >>>>>>>> StreamThreads already use the thread state to determine whether
> >> to
> >>>>>> process
> >>>>>>>> or not, so avoiding a new State would just mean adding a separate
> >>>> flag
> >>>>>> to
> >>>>>>>> track
> >>>>>>>> and then checking your new flag in addition to the State in the
> >>>>> thread.
> >>>>>>>> Also,
> >>>>>>>> operating Streams applications is a non-trivial task, and users
> >> rely
> >>>>> on
> >>>>>>>> the State
> >>>>>>>> (and transitions) to understand Streams's behavior. Adding a
> >> PAUSED
> >>>>>> state
> >>>>>>>> is an elegant way to communicate to operators what is happening
> >> with
> >>>>> the
> >>>>>>>> application. Note that the person digging though logs and
> >> metrics,
> >>>>>> trying
> >>>>>>>> to understand why the application isn't doing anything is
> >> probably
> >>>> not
> >>>>>>>> going
> >>>>>>>> to be the same person who is calling pause() and resume(). Also,
> >> if
> >>>>> you
> >>>>>> add
> >>>>>>>> a state, you don't need `isPaused()`.
> >>>>>>>>
> >>>>>>>> 5b. If you buy the arguments to go ahead and commit as well as
> >> the
> >>>>>>>> argument to add a State, then I'd also suggest to follow the
> >>>> existing
> >>>>>>>> patterns
> >>>>>>>> for the shutdown states by also adding PAUSING. That
> >>>>>>>> way, you'll also expose a way to understand that Streams received
> >>>> the
> >>>>>>>> signal
> >>>>>>>> to pause, and that it's still processing and committing some
> >> records
> >>>>> in
> >>>>>>>> preparation to enter a PAUSED state. I'm not sure if a RESUMING
> >>>> state
> >>>>>> would
> >>>>>>>> also make sense.
> >>>>>>>>
> >>>>>>>
> >>>>>>> I hit a tricky bit when thinking through having a PAUSED
> >> state...  If
> >>>>> one
> >>>>>>> is using Named Topologies, and some of them are paused, what
> >> state is
> >>>>> the
> >>>>>>> Streams instance in?  If we can agree on that, things may become
> >>>>>> clear....
> >>>>>>> I can see two quick ideas:
> >>>>>>>
> >>>>>>> 1.  The state is RUNNING and NamedTopologies have some other way
> >> to
> >>>>>>> indicate state.
> >>>>>>>
> >>>>>>> 2.  The state is something messy like PARTIALLY_PAUSED to reflect
> >>>> that
> >>>>>> the
> >>>>>>> instance has something interesting going on.
> >>>>>>>
> >>>>>>> When I poked at things initially, I did try out having different
> >>>>> states,
> >>>>>>> and I readily agree that a PAUSING state may make sense.
> >> (Especially
> >>>>> if
> >>>>>>> there's a need to run commits before transitioning all the way to
> >>>>>> PAUSED.)
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> And that's all I have to say about that. I hope you don't find my
> >>>>>>>> long message offputting. I'm fundamentally in favor of your KIP,
> >>>>>>>> and I think with a little more explanation in the KIP, and a few
> >>>>>>>> small tweaks to the proposal, we'll be able to provide good
> >>>>>>>> ergonomics to our users.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Jim
> >>>>>>>
> >>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -John
> >>>>>>>>
> >>>>>>>> On Sat, May 7, 2022, at 00:06, Guozhang Wang wrote:
> >>>>>>>>> I'm in favor of the "just pausing the instance itself“ option
> >> as
> >>>>>> well. As
> >>>>>>>>> for EOS, the point is that when the processing is paused, we
> >> would
> >>>>> not
> >>>>>>>>> trigger any `producer.send` during the time, and the
> >> transaction
> >>>>>> timeout
> >>>>>>>> is
> >>>>>>>>> sort of relying on that behavior, so my point was that it's
> >>>> probably
> >>>>>>>> better
> >>>>>>>>> to also commit the processing before we pause it.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Guozhang
> >>>>>>>>>
> >>>>>>>>> On Fri, May 6, 2022 at 6:12 PM Jim Hughes
> >>>>>> <jhug...@confluent.io.invalid>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Matthias,
> >>>>>>>>>>
> >>>>>>>>>> Since the only thing which will be paused is processing the
> >>>>>> topology, I
> >>>>>>>>>> think we can let commits happen naturally.
> >>>>>>>>>>
> >>>>>>>>>> Good point about getting the paused state to new members; it
> >> is
> >>>>>> seeming
> >>>>>>>>>> like the "building block" approach is a good one to keep
> >> things
> >>>>>> simple
> >>>>>>>> at
> >>>>>>>>>> first.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>>
> >>>>>>>>>> Jim
> >>>>>>>>>>
> >>>>>>>>>> On Fri, May 6, 2022 at 8:31 PM Matthias J. Sax <
> >> mj...@apache.org
> >>>>>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I think it's tricky to propagate a pauseAll() via the
> >> rebalance
> >>>>>>>>>>> protocol. New members joining the group would need to get
> >>>> paused,
> >>>>>> too?
> >>>>>>>>>>> Could there be weird race conditions with overlapping
> >>>> pauseAll()
> >>>>>> and
> >>>>>>>>>>> resumeAll() calls on different instanced while there could
> >> be a
> >>>>>>>> errors /
> >>>>>>>>>>> network partitions or similar?
> >>>>>>>>>>>
> >>>>>>>>>>> I would argue that similar to IQ, we provide the basic
> >> building
> >>>>>>>> blocks,
> >>>>>>>>>>> and leave it the user users to implement cross instance
> >>>>> management
> >>>>>>>> for a
> >>>>>>>>>>> pauseAll() scenario. -- Also, if there is really demand, we
> >> can
> >>>>>> always
> >>>>>>>>>>> add pauseAll()/resumeAll() as follow up work.
> >>>>>>>>>>>
> >>>>>>>>>>> About named typologies: I agree to Jim to not include them
> >> in
> >>>>> this
> >>>>>> KIP
> >>>>>>>>>>> as they are not a public feature yet. If we make named
> >>>> typologies
> >>>>>>>>>>> public, the corresponding KIP should extend the pause/resume
> >>>>>> feature
> >>>>>>>>>>> (ie, APIs) accordingly. Of course, the code can (and should)
> >>>>>> already
> >>>>>>>> be
> >>>>>>>>>>> setup to support it to be future proof.
> >>>>>>>>>>>
> >>>>>>>>>>> Good call out about commit and EOS -- to simplify it, I
> >> think
> >>>> it
> >>>>>> might
> >>>>>>>>>>> be good to commit also for the at-least-once case?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> -Matthias
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 5/6/22 1:05 PM, Jim Hughes wrote:
> >>>>>>>>>>>> Hi Bill,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Great questions; I'll do my best to reply inline:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, May 6, 2022 at 3:21 PM Bill Bejeck <
> >>>> bbej...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Jim,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the KIP.  I have a couple of meta-questions as
> >>>>> well:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1) Regarding pausing only a subset of running instances,
> >> I'm
> >>>>>>>> thinking
> >>>>>>>>>>> there
> >>>>>>>>>>>>> may be a use case for pausing all of them.
> >>>>>>>>>>>>>      Would it make sense to also allow for pausing all
> >>>>> instances
> >>>>>> by
> >>>>>>>>>>> adding a
> >>>>>>>>>>>>> method `pauseAll()` or something similar?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Honestly, I'm indifferent on this point.  Presently, I
> >> think
> >>>>>> what I
> >>>>>>>>>> have
> >>>>>>>>>>>> proposed is the minimal change to get the ability to pause
> >>>> and
> >>>>>>>> resume
> >>>>>>>>>>>> processing.  If adding a 'pauseAll()' is required, I'd be
> >>>> happy
> >>>>>> to
> >>>>>>>> do
> >>>>>>>>>>> that!
> >>>>>>>>>>>>
> >>>>>>>>>>>>   From Guozhang's email, it sounds like this would require
> >>>> using
> >>>>>> the
> >>>>>>>>>>>> rebalance protocol to trigger the coordination.  Would
> >> there
> >>>> be
> >>>>>>>> enough
> >>>>>>>>>>> room
> >>>>>>>>>>>> in that approach to indicate that a named topology is to
> >> be
> >>>>>> paused
> >>>>>>>>>> across
> >>>>>>>>>>>> all nodes?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> 2) Would pausing affect standby tasks?  For example,
> >> imagine
> >>>>>> there
> >>>>>>>>>> are 3
> >>>>>>>>>>>>> instances A, B, and C.
> >>>>>>>>>>>>>      A user elects to pause instance C only but it hosts
> >> the
> >>>>>> standby
> >>>>>>>>>>> tasks
> >>>>>>>>>>>>> for A.
> >>>>>>>>>>>>>      Would the standby tasks on the paused application
> >>>> continue
> >>>>>> to
> >>>>>>>> read
> >>>>>>>>>>> from
> >>>>>>>>>>>>> the changelog topic?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, standby tasks would continue reading from the
> >> changelog
> >>>>>> topic.
> >>>>>>>>>> All
> >>>>>>>>>>>> consumers would continue reading to avoid getting dropped
> >>>> from
> >>>>>> their
> >>>>>>>>>>>> consumer groups.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Jim
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>> Bill
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, May 6, 2022 at 2:44 PM Jim Hughes
> >>>>>>>>>> <jhug...@confluent.io.invalid
> >>>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Guozhang,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the feedback; responses inline below:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, May 6, 2022 at 1:09 PM Guozhang Wang <
> >>>>>> wangg...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hello Jim,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for the proposed KIP. I have some meta questions
> >>>>> about
> >>>>>> it:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1) Would an instance always pause/resume all of its
> >>>> current
> >>>>>> owned
> >>>>>>>>>>>>>>> topologies (i.e. the named topologies), or are there
> >> any
> >>>>>>>> scenarios
> >>>>>>>>>>>>> where
> >>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>> only want to pause/resume a subset of them?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> An instance may wish to pause some of its named
> >> topologies.
> >>>>> I
> >>>>>> was
> >>>>>>>>>>> unsure
> >>>>>>>>>>>>>> what to say about named topologies in the KIP since they
> >>>> seem
> >>>>>> to
> >>>>>>>> be
> >>>>>>>>>> an
> >>>>>>>>>>>>>> internal detail at the moment.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I intend to add to KafkaStreamsNamedTopologyWrapper
> >> methods
> >>>>>> like:
> >>>>>>>>>>>>>>       public void pauseNamedTopology(final String
> >>>>>> topologyToPause)
> >>>>>>>>>>>>>>       public boolean isNamedTopologyPaused(final String
> >>>>>> topology)
> >>>>>>>>>>>>>>       public void resumeNamedTopology(final String
> >>>>>>>> topologyToResume)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2) From a user's perspective, do we want to always
> >> issue a
> >>>>>>>>>>>>> `pause/resume`
> >>>>>>>>>>>>>>> to all the instances or not? For example, we can define
> >>>> the
> >>>>>>>>>> semantics
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>>> the function as "you only need to call this function on
> >>>> any
> >>>>> of
> >>>>>>>> the
> >>>>>>>>>>>>>>> application's instances, and all instances would then
> >>>> pause
> >>>>>> (via
> >>>>>>>> the
> >>>>>>>>>>>>>>> rebalance error codes)", or as "you would call this
> >>>> function
> >>>>>> for
> >>>>>>>> all
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>> instances of an application". Which one are you
> >> referring
> >>>>> to?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> My initial intent is that one would call this function
> >> on
> >>>> any
> >>>>>>>>>> instances
> >>>>>>>>>>>>> of
> >>>>>>>>>>>>>> the application that one wishes to pause.  This should
> >>>> allow
> >>>>>> more
> >>>>>>>>>>> control
> >>>>>>>>>>>>>> (in case one wanted to pause a portion of the
> >> instances).
> >>>> On
> >>>>>> the
> >>>>>>>>>> other
> >>>>>>>>>>>>>> hand, this approach would put more work on the
> >> implementer
> >>>> to
> >>>>>>>>>>> coordinate
> >>>>>>>>>>>>>> calling pause or resume across instances.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If the other option is more suitable, happy to do that
> >>>>> instead.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 3) With EOS, there's a transaction timeout which would
> >>>>>> determine
> >>>>>>>> how
> >>>>>>>>>>>>>> long a
> >>>>>>>>>>>>>>> transaction can stay idle before it's force-aborted on
> >> the
> >>>>>> broker
> >>>>>>>>>>>>> side. I
> >>>>>>>>>>>>>>> think when a pause is issued, that means we'd need to
> >>>>>> immediately
> >>>>>>>>>>>>> commit
> >>>>>>>>>>>>>>> the current transaction for EOS since we do not know
> >> how
> >>>>> long
> >>>>>> we
> >>>>>>>>>> could
> >>>>>>>>>>>>>>> pause for. Is that right? If yes could you please
> >> clarify
> >>>>>> that in
> >>>>>>>>>> the
> >>>>>>>>>>>>> doc
> >>>>>>>>>>>>>>> as well.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Good point.  My intent is for pause() to wait for the
> >> next
> >>>>>>>> iteration
> >>>>>>>>>>>>>> through `runOnce()` and then only skip over the
> >> processing
> >>>>> for
> >>>>>>>> paused
> >>>>>>>>>>>>> tasks
> >>>>>>>>>>>>>> in `taskManager.process(numIterations, time)`.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Do commits live inside that call or do they live
> >>>>>> across/outside of
> >>>>>>>>>> it?
> >>>>>>>>>>>>> In
> >>>>>>>>>>>>>> the former case, I think there shouldn't be any issues
> >> with
> >>>>>> EOS.
> >>>>>>>>>>>>>> Otherwise, we may need to work through some details to
> >> get
> >>>>> EOS
> >>>>>>>> right.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Once we figure that out, I can update the KIP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jim
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, May 4, 2022 at 10:51 AM Jim Hughes
> >>>>>>>>>>>>> <jhug...@confluent.io.invalid
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have written up a KIP for adding the ability to
> >> pause
> >>>> and
> >>>>>>>> resume
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> processing of a topology in AK Streams.  The KIP is
> >> here:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks in advance for your feedback!
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Jim
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> -- Guozhang
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> -- Guozhang
> >>>>>
> >>>>
> >>
> >
>

Reply via email to