Hi Bill, all,

Thank you.  I've updated the KIP to reflect pausing standby tasks as well.
I think all the outstanding points have been addressed and I'm going to
start the vote thread!

Cheers,

Jim



On Tue, May 10, 2022 at 2:43 PM Bill Bejeck <bbej...@gmail.com> wrote:

> Hi Jim,
>
> After reading the comments on the KIP, I agree that it makes sense to pause
> all activities and any changes can be made later on.
>
> Thanks,
> Bill
>
> On Tue, May 10, 2022 at 4:03 AM Bruno Cadonna <cado...@apache.org> wrote:
>
> > Hi Jim,
> >
> > Thanks for the KIP!
> >
> > I am fine with the KIP in general.
> >
> > However, I am with Sophie and John to also pause the standbys for the
> > reasons they brought up. Is there a specific reason you want to keep
> > standbys going? It feels like premature optimization to me. We can still
> > add keeping standby running in a follow up if needed.
> >
> > Best,
> > Bruno
> >
> > On 10.05.22 05:15, Sophie Blee-Goldman wrote:
> > > Thanks Jim, just one note/question on the standby tasks:
> > >
> > > At the minute, my moderately held position is that standby tasks ought
> to
> > >> continue reading and remain caught up.  If standby tasks would run out
> > of
> > >> space, there are probably bigger problems.
> > >
> > >
> > > For a single node application, or when the #pause API is invoked on all
> > > instances,
> > > then there won't be any further active processing and thus nothing to
> > keep
> > > up with,
> > > right? So for that case, it's just a matter of whether any standbys
> that
> > > are lagging
> > > will have the chance to catch up to the (paused) active task state
> before
> > > they stop
> > > as well, in which case having them continue feels fine to me. However
> > this
> > > is a
> > > relatively trivial benefit and I would only consider it as a deciding
> > > factor when all
> > > things are equal otherwise.
> > >
> > > My concern is the more interesting case: when this feature is used to
> > pause
> > > only
> > > one nodes, or some subset of the overall application. In this case,
> yes,
> > > the standby
> > > tasks will indeed fall out of sync. But the only reason I can imagine
> > > someone using
> > > the pause feature in such a way is because there is something going
> > wrong,
> > > or about
> > > to go wrong, on that particular node. For example as mentioned above,
> if
> > > the user
> > > wants to cut down on costs without stopping everything, or if the node
> is
> > > about to
> > > run out of disk or needs to be debugged or so on. And in this case,
> > > continuing to
> > > process the standby tasks while other instances continue to run would
> > > pretty much
> > > defeat the purpose of pausing it entirely, and might have unpleasant
> > > consequences
> > > for the unsuspecting developer.
> > >
> > > All that said, I don't want to block this KIP so if you have strong
> > > feelings about the
> > > standby behavior I'm happy to back down. I'm only pushing back now
> > because
> > > it
> > > felt like there wasn't any particular motivation for the standbys to
> > > continue processing
> > > or not, and I figured I'd try to fill in this gap with my thoughts on
> the
> > > matter :)
> > > Either way we should just make sure that this behavior is documented
> > > clearly,
> > > since it may be surprising if we decide to only pause active processing
> > > (another option
> > > is to rename the method something like #pauseProcessing or
> > > #pauseActiveProcessing
> > > so that it's hard to miss).
> > >
> > > Thanks! Sorry for the lengthy response, but hopefully we won't need to
> > > debate this any
> > > further. Beyond this I'm satisfied with the latest proposal
> > >
> > > On Mon, May 9, 2022 at 5:16 PM John Roesler <vvcep...@apache.org>
> wrote:
> > >
> > >> Thanks for the updates, Jim!
> > >>
> > >> After this discussion and your updates, this KIP looks good to me.
> > >>
> > >> Thanks,
> > >> John
> > >>
> > >> On Mon, May 9, 2022, at 17:52, Jim Hughes wrote:
> > >>> Hi Sophie, all,
> > >>>
> > >>> I've updated the KIP with feedback from the discussion so far:
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
> > >>>
> > >>> As a terse summary of my current position:
> > >>> Pausing will only stop processing and punctuation (respecting modular
> > >>> topologies).
> > >>> Paused topologies will still a) consume from input topics, b) call
> the
> > >>> usual commit pathways (commits will happen basically as they would
> > have),
> > >>> and c) standBy tasks will still be processed.
> > >>>
> > >>> Shout if the KIP or those details still need some TLC.  Responding to
> > >>> Sophie inline below.
> > >>>
> > >>>
> > >>> On Mon, May 9, 2022 at 6:06 PM Sophie Blee-Goldman
> > >>> <sop...@confluent.io.invalid> wrote:
> > >>>
> > >>>> Don't worry, I'm going to be adding the APIs for topology-level
> > pausing
> > >> as
> > >>>> part of the modular topologies KIP,
> > >>>> so we don't need to worry about that for now. That said, I don't
> think
> > >> we
> > >>>> should brush it off entirely and design
> > >>>> this feature in a way that's going to be incompatible or hugely
> raise
> > >> the
> > >>>> LOE on bringing the (mostly already
> > >>>> implemented) modular topologies feature into the public API, just
> > >>>> because it "won the race to write a KIP" :)
> > >>>>
> > >>>
> > >>> Yes, I'm hoping that this is all compatible with modular
> topologies.  I
> > >>> haven't seen anything so far which seems to be a problem; this KIP is
> > >> just
> > >>> in a weird state to discuss details of acting on modular
> topologies.:)
> > >>>
> > >>>
> > >>>> I may be biased (ok, I definitely am), but I'm not in favor of
> adding
> > >> this
> > >>>> as a state regardless of the modular topologies.
> > >>>> First of all any change to the KafkaStreams state machine is a
> > breaking
> > >>>> change, no? So we would have to wait until
> > >>>> the next major release which seems like an unnecessary thing to
> block
> > >> on.
> > >>>> (Whether to add this as a state to the
> > >>>> StreamThread's FSM is an implementation detail).
> > >>>>
> > >>>
> > >>> +1.  I am sold on skipping out on new states.  I had that as a
> rejected
> > >>> alternative in the KIP and have added a few more words to that bit.
> > >>>
> > >>>
> > >>>> Also, the semantics of using an `isPaused` method to distinguish a
> > >> paused
> > >>>> instance (or topology) make more sense
> > >>>> to me -- this is a user-specified status, whereas the KafkaStreams
> > >> state is
> > >>>> intended to relay the status of the system
> > >>>> itself. For example, if we are going to continue to poll during
> pause,
> > >> then
> > >>>> shouldn't the client transition to REBALANCING?
> > >>>> I believe it makes sense to still allow distinguishing these states
> > >> while a
> > >>>> client is paused, whereas making PAUSED its
> > >>>> own state means you can't tell when the client is rebalancing vs
> > >> running,
> > >>>> or whether it is paused or dead: presumably
> > >>>> the NOT_RUNNING/ERROR state would trump the PAUSED state, which
> means
> > >> you
> > >>>> would not be able to rely on
> > >>>> checking the state to see if you had called PAUSED on that instance.
> > >>>> Obviously you can work around this by just
> > >>>> maintaining a flag in the usercode, but all this feels very
> unnatural
> > >> to me
> > >>>> vs just checking the `#isPaused` API.
> > >>>>
> > >>>> On that note, I had one question -- at what point would the
> > `#isPaused`
> > >>>> check return true? Would it do so immediately
> > >>>> after pausing the instance, or only once it has finished committing
> > >> offsets
> > >>>> and stopped returning records?
> > >>>>
> > >>>
> > >>> Immediately, `#isPaused` tells you about metadata.
> > >>>
> > >>>
> > >>>> Finally, on the note of punctuators I think it would make most sense
> > to
> > >>>> either pause these as well or else add this an
> > >>>> an explicit option for the user. If this feature is used to, for
> > >> example,
> > >>>> help save on processing costs while an app is
> > >>>> not in use, then it would probably be surprising and perhaps
> alarming
> > to
> > >>>> see certain kinds of processing still continue.
> > >>>>
> > >>>
> > >>>  From other parts of the discussion, I'm sold on pausing punctuation.
> > >>>
> > >>>
> > >>>> The question of whether to continue fetching for standby tasks is
> > maybe
> > >> a
> > >>>> bit more debatable, as it would certainly be
> > >>>> nice to find your clients all caught up when you go to resume the
> > >> instance
> > >>>> again, but I would still strongly suggest
> > >>>> pausing these as well. To use a similar example, imagine if you
> paused
> > >> an
> > >>>> app because it was about to run out of
> > >>>> disk. If the standbys kept processing and filled up the remaining
> > space,
> > >>>> you'd probably feel a bit betrayed by this API.
> > >>>>
> > >>>> WDYT?
> > >>>>
> > >>>
> > >>> At the minute, my moderately held position is that standby tasks
> ought
> > to
> > >>> continue reading and remain caught up.  If standby tasks would run
> out
> > of
> > >>> space, there are probably bigger problems.
> > >>>
> > >>> If later it is desirable to manage punctuation or standby tasks, then
> > it
> > >>> should be easy for future folks to modify things.
> > >>>
> > >>> Overall, I'd frame this KIP as "pause processing resulting in
> outputs".
> > >>>
> > >>> Cheers,
> > >>>
> > >>> Jim
> > >>>
> > >>>
> > >>>
> > >>>> On Mon, May 9, 2022 at 10:33 AM Guozhang Wang <wangg...@gmail.com>
> > >> wrote:
> > >>>>
> > >>>>> I think for named topology we can leave the scope of this KIP as
> "all
> > >> or
> > >>>>> nothing", i.e. when you pause an instance you pause all of its
> > >>>> topologies.
> > >>>>> I raised this question in my previous email just trying to clarify
> if
> > >>>> this
> > >>>>> is what you have in mind. We can leave the question of finer
> > >> controlled
> > >>>>> pausing behavior for later when we have named topology being
> exposed
> > >> via
> > >>>>> another KIP.
> > >>>>>
> > >>>>>
> > >>>>> Guozhang
> > >>>>>
> > >>>>> On Mon, May 9, 2022 at 7:50 AM John Roesler <vvcep...@apache.org>
> > >> wrote:
> > >>>>>
> > >>>>>> Hi Jim,
> > >>>>>>
> > >>>>>> Thanks for the replies. This all sounds good to me. Just two
> further
> > >>>>>> comments:
> > >>>>>>
> > >>>>>> 3. It seems like you should aim for the simplest semantics. If the
> > >>>> intent
> > >>>>>> is to “pause” the instance, then you’d better pause the whole
> > >> instance.
> > >>>>> If
> > >>>>>> you leave punctuations and standbys running, I expect we’d see bug
> > >>>>> reports
> > >>>>>> come in that the instance isn’t really paused.
> > >>>>>>
> > >>>>>> 5. Since you won the race to write a KIP, I don’t think it makes
> too
> > >>>> much
> > >>>>>> sense to worry too much about modular topologies. When they
> propose
> > >>>> their
> > >>>>>> KIP, they will have to specify a lot of state management behavior,
> > >> and
> > >>>>>> pause/resume will have to be part of it. If they have some concern
> > >>>> about
> > >>>>>> your KIP, they’ll chime in. It doesn’t make sense for you to try
> and
> > >>>>> guess
> > >>>>>> what that proposal will look like.
> > >>>>>>
> > >>>>>> To be honest, you’re proposing a KafkaStreams runtime-level
> > >>>> pause/resume
> > >>>>>> function, not a topology-level one anyway, so it seems pretty
> clear
> > >>>> that
> > >>>>> it
> > >>>>>> would pause the whole runtime (of a single instance) regardless of
> > >> any
> > >>>>>> modular topologies. If the intent is to pause individual
> topologies
> > >> in
> > >>>>> the
> > >>>>>> future, you’d need a different API anyway.
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>> -John
> > >>>>>>
> > >>>>>> On Mon, May 9, 2022, at 08:10, Jim Hughes wrote:
> > >>>>>>> Hi John,
> > >>>>>>>
> > >>>>>>> Long emails are great; responding inline!
> > >>>>>>>
> > >>>>>>> On Sat, May 7, 2022 at 4:54 PM John Roesler <vvcep...@apache.org
> >
> > >>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Thanks for the KIP, Jim!
> > >>>>>>>>
> > >>>>>>>> This conversation seems to highlight that the KIP needs to
> > >> specify
> > >>>>>>>> some of its behavior as well as its APIs, where the behavior is
> > >>>>>>>> observable and significant to users.
> > >>>>>>>>
> > >>>>>>>> For example:
> > >>>>>>>>
> > >>>>>>>> 1. Do you plan to have a guarantee that immediately after
> > >>>>>>>> calling KafkaStreams.pause(), users should observe that the
> > >> instance
> > >>>>>>>> stops processing new records? Or should they expect that the
> > >> threads
> > >>>>>>>> will continue to process some records and pause asynchronously
> > >>>>>>>> (you already answered this in the thread earlier)?
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> I'm happy to build up to a guarantee of sorts.  My current idea
> is
> > >>>> that
> > >>>>>>> pause() does not do anything "exceptional" to get control back
> > >> from a
> > >>>>>>> running topology.  A currently running topology would get to
> > >> complete
> > >>>>> its
> > >>>>>>> loop.
> > >>>>>>>
> > >>>>>>> Separately, I'm still piecing together how commits work.  By some
> > >>>>>>> mechanism, after a pause, I do agree that the topology needs to
> > >>>> commit
> > >>>>>> its
> > >>>>>>> work in some manner.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> 2. Will the threads continue to poll new records until they
> > >>>> naturally
> > >>>>>> fill
> > >>>>>>>> up the task buffers, or will they immediately pause their
> > >> Consumers
> > >>>>>>>> as well?
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> Presently, I'm suggesting that consumers would fill up their
> > >> buffers.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> 3. Will threads continue to call (system time) punctuators, or
> > >> would
> > >>>>>>>> punctuations also be paused?
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> In my first pass at thinking through this, I left the punctuators
> > >>>>>> running.
> > >>>>>>> To be honest, I'm not sure what they do, so my approach is either
> > >>>> lucky
> > >>>>>> and
> > >>>>>>> correct or it could be Very Clearly Wrong.;)
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> I realize that some of those questions simply may not have
> > >> occurred
> > >>>> to
> > >>>>>>>> you, so this is not a criticism for leaving them off; I'm just
> > >>>>> pointing
> > >>>>>> out
> > >>>>>>>> that although we don't tend to mention implementation details in
> > >>>> KIPs,
> > >>>>>>>> we also can't be too high level, since there are a lot of
> > >>>> operational
> > >>>>>>>> details that users rely on to achieve various behaviors in
> > >> Streams.
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> Ayup, I will add some details as we iron out the guarantees,
> > >>>>>> implementation
> > >>>>>>> details that are at the API level.  This one is tough since
> > >> internal
> > >>>>>>> features like NamedTopologies are part of the discussion.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> A couple more comments:
> > >>>>>>>>
> > >>>>>>>> 4. +1 to what Guozhang said. It seems like we should we also do
> a
> > >>>>> commit
> > >>>>>>>> before entering the paused state. That way, any open
> transactions
> > >>>>> would
> > >>>>>>>> be closed and not have to worry about timing out. Even under
> > >> ALOS,
> > >>>> it
> > >>>>>>>> seems best to go ahead and complete the processing of in-flight
> > >>>>> records
> > >>>>>>>> by committing. That way, if anything happens to die while it's
> > >>>> paused,
> > >>>>>>>> existing
> > >>>>>>>> work won't have to be repeated. Plus, if there are any
> processors
> > >>>> with
> > >>>>>> side
> > >>>>>>>> effects, users won't have to tolerate weird edge cases where a
> > >> pause
> > >>>>>> occurs
> > >>>>>>>> after a processor sees a record, but before the result is sent
> to
> > >>>> its
> > >>>>>>>> outputs.
> > >>>>>>>>
> > >>>>>>>> 5. I noticed that you proposed not to add a PAUSED state, but I
> > >>>> didn't
> > >>>>>>>> follow
> > >>>>>>>> the rationale. Adding a state seems beneficial for a number of
> > >>>>> reasons:
> > >>>>>>>> StreamThreads already use the thread state to determine whether
> > >> to
> > >>>>>> process
> > >>>>>>>> or not, so avoiding a new State would just mean adding a
> separate
> > >>>> flag
> > >>>>>> to
> > >>>>>>>> track
> > >>>>>>>> and then checking your new flag in addition to the State in the
> > >>>>> thread.
> > >>>>>>>> Also,
> > >>>>>>>> operating Streams applications is a non-trivial task, and users
> > >> rely
> > >>>>> on
> > >>>>>>>> the State
> > >>>>>>>> (and transitions) to understand Streams's behavior. Adding a
> > >> PAUSED
> > >>>>>> state
> > >>>>>>>> is an elegant way to communicate to operators what is happening
> > >> with
> > >>>>> the
> > >>>>>>>> application. Note that the person digging though logs and
> > >> metrics,
> > >>>>>> trying
> > >>>>>>>> to understand why the application isn't doing anything is
> > >> probably
> > >>>> not
> > >>>>>>>> going
> > >>>>>>>> to be the same person who is calling pause() and resume(). Also,
> > >> if
> > >>>>> you
> > >>>>>> add
> > >>>>>>>> a state, you don't need `isPaused()`.
> > >>>>>>>>
> > >>>>>>>> 5b. If you buy the arguments to go ahead and commit as well as
> > >> the
> > >>>>>>>> argument to add a State, then I'd also suggest to follow the
> > >>>> existing
> > >>>>>>>> patterns
> > >>>>>>>> for the shutdown states by also adding PAUSING. That
> > >>>>>>>> way, you'll also expose a way to understand that Streams
> received
> > >>>> the
> > >>>>>>>> signal
> > >>>>>>>> to pause, and that it's still processing and committing some
> > >> records
> > >>>>> in
> > >>>>>>>> preparation to enter a PAUSED state. I'm not sure if a RESUMING
> > >>>> state
> > >>>>>> would
> > >>>>>>>> also make sense.
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> I hit a tricky bit when thinking through having a PAUSED
> > >> state...  If
> > >>>>> one
> > >>>>>>> is using Named Topologies, and some of them are paused, what
> > >> state is
> > >>>>> the
> > >>>>>>> Streams instance in?  If we can agree on that, things may become
> > >>>>>> clear....
> > >>>>>>> I can see two quick ideas:
> > >>>>>>>
> > >>>>>>> 1.  The state is RUNNING and NamedTopologies have some other way
> > >> to
> > >>>>>>> indicate state.
> > >>>>>>>
> > >>>>>>> 2.  The state is something messy like PARTIALLY_PAUSED to reflect
> > >>>> that
> > >>>>>> the
> > >>>>>>> instance has something interesting going on.
> > >>>>>>>
> > >>>>>>> When I poked at things initially, I did try out having different
> > >>>>> states,
> > >>>>>>> and I readily agree that a PAUSING state may make sense.
> > >> (Especially
> > >>>>> if
> > >>>>>>> there's a need to run commits before transitioning all the way to
> > >>>>>> PAUSED.)
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> And that's all I have to say about that. I hope you don't find
> my
> > >>>>>>>> long message offputting. I'm fundamentally in favor of your KIP,
> > >>>>>>>> and I think with a little more explanation in the KIP, and a few
> > >>>>>>>> small tweaks to the proposal, we'll be able to provide good
> > >>>>>>>> ergonomics to our users.
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> Thanks!
> > >>>>>>>
> > >>>>>>> Jim
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> -John
> > >>>>>>>>
> > >>>>>>>> On Sat, May 7, 2022, at 00:06, Guozhang Wang wrote:
> > >>>>>>>>> I'm in favor of the "just pausing the instance itself“ option
> > >> as
> > >>>>>> well. As
> > >>>>>>>>> for EOS, the point is that when the processing is paused, we
> > >> would
> > >>>>> not
> > >>>>>>>>> trigger any `producer.send` during the time, and the
> > >> transaction
> > >>>>>> timeout
> > >>>>>>>> is
> > >>>>>>>>> sort of relying on that behavior, so my point was that it's
> > >>>> probably
> > >>>>>>>> better
> > >>>>>>>>> to also commit the processing before we pause it.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Guozhang
> > >>>>>>>>>
> > >>>>>>>>> On Fri, May 6, 2022 at 6:12 PM Jim Hughes
> > >>>>>> <jhug...@confluent.io.invalid>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi Matthias,
> > >>>>>>>>>>
> > >>>>>>>>>> Since the only thing which will be paused is processing the
> > >>>>>> topology, I
> > >>>>>>>>>> think we can let commits happen naturally.
> > >>>>>>>>>>
> > >>>>>>>>>> Good point about getting the paused state to new members; it
> > >> is
> > >>>>>> seeming
> > >>>>>>>>>> like the "building block" approach is a good one to keep
> > >> things
> > >>>>>> simple
> > >>>>>>>> at
> > >>>>>>>>>> first.
> > >>>>>>>>>>
> > >>>>>>>>>> Cheers,
> > >>>>>>>>>>
> > >>>>>>>>>> Jim
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, May 6, 2022 at 8:31 PM Matthias J. Sax <
> > >> mj...@apache.org
> > >>>>>
> > >>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> I think it's tricky to propagate a pauseAll() via the
> > >> rebalance
> > >>>>>>>>>>> protocol. New members joining the group would need to get
> > >>>> paused,
> > >>>>>> too?
> > >>>>>>>>>>> Could there be weird race conditions with overlapping
> > >>>> pauseAll()
> > >>>>>> and
> > >>>>>>>>>>> resumeAll() calls on different instanced while there could
> > >> be a
> > >>>>>>>> errors /
> > >>>>>>>>>>> network partitions or similar?
> > >>>>>>>>>>>
> > >>>>>>>>>>> I would argue that similar to IQ, we provide the basic
> > >> building
> > >>>>>>>> blocks,
> > >>>>>>>>>>> and leave it the user users to implement cross instance
> > >>>>> management
> > >>>>>>>> for a
> > >>>>>>>>>>> pauseAll() scenario. -- Also, if there is really demand, we
> > >> can
> > >>>>>> always
> > >>>>>>>>>>> add pauseAll()/resumeAll() as follow up work.
> > >>>>>>>>>>>
> > >>>>>>>>>>> About named typologies: I agree to Jim to not include them
> > >> in
> > >>>>> this
> > >>>>>> KIP
> > >>>>>>>>>>> as they are not a public feature yet. If we make named
> > >>>> typologies
> > >>>>>>>>>>> public, the corresponding KIP should extend the pause/resume
> > >>>>>> feature
> > >>>>>>>>>>> (ie, APIs) accordingly. Of course, the code can (and should)
> > >>>>>> already
> > >>>>>>>> be
> > >>>>>>>>>>> setup to support it to be future proof.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Good call out about commit and EOS -- to simplify it, I
> > >> think
> > >>>> it
> > >>>>>> might
> > >>>>>>>>>>> be good to commit also for the at-least-once case?
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> -Matthias
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 5/6/22 1:05 PM, Jim Hughes wrote:
> > >>>>>>>>>>>> Hi Bill,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Great questions; I'll do my best to reply inline:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Fri, May 6, 2022 at 3:21 PM Bill Bejeck <
> > >>>> bbej...@gmail.com>
> > >>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Jim,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for the KIP.  I have a couple of meta-questions as
> > >>>>> well:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1) Regarding pausing only a subset of running instances,
> > >> I'm
> > >>>>>>>> thinking
> > >>>>>>>>>>> there
> > >>>>>>>>>>>>> may be a use case for pausing all of them.
> > >>>>>>>>>>>>>      Would it make sense to also allow for pausing all
> > >>>>> instances
> > >>>>>> by
> > >>>>>>>>>>> adding a
> > >>>>>>>>>>>>> method `pauseAll()` or something similar?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Honestly, I'm indifferent on this point.  Presently, I
> > >> think
> > >>>>>> what I
> > >>>>>>>>>> have
> > >>>>>>>>>>>> proposed is the minimal change to get the ability to pause
> > >>>> and
> > >>>>>>>> resume
> > >>>>>>>>>>>> processing.  If adding a 'pauseAll()' is required, I'd be
> > >>>> happy
> > >>>>>> to
> > >>>>>>>> do
> > >>>>>>>>>>> that!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>   From Guozhang's email, it sounds like this would require
> > >>>> using
> > >>>>>> the
> > >>>>>>>>>>>> rebalance protocol to trigger the coordination.  Would
> > >> there
> > >>>> be
> > >>>>>>>> enough
> > >>>>>>>>>>> room
> > >>>>>>>>>>>> in that approach to indicate that a named topology is to
> > >> be
> > >>>>>> paused
> > >>>>>>>>>> across
> > >>>>>>>>>>>> all nodes?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> 2) Would pausing affect standby tasks?  For example,
> > >> imagine
> > >>>>>> there
> > >>>>>>>>>> are 3
> > >>>>>>>>>>>>> instances A, B, and C.
> > >>>>>>>>>>>>>      A user elects to pause instance C only but it hosts
> > >> the
> > >>>>>> standby
> > >>>>>>>>>>> tasks
> > >>>>>>>>>>>>> for A.
> > >>>>>>>>>>>>>      Would the standby tasks on the paused application
> > >>>> continue
> > >>>>>> to
> > >>>>>>>> read
> > >>>>>>>>>>> from
> > >>>>>>>>>>>>> the changelog topic?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Yes, standby tasks would continue reading from the
> > >> changelog
> > >>>>>> topic.
> > >>>>>>>>>> All
> > >>>>>>>>>>>> consumers would continue reading to avoid getting dropped
> > >>>> from
> > >>>>>> their
> > >>>>>>>>>>>> consumer groups.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Jim
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>> Bill
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Fri, May 6, 2022 at 2:44 PM Jim Hughes
> > >>>>>>>>>> <jhug...@confluent.io.invalid
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Hi Guozhang,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for the feedback; responses inline below:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Fri, May 6, 2022 at 1:09 PM Guozhang Wang <
> > >>>>>> wangg...@gmail.com>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Hello Jim,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks for the proposed KIP. I have some meta questions
> > >>>>> about
> > >>>>>> it:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 1) Would an instance always pause/resume all of its
> > >>>> current
> > >>>>>> owned
> > >>>>>>>>>>>>>>> topologies (i.e. the named topologies), or are there
> > >> any
> > >>>>>>>> scenarios
> > >>>>>>>>>>>>> where
> > >>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>> only want to pause/resume a subset of them?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> An instance may wish to pause some of its named
> > >> topologies.
> > >>>>> I
> > >>>>>> was
> > >>>>>>>>>>> unsure
> > >>>>>>>>>>>>>> what to say about named topologies in the KIP since they
> > >>>> seem
> > >>>>>> to
> > >>>>>>>> be
> > >>>>>>>>>> an
> > >>>>>>>>>>>>>> internal detail at the moment.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I intend to add to KafkaStreamsNamedTopologyWrapper
> > >> methods
> > >>>>>> like:
> > >>>>>>>>>>>>>>       public void pauseNamedTopology(final String
> > >>>>>> topologyToPause)
> > >>>>>>>>>>>>>>       public boolean isNamedTopologyPaused(final String
> > >>>>>> topology)
> > >>>>>>>>>>>>>>       public void resumeNamedTopology(final String
> > >>>>>>>> topologyToResume)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 2) From a user's perspective, do we want to always
> > >> issue a
> > >>>>>>>>>>>>> `pause/resume`
> > >>>>>>>>>>>>>>> to all the instances or not? For example, we can define
> > >>>> the
> > >>>>>>>>>> semantics
> > >>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>> the function as "you only need to call this function on
> > >>>> any
> > >>>>> of
> > >>>>>>>> the
> > >>>>>>>>>>>>>>> application's instances, and all instances would then
> > >>>> pause
> > >>>>>> (via
> > >>>>>>>> the
> > >>>>>>>>>>>>>>> rebalance error codes)", or as "you would call this
> > >>>> function
> > >>>>>> for
> > >>>>>>>> all
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>> instances of an application". Which one are you
> > >> referring
> > >>>>> to?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> My initial intent is that one would call this function
> > >> on
> > >>>> any
> > >>>>>>>>>> instances
> > >>>>>>>>>>>>> of
> > >>>>>>>>>>>>>> the application that one wishes to pause.  This should
> > >>>> allow
> > >>>>>> more
> > >>>>>>>>>>> control
> > >>>>>>>>>>>>>> (in case one wanted to pause a portion of the
> > >> instances).
> > >>>> On
> > >>>>>> the
> > >>>>>>>>>> other
> > >>>>>>>>>>>>>> hand, this approach would put more work on the
> > >> implementer
> > >>>> to
> > >>>>>>>>>>> coordinate
> > >>>>>>>>>>>>>> calling pause or resume across instances.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> If the other option is more suitable, happy to do that
> > >>>>> instead.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> 3) With EOS, there's a transaction timeout which would
> > >>>>>> determine
> > >>>>>>>> how
> > >>>>>>>>>>>>>> long a
> > >>>>>>>>>>>>>>> transaction can stay idle before it's force-aborted on
> > >> the
> > >>>>>> broker
> > >>>>>>>>>>>>> side. I
> > >>>>>>>>>>>>>>> think when a pause is issued, that means we'd need to
> > >>>>>> immediately
> > >>>>>>>>>>>>> commit
> > >>>>>>>>>>>>>>> the current transaction for EOS since we do not know
> > >> how
> > >>>>> long
> > >>>>>> we
> > >>>>>>>>>> could
> > >>>>>>>>>>>>>>> pause for. Is that right? If yes could you please
> > >> clarify
> > >>>>>> that in
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> doc
> > >>>>>>>>>>>>>>> as well.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Good point.  My intent is for pause() to wait for the
> > >> next
> > >>>>>>>> iteration
> > >>>>>>>>>>>>>> through `runOnce()` and then only skip over the
> > >> processing
> > >>>>> for
> > >>>>>>>> paused
> > >>>>>>>>>>>>> tasks
> > >>>>>>>>>>>>>> in `taskManager.process(numIterations, time)`.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Do commits live inside that call or do they live
> > >>>>>> across/outside of
> > >>>>>>>>>> it?
> > >>>>>>>>>>>>> In
> > >>>>>>>>>>>>>> the former case, I think there shouldn't be any issues
> > >> with
> > >>>>>> EOS.
> > >>>>>>>>>>>>>> Otherwise, we may need to work through some details to
> > >> get
> > >>>>> EOS
> > >>>>>>>> right.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Once we figure that out, I can update the KIP.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Jim
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Guozhang
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Wed, May 4, 2022 at 10:51 AM Jim Hughes
> > >>>>>>>>>>>>> <jhug...@confluent.io.invalid
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I have written up a KIP for adding the ability to
> > >> pause
> > >>>> and
> > >>>>>>>> resume
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>> processing of a topology in AK Streams.  The KIP is
> > >> here:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks in advance for your feedback!
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Jim
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>>> -- Guozhang
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> -- Guozhang
> > >>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> -- Guozhang
> > >>>>>
> > >>>>
> > >>
> > >
> >
>

Reply via email to