Re: [DISCUSS] KIP-834: Pause / Resume KafkaStreams Topologies

Luke Chen Wed, 11 May 2022 00:20:22 -0700

Hi Jim,

Thanks for the KIP. Overall LGTM!


One late question:
Could we run the stream resetter tool (i.e.
kafka-streams-application-reset.sh) during pause state?
I can imagine there's a use case that after pausing for a while, user just
want to continue with the latest offset, and skipping the intermediate
records.

Thank you.
Luke

On Wed, May 11, 2022 at 10:12 AM Jim Hughes <[email protected]>
wrote:

> Hi Matthias,
>
> I like it.  I've updated the KIP to reflect that detail; I put the details
> in the docs for pause.
>
> Cheers,
>
> Jim
>
> On Tue, May 10, 2022 at 7:51 PM Matthias J. Sax <[email protected]> wrote:
>
> > Thanks for the KIP. Overall LGTM.
> >
> > Can we clarify one question: would it be allowed to call `pause()`
> > before calling `start()`? I don't see any reason why we would need to
> > disallow it?
> >
> > It could be helpful to start a KafkaStreams client in paused state --
> > otherwise there is a race between calling `start()` and calling
> `pause()`.
> >
> > If we allow it, we should clearly document it.
> >
> >
> > -Matthias
> >
> > On 5/10/22 12:04 PM, Jim Hughes wrote:
> > > Hi Bill, all,
> > >
> > > Thank you.  I've updated the KIP to reflect pausing standby tasks as
> > well.
> > > I think all the outstanding points have been addressed and I'm going to
> > > start the vote thread!
> > >
> > > Cheers,
> > >
> > > Jim
> > >
> > >
> > >
> > > On Tue, May 10, 2022 at 2:43 PM Bill Bejeck <[email protected]> wrote:
> > >
> > >> Hi Jim,
> > >>
> > >> After reading the comments on the KIP, I agree that it makes sense to
> > pause
> > >> all activities and any changes can be made later on.
> > >>
> > >> Thanks,
> > >> Bill
> > >>
> > >> On Tue, May 10, 2022 at 4:03 AM Bruno Cadonna <[email protected]>
> > wrote:
> > >>
> > >>> Hi Jim,
> > >>>
> > >>> Thanks for the KIP!
> > >>>
> > >>> I am fine with the KIP in general.
> > >>>
> > >>> However, I am with Sophie and John to also pause the standbys for the
> > >>> reasons they brought up. Is there a specific reason you want to keep
> > >>> standbys going? It feels like premature optimization to me. We can
> > still
> > >>> add keeping standby running in a follow up if needed.
> > >>>
> > >>> Best,
> > >>> Bruno
> > >>>
> > >>> On 10.05.22 05:15, Sophie Blee-Goldman wrote:
> > >>>> Thanks Jim, just one note/question on the standby tasks:
> > >>>>
> > >>>> At the minute, my moderately held position is that standby tasks
> ought
> > >> to
> > >>>>> continue reading and remain caught up.  If standby tasks would run
> > out
> > >>> of
> > >>>>> space, there are probably bigger problems.
> > >>>>
> > >>>>
> > >>>> For a single node application, or when the #pause API is invoked on
> > all
> > >>>> instances,
> > >>>> then there won't be any further active processing and thus nothing
> to
> > >>> keep
> > >>>> up with,
> > >>>> right? So for that case, it's just a matter of whether any standbys
> > >> that
> > >>>> are lagging
> > >>>> will have the chance to catch up to the (paused) active task state
> > >> before
> > >>>> they stop
> > >>>> as well, in which case having them continue feels fine to me.
> However
> > >>> this
> > >>>> is a
> > >>>> relatively trivial benefit and I would only consider it as a
> deciding
> > >>>> factor when all
> > >>>> things are equal otherwise.
> > >>>>
> > >>>> My concern is the more interesting case: when this feature is used
> to
> > >>> pause
> > >>>> only
> > >>>> one nodes, or some subset of the overall application. In this case,
> > >> yes,
> > >>>> the standby
> > >>>> tasks will indeed fall out of sync. But the only reason I can
> imagine
> > >>>> someone using
> > >>>> the pause feature in such a way is because there is something going
> > >>> wrong,
> > >>>> or about
> > >>>> to go wrong, on that particular node. For example as mentioned
> above,
> > >> if
> > >>>> the user
> > >>>> wants to cut down on costs without stopping everything, or if the
> node
> > >> is
> > >>>> about to
> > >>>> run out of disk or needs to be debugged or so on. And in this case,
> > >>>> continuing to
> > >>>> process the standby tasks while other instances continue to run
> would
> > >>>> pretty much
> > >>>> defeat the purpose of pausing it entirely, and might have unpleasant
> > >>>> consequences
> > >>>> for the unsuspecting developer.
> > >>>>
> > >>>> All that said, I don't want to block this KIP so if you have strong
> > >>>> feelings about the
> > >>>> standby behavior I'm happy to back down. I'm only pushing back now
> > >>> because
> > >>>> it
> > >>>> felt like there wasn't any particular motivation for the standbys to
> > >>>> continue processing
> > >>>> or not, and I figured I'd try to fill in this gap with my thoughts
> on
> > >> the
> > >>>> matter :)
> > >>>> Either way we should just make sure that this behavior is documented
> > >>>> clearly,
> > >>>> since it may be surprising if we decide to only pause active
> > processing
> > >>>> (another option
> > >>>> is to rename the method something like #pauseProcessing or
> > >>>> #pauseActiveProcessing
> > >>>> so that it's hard to miss).
> > >>>>
> > >>>> Thanks! Sorry for the lengthy response, but hopefully we won't need
> to
> > >>>> debate this any
> > >>>> further. Beyond this I'm satisfied with the latest proposal
> > >>>>
> > >>>> On Mon, May 9, 2022 at 5:16 PM John Roesler <[email protected]>
> > >> wrote:
> > >>>>
> > >>>>> Thanks for the updates, Jim!
> > >>>>>
> > >>>>> After this discussion and your updates, this KIP looks good to me.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> John
> > >>>>>
> > >>>>> On Mon, May 9, 2022, at 17:52, Jim Hughes wrote:
> > >>>>>> Hi Sophie, all,
> > >>>>>>
> > >>>>>> I've updated the KIP with feedback from the discussion so far:
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
> > >>>>>>
> > >>>>>> As a terse summary of my current position:
> > >>>>>> Pausing will only stop processing and punctuation (respecting
> > modular
> > >>>>>> topologies).
> > >>>>>> Paused topologies will still a) consume from input topics, b) call
> > >> the
> > >>>>>> usual commit pathways (commits will happen basically as they would
> > >>> have),
> > >>>>>> and c) standBy tasks will still be processed.
> > >>>>>>
> > >>>>>> Shout if the KIP or those details still need some TLC.  Responding
> > to
> > >>>>>> Sophie inline below.
> > >>>>>>
> > >>>>>>
> > >>>>>> On Mon, May 9, 2022 at 6:06 PM Sophie Blee-Goldman
> > >>>>>> <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> Don't worry, I'm going to be adding the APIs for topology-level
> > >>> pausing
> > >>>>> as
> > >>>>>>> part of the modular topologies KIP,
> > >>>>>>> so we don't need to worry about that for now. That said, I don't
> > >> think
> > >>>>> we
> > >>>>>>> should brush it off entirely and design
> > >>>>>>> this feature in a way that's going to be incompatible or hugely
> > >> raise
> > >>>>> the
> > >>>>>>> LOE on bringing the (mostly already
> > >>>>>>> implemented) modular topologies feature into the public API, just
> > >>>>>>> because it "won the race to write a KIP" :)
> > >>>>>>>
> > >>>>>>
> > >>>>>> Yes, I'm hoping that this is all compatible with modular
> > >> topologies.  I
> > >>>>>> haven't seen anything so far which seems to be a problem; this KIP
> > is
> > >>>>> just
> > >>>>>> in a weird state to discuss details of acting on modular
> > >> topologies.:)
> > >>>>>>
> > >>>>>>
> > >>>>>>> I may be biased (ok, I definitely am), but I'm not in favor of
> > >> adding
> > >>>>> this
> > >>>>>>> as a state regardless of the modular topologies.
> > >>>>>>> First of all any change to the KafkaStreams state machine is a
> > >>> breaking
> > >>>>>>> change, no? So we would have to wait until
> > >>>>>>> the next major release which seems like an unnecessary thing to
> > >> block
> > >>>>> on.
> > >>>>>>> (Whether to add this as a state to the
> > >>>>>>> StreamThread's FSM is an implementation detail).
> > >>>>>>>
> > >>>>>>
> > >>>>>> +1.  I am sold on skipping out on new states.  I had that as a
> > >> rejected
> > >>>>>> alternative in the KIP and have added a few more words to that
> bit.
> > >>>>>>
> > >>>>>>
> > >>>>>>> Also, the semantics of using an `isPaused` method to distinguish
> a
> > >>>>> paused
> > >>>>>>> instance (or topology) make more sense
> > >>>>>>> to me -- this is a user-specified status, whereas the
> KafkaStreams
> > >>>>> state is
> > >>>>>>> intended to relay the status of the system
> > >>>>>>> itself. For example, if we are going to continue to poll during
> > >> pause,
> > >>>>> then
> > >>>>>>> shouldn't the client transition to REBALANCING?
> > >>>>>>> I believe it makes sense to still allow distinguishing these
> states
> > >>>>> while a
> > >>>>>>> client is paused, whereas making PAUSED its
> > >>>>>>> own state means you can't tell when the client is rebalancing vs
> > >>>>> running,
> > >>>>>>> or whether it is paused or dead: presumably
> > >>>>>>> the NOT_RUNNING/ERROR state would trump the PAUSED state, which
> > >> means
> > >>>>> you
> > >>>>>>> would not be able to rely on
> > >>>>>>> checking the state to see if you had called PAUSED on that
> > instance.
> > >>>>>>> Obviously you can work around this by just
> > >>>>>>> maintaining a flag in the usercode, but all this feels very
> > >> unnatural
> > >>>>> to me
> > >>>>>>> vs just checking the `#isPaused` API.
> > >>>>>>>
> > >>>>>>> On that note, I had one question -- at what point would the
> > >>> `#isPaused`
> > >>>>>>> check return true? Would it do so immediately
> > >>>>>>> after pausing the instance, or only once it has finished
> committing
> > >>>>> offsets
> > >>>>>>> and stopped returning records?
> > >>>>>>>
> > >>>>>>
> > >>>>>> Immediately, `#isPaused` tells you about metadata.
> > >>>>>>
> > >>>>>>
> > >>>>>>> Finally, on the note of punctuators I think it would make most
> > sense
> > >>> to
> > >>>>>>> either pause these as well or else add this an
> > >>>>>>> an explicit option for the user. If this feature is used to, for
> > >>>>> example,
> > >>>>>>> help save on processing costs while an app is
> > >>>>>>> not in use, then it would probably be surprising and perhaps
> > >> alarming
> > >>> to
> > >>>>>>> see certain kinds of processing still continue.
> > >>>>>>>
> > >>>>>>
> > >>>>>>   From other parts of the discussion, I'm sold on pausing
> > punctuation.
> > >>>>>>
> > >>>>>>
> > >>>>>>> The question of whether to continue fetching for standby tasks is
> > >>> maybe
> > >>>>> a
> > >>>>>>> bit more debatable, as it would certainly be
> > >>>>>>> nice to find your clients all caught up when you go to resume the
> > >>>>> instance
> > >>>>>>> again, but I would still strongly suggest
> > >>>>>>> pausing these as well. To use a similar example, imagine if you
> > >> paused
> > >>>>> an
> > >>>>>>> app because it was about to run out of
> > >>>>>>> disk. If the standbys kept processing and filled up the remaining
> > >>> space,
> > >>>>>>> you'd probably feel a bit betrayed by this API.
> > >>>>>>>
> > >>>>>>> WDYT?
> > >>>>>>>
> > >>>>>>
> > >>>>>> At the minute, my moderately held position is that standby tasks
> > >> ought
> > >>> to
> > >>>>>> continue reading and remain caught up.  If standby tasks would run
> > >> out
> > >>> of
> > >>>>>> space, there are probably bigger problems.
> > >>>>>>
> > >>>>>> If later it is desirable to manage punctuation or standby tasks,
> > then
> > >>> it
> > >>>>>> should be easy for future folks to modify things.
> > >>>>>>
> > >>>>>> Overall, I'd frame this KIP as "pause processing resulting in
> > >> outputs".
> > >>>>>>
> > >>>>>> Cheers,
> > >>>>>>
> > >>>>>> Jim
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Mon, May 9, 2022 at 10:33 AM Guozhang Wang <
> [email protected]>
> > >>>>> wrote:
> > >>>>>>>
> > >>>>>>>> I think for named topology we can leave the scope of this KIP as
> > >> "all
> > >>>>> or
> > >>>>>>>> nothing", i.e. when you pause an instance you pause all of its
> > >>>>>>> topologies.
> > >>>>>>>> I raised this question in my previous email just trying to
> clarify
> > >> if
> > >>>>>>> this
> > >>>>>>>> is what you have in mind. We can leave the question of finer
> > >>>>> controlled
> > >>>>>>>> pausing behavior for later when we have named topology being
> > >> exposed
> > >>>>> via
> > >>>>>>>> another KIP.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Guozhang
> > >>>>>>>>
> > >>>>>>>> On Mon, May 9, 2022 at 7:50 AM John Roesler <
> [email protected]>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Jim,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the replies. This all sounds good to me. Just two
> > >> further
> > >>>>>>>>> comments:
> > >>>>>>>>>
> > >>>>>>>>> 3. It seems like you should aim for the simplest semantics. If
> > the
> > >>>>>>> intent
> > >>>>>>>>> is to “pause” the instance, then you’d better pause the whole
> > >>>>> instance.
> > >>>>>>>> If
> > >>>>>>>>> you leave punctuations and standbys running, I expect we’d see
> > bug
> > >>>>>>>> reports
> > >>>>>>>>> come in that the instance isn’t really paused.
> > >>>>>>>>>
> > >>>>>>>>> 5. Since you won the race to write a KIP, I don’t think it
> makes
> > >> too
> > >>>>>>> much
> > >>>>>>>>> sense to worry too much about modular topologies. When they
> > >> propose
> > >>>>>>> their
> > >>>>>>>>> KIP, they will have to specify a lot of state management
> > behavior,
> > >>>>> and
> > >>>>>>>>> pause/resume will have to be part of it. If they have some
> > concern
> > >>>>>>> about
> > >>>>>>>>> your KIP, they’ll chime in. It doesn’t make sense for you to
> try
> > >> and
> > >>>>>>>> guess
> > >>>>>>>>> what that proposal will look like.
> > >>>>>>>>>
> > >>>>>>>>> To be honest, you’re proposing a KafkaStreams runtime-level
> > >>>>>>> pause/resume
> > >>>>>>>>> function, not a topology-level one anyway, so it seems pretty
> > >> clear
> > >>>>>>> that
> > >>>>>>>> it
> > >>>>>>>>> would pause the whole runtime (of a single instance) regardless
> > of
> > >>>>> any
> > >>>>>>>>> modular topologies. If the intent is to pause individual
> > >> topologies
> > >>>>> in
> > >>>>>>>> the
> > >>>>>>>>> future, you’d need a different API anyway.
> > >>>>>>>>>
> > >>>>>>>>> Thanks!
> > >>>>>>>>> -John
> > >>>>>>>>>
> > >>>>>>>>> On Mon, May 9, 2022, at 08:10, Jim Hughes wrote:
> > >>>>>>>>>> Hi John,
> > >>>>>>>>>>
> > >>>>>>>>>> Long emails are great; responding inline!
> > >>>>>>>>>>
> > >>>>>>>>>> On Sat, May 7, 2022 at 4:54 PM John Roesler <
> > [email protected]
> > >>>
> > >>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Thanks for the KIP, Jim!
> > >>>>>>>>>>>
> > >>>>>>>>>>> This conversation seems to highlight that the KIP needs to
> > >>>>> specify
> > >>>>>>>>>>> some of its behavior as well as its APIs, where the behavior
> is
> > >>>>>>>>>>> observable and significant to users.
> > >>>>>>>>>>>
> > >>>>>>>>>>> For example:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. Do you plan to have a guarantee that immediately after
> > >>>>>>>>>>> calling KafkaStreams.pause(), users should observe that the
> > >>>>> instance
> > >>>>>>>>>>> stops processing new records? Or should they expect that the
> > >>>>> threads
> > >>>>>>>>>>> will continue to process some records and pause
> asynchronously
> > >>>>>>>>>>> (you already answered this in the thread earlier)?
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> I'm happy to build up to a guarantee of sorts.  My current
> idea
> > >> is
> > >>>>>>> that
> > >>>>>>>>>> pause() does not do anything "exceptional" to get control back
> > >>>>> from a
> > >>>>>>>>>> running topology.  A currently running topology would get to
> > >>>>> complete
> > >>>>>>>> its
> > >>>>>>>>>> loop.
> > >>>>>>>>>>
> > >>>>>>>>>> Separately, I'm still piecing together how commits work.  By
> > some
> > >>>>>>>>>> mechanism, after a pause, I do agree that the topology needs
> to
> > >>>>>>> commit
> > >>>>>>>>> its
> > >>>>>>>>>> work in some manner.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>> 2. Will the threads continue to poll new records until they
> > >>>>>>> naturally
> > >>>>>>>>> fill
> > >>>>>>>>>>> up the task buffers, or will they immediately pause their
> > >>>>> Consumers
> > >>>>>>>>>>> as well?
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Presently, I'm suggesting that consumers would fill up their
> > >>>>> buffers.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>> 3. Will threads continue to call (system time) punctuators,
> or
> > >>>>> would
> > >>>>>>>>>>> punctuations also be paused?
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> In my first pass at thinking through this, I left the
> > punctuators
> > >>>>>>>>> running.
> > >>>>>>>>>> To be honest, I'm not sure what they do, so my approach is
> > either
> > >>>>>>> lucky
> > >>>>>>>>> and
> > >>>>>>>>>> correct or it could be Very Clearly Wrong.;)
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>> I realize that some of those questions simply may not have
> > >>>>> occurred
> > >>>>>>> to
> > >>>>>>>>>>> you, so this is not a criticism for leaving them off; I'm
> just
> > >>>>>>>> pointing
> > >>>>>>>>> out
> > >>>>>>>>>>> that although we don't tend to mention implementation details
> > in
> > >>>>>>> KIPs,
> > >>>>>>>>>>> we also can't be too high level, since there are a lot of
> > >>>>>>> operational
> > >>>>>>>>>>> details that users rely on to achieve various behaviors in
> > >>>>> Streams.
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Ayup, I will add some details as we iron out the guarantees,
> > >>>>>>>>> implementation
> > >>>>>>>>>> details that are at the API level.  This one is tough since
> > >>>>> internal
> > >>>>>>>>>> features like NamedTopologies are part of the discussion.
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>> A couple more comments:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 4. +1 to what Guozhang said. It seems like we should we also
> do
> > >> a
> > >>>>>>>> commit
> > >>>>>>>>>>> before entering the paused state. That way, any open
> > >> transactions
> > >>>>>>>> would
> > >>>>>>>>>>> be closed and not have to worry about timing out. Even under
> > >>>>> ALOS,
> > >>>>>>> it
> > >>>>>>>>>>> seems best to go ahead and complete the processing of
> in-flight
> > >>>>>>>> records
> > >>>>>>>>>>> by committing. That way, if anything happens to die while
> it's
> > >>>>>>> paused,
> > >>>>>>>>>>> existing
> > >>>>>>>>>>> work won't have to be repeated. Plus, if there are any
> > >> processors
> > >>>>>>> with
> > >>>>>>>>> side
> > >>>>>>>>>>> effects, users won't have to tolerate weird edge cases where
> a
> > >>>>> pause
> > >>>>>>>>> occurs
> > >>>>>>>>>>> after a processor sees a record, but before the result is
> sent
> > >> to
> > >>>>>>> its
> > >>>>>>>>>>> outputs.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 5. I noticed that you proposed not to add a PAUSED state,
> but I
> > >>>>>>> didn't
> > >>>>>>>>>>> follow
> > >>>>>>>>>>> the rationale. Adding a state seems beneficial for a number
> of
> > >>>>>>>> reasons:
> > >>>>>>>>>>> StreamThreads already use the thread state to determine
> whether
> > >>>>> to
> > >>>>>>>>> process
> > >>>>>>>>>>> or not, so avoiding a new State would just mean adding a
> > >> separate
> > >>>>>>> flag
> > >>>>>>>>> to
> > >>>>>>>>>>> track
> > >>>>>>>>>>> and then checking your new flag in addition to the State in
> the
> > >>>>>>>> thread.
> > >>>>>>>>>>> Also,
> > >>>>>>>>>>> operating Streams applications is a non-trivial task, and
> users
> > >>>>> rely
> > >>>>>>>> on
> > >>>>>>>>>>> the State
> > >>>>>>>>>>> (and transitions) to understand Streams's behavior. Adding a
> > >>>>> PAUSED
> > >>>>>>>>> state
> > >>>>>>>>>>> is an elegant way to communicate to operators what is
> happening
> > >>>>> with
> > >>>>>>>> the
> > >>>>>>>>>>> application. Note that the person digging though logs and
> > >>>>> metrics,
> > >>>>>>>>> trying
> > >>>>>>>>>>> to understand why the application isn't doing anything is
> > >>>>> probably
> > >>>>>>> not
> > >>>>>>>>>>> going
> > >>>>>>>>>>> to be the same person who is calling pause() and resume().
> > Also,
> > >>>>> if
> > >>>>>>>> you
> > >>>>>>>>> add
> > >>>>>>>>>>> a state, you don't need `isPaused()`.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 5b. If you buy the arguments to go ahead and commit as well
> as
> > >>>>> the
> > >>>>>>>>>>> argument to add a State, then I'd also suggest to follow the
> > >>>>>>> existing
> > >>>>>>>>>>> patterns
> > >>>>>>>>>>> for the shutdown states by also adding PAUSING. That
> > >>>>>>>>>>> way, you'll also expose a way to understand that Streams
> > >> received
> > >>>>>>> the
> > >>>>>>>>>>> signal
> > >>>>>>>>>>> to pause, and that it's still processing and committing some
> > >>>>> records
> > >>>>>>>> in
> > >>>>>>>>>>> preparation to enter a PAUSED state. I'm not sure if a
> RESUMING
> > >>>>>>> state
> > >>>>>>>>> would
> > >>>>>>>>>>> also make sense.
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> I hit a tricky bit when thinking through having a PAUSED
> > >>>>> state...  If
> > >>>>>>>> one
> > >>>>>>>>>> is using Named Topologies, and some of them are paused, what
> > >>>>> state is
> > >>>>>>>> the
> > >>>>>>>>>> Streams instance in?  If we can agree on that, things may
> become
> > >>>>>>>>> clear....
> > >>>>>>>>>> I can see two quick ideas:
> > >>>>>>>>>>
> > >>>>>>>>>> 1.  The state is RUNNING and NamedTopologies have some other
> way
> > >>>>> to
> > >>>>>>>>>> indicate state.
> > >>>>>>>>>>
> > >>>>>>>>>> 2.  The state is something messy like PARTIALLY_PAUSED to
> > reflect
> > >>>>>>> that
> > >>>>>>>>> the
> > >>>>>>>>>> instance has something interesting going on.
> > >>>>>>>>>>
> > >>>>>>>>>> When I poked at things initially, I did try out having
> different
> > >>>>>>>> states,
> > >>>>>>>>>> and I readily agree that a PAUSING state may make sense.
> > >>>>> (Especially
> > >>>>>>>> if
> > >>>>>>>>>> there's a need to run commits before transitioning all the way
> > to
> > >>>>>>>>> PAUSED.)
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>> And that's all I have to say about that. I hope you don't
> find
> > >> my
> > >>>>>>>>>>> long message offputting. I'm fundamentally in favor of your
> > KIP,
> > >>>>>>>>>>> and I think with a little more explanation in the KIP, and a
> > few
> > >>>>>>>>>>> small tweaks to the proposal, we'll be able to provide good
> > >>>>>>>>>>> ergonomics to our users.
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks!
> > >>>>>>>>>>
> > >>>>>>>>>> Jim
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>> -John
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Sat, May 7, 2022, at 00:06, Guozhang Wang wrote:
> > >>>>>>>>>>>> I'm in favor of the "just pausing the instance itself“
> option
> > >>>>> as
> > >>>>>>>>> well. As
> > >>>>>>>>>>>> for EOS, the point is that when the processing is paused, we
> > >>>>> would
> > >>>>>>>> not
> > >>>>>>>>>>>> trigger any `producer.send` during the time, and the
> > >>>>> transaction
> > >>>>>>>>> timeout
> > >>>>>>>>>>> is
> > >>>>>>>>>>>> sort of relying on that behavior, so my point was that it's
> > >>>>>>> probably
> > >>>>>>>>>>> better
> > >>>>>>>>>>>> to also commit the processing before we pause it.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Guozhang
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Fri, May 6, 2022 at 6:12 PM Jim Hughes
> > >>>>>>>>> <[email protected]>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Matthias,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Since the only thing which will be paused is processing the
> > >>>>>>>>> topology, I
> > >>>>>>>>>>>>> think we can let commits happen naturally.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Good point about getting the paused state to new members;
> it
> > >>>>> is
> > >>>>>>>>> seeming
> > >>>>>>>>>>>>> like the "building block" approach is a good one to keep
> > >>>>> things
> > >>>>>>>>> simple
> > >>>>>>>>>>> at
> > >>>>>>>>>>>>> first.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Jim
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Fri, May 6, 2022 at 8:31 PM Matthias J. Sax <
> > >>>>> [email protected]
> > >>>>>>>>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I think it's tricky to propagate a pauseAll() via the
> > >>>>> rebalance
> > >>>>>>>>>>>>>> protocol. New members joining the group would need to get
> > >>>>>>> paused,
> > >>>>>>>>> too?
> > >>>>>>>>>>>>>> Could there be weird race conditions with overlapping
> > >>>>>>> pauseAll()
> > >>>>>>>>> and
> > >>>>>>>>>>>>>> resumeAll() calls on different instanced while there could
> > >>>>> be a
> > >>>>>>>>>>> errors /
> > >>>>>>>>>>>>>> network partitions or similar?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I would argue that similar to IQ, we provide the basic
> > >>>>> building
> > >>>>>>>>>>> blocks,
> > >>>>>>>>>>>>>> and leave it the user users to implement cross instance
> > >>>>>>>> management
> > >>>>>>>>>>> for a
> > >>>>>>>>>>>>>> pauseAll() scenario. -- Also, if there is really demand,
> we
> > >>>>> can
> > >>>>>>>>> always
> > >>>>>>>>>>>>>> add pauseAll()/resumeAll() as follow up work.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> About named typologies: I agree to Jim to not include them
> > >>>>> in
> > >>>>>>>> this
> > >>>>>>>>> KIP
> > >>>>>>>>>>>>>> as they are not a public feature yet. If we make named
> > >>>>>>> typologies
> > >>>>>>>>>>>>>> public, the corresponding KIP should extend the
> pause/resume
> > >>>>>>>>> feature
> > >>>>>>>>>>>>>> (ie, APIs) accordingly. Of course, the code can (and
> should)
> > >>>>>>>>> already
> > >>>>>>>>>>> be
> > >>>>>>>>>>>>>> setup to support it to be future proof.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Good call out about commit and EOS -- to simplify it, I
> > >>>>> think
> > >>>>>>> it
> > >>>>>>>>> might
> > >>>>>>>>>>>>>> be good to commit also for the at-least-once case?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -Matthias
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On 5/6/22 1:05 PM, Jim Hughes wrote:
> > >>>>>>>>>>>>>>> Hi Bill,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Great questions; I'll do my best to reply inline:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Fri, May 6, 2022 at 3:21 PM Bill Bejeck <
> > >>>>>>> [email protected]>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Jim,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks for the KIP.  I have a couple of meta-questions
> as
> > >>>>>>>> well:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> 1) Regarding pausing only a subset of running instances,
> > >>>>> I'm
> > >>>>>>>>>>> thinking
> > >>>>>>>>>>>>>> there
> > >>>>>>>>>>>>>>>> may be a use case for pausing all of them.
> > >>>>>>>>>>>>>>>>       Would it make sense to also allow for pausing all
> > >>>>>>>> instances
> > >>>>>>>>> by
> > >>>>>>>>>>>>>> adding a
> > >>>>>>>>>>>>>>>> method `pauseAll()` or something similar?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Honestly, I'm indifferent on this point.  Presently, I
> > >>>>> think
> > >>>>>>>>> what I
> > >>>>>>>>>>>>> have
> > >>>>>>>>>>>>>>> proposed is the minimal change to get the ability to
> pause
> > >>>>>>> and
> > >>>>>>>>>>> resume
> > >>>>>>>>>>>>>>> processing.  If adding a 'pauseAll()' is required, I'd be
> > >>>>>>> happy
> > >>>>>>>>> to
> > >>>>>>>>>>> do
> > >>>>>>>>>>>>>> that!
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>    From Guozhang's email, it sounds like this would
> require
> > >>>>>>> using
> > >>>>>>>>> the
> > >>>>>>>>>>>>>>> rebalance protocol to trigger the coordination.  Would
> > >>>>> there
> > >>>>>>> be
> > >>>>>>>>>>> enough
> > >>>>>>>>>>>>>> room
> > >>>>>>>>>>>>>>> in that approach to indicate that a named topology is to
> > >>>>> be
> > >>>>>>>>> paused
> > >>>>>>>>>>>>> across
> > >>>>>>>>>>>>>>> all nodes?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> 2) Would pausing affect standby tasks?  For example,
> > >>>>> imagine
> > >>>>>>>>> there
> > >>>>>>>>>>>>> are 3
> > >>>>>>>>>>>>>>>> instances A, B, and C.
> > >>>>>>>>>>>>>>>>       A user elects to pause instance C only but it
> hosts
> > >>>>> the
> > >>>>>>>>> standby
> > >>>>>>>>>>>>>> tasks
> > >>>>>>>>>>>>>>>> for A.
> > >>>>>>>>>>>>>>>>       Would the standby tasks on the paused application
> > >>>>>>> continue
> > >>>>>>>>> to
> > >>>>>>>>>>> read
> > >>>>>>>>>>>>>> from
> > >>>>>>>>>>>>>>>> the changelog topic?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Yes, standby tasks would continue reading from the
> > >>>>> changelog
> > >>>>>>>>> topic.
> > >>>>>>>>>>>>> All
> > >>>>>>>>>>>>>>> consumers would continue reading to avoid getting dropped
> > >>>>>>> from
> > >>>>>>>>> their
> > >>>>>>>>>>>>>>> consumer groups.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Jim
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>>>>> Bill
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Fri, May 6, 2022 at 2:44 PM Jim Hughes
> > >>>>>>>>>>>>> <[email protected]
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi Guozhang,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks for the feedback; responses inline below:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Fri, May 6, 2022 at 1:09 PM Guozhang Wang <
> > >>>>>>>>> [email protected]>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Hello Jim,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks for the proposed KIP. I have some meta
> questions
> > >>>>>>>> about
> > >>>>>>>>> it:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> 1) Would an instance always pause/resume all of its
> > >>>>>>> current
> > >>>>>>>>> owned
> > >>>>>>>>>>>>>>>>>> topologies (i.e. the named topologies), or are there
> > >>>>> any
> > >>>>>>>>>>> scenarios
> > >>>>>>>>>>>>>>>> where
> > >>>>>>>>>>>>>>>>> we
> > >>>>>>>>>>>>>>>>>> only want to pause/resume a subset of them?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> An instance may wish to pause some of its named
> > >>>>> topologies.
> > >>>>>>>> I
> > >>>>>>>>> was
> > >>>>>>>>>>>>>> unsure
> > >>>>>>>>>>>>>>>>> what to say about named topologies in the KIP since
> they
> > >>>>>>> seem
> > >>>>>>>>> to
> > >>>>>>>>>>> be
> > >>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>> internal detail at the moment.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I intend to add to KafkaStreamsNamedTopologyWrapper
> > >>>>> methods
> > >>>>>>>>> like:
> > >>>>>>>>>>>>>>>>>        public void pauseNamedTopology(final String
> > >>>>>>>>> topologyToPause)
> > >>>>>>>>>>>>>>>>>        public boolean isNamedTopologyPaused(final
> String
> > >>>>>>>>> topology)
> > >>>>>>>>>>>>>>>>>        public void resumeNamedTopology(final String
> > >>>>>>>>>>> topologyToResume)
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> 2) From a user's perspective, do we want to always
> > >>>>> issue a
> > >>>>>>>>>>>>>>>> `pause/resume`
> > >>>>>>>>>>>>>>>>>> to all the instances or not? For example, we can
> define
> > >>>>>>> the
> > >>>>>>>>>>>>> semantics
> > >>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>> the function as "you only need to call this function
> on
> > >>>>>>> any
> > >>>>>>>> of
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> application's instances, and all instances would then
> > >>>>>>> pause
> > >>>>>>>>> (via
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> rebalance error codes)", or as "you would call this
> > >>>>>>> function
> > >>>>>>>>> for
> > >>>>>>>>>>> all
> > >>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>> instances of an application". Which one are you
> > >>>>> referring
> > >>>>>>>> to?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> My initial intent is that one would call this function
> > >>>>> on
> > >>>>>>> any
> > >>>>>>>>>>>>> instances
> > >>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>> the application that one wishes to pause.  This should
> > >>>>>>> allow
> > >>>>>>>>> more
> > >>>>>>>>>>>>>> control
> > >>>>>>>>>>>>>>>>> (in case one wanted to pause a portion of the
> > >>>>> instances).
> > >>>>>>> On
> > >>>>>>>>> the
> > >>>>>>>>>>>>> other
> > >>>>>>>>>>>>>>>>> hand, this approach would put more work on the
> > >>>>> implementer
> > >>>>>>> to
> > >>>>>>>>>>>>>> coordinate
> > >>>>>>>>>>>>>>>>> calling pause or resume across instances.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> If the other option is more suitable, happy to do that
> > >>>>>>>> instead.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> 3) With EOS, there's a transaction timeout which would
> > >>>>>>>>> determine
> > >>>>>>>>>>> how
> > >>>>>>>>>>>>>>>>> long a
> > >>>>>>>>>>>>>>>>>> transaction can stay idle before it's force-aborted on
> > >>>>> the
> > >>>>>>>>> broker
> > >>>>>>>>>>>>>>>> side. I
> > >>>>>>>>>>>>>>>>>> think when a pause is issued, that means we'd need to
> > >>>>>>>>> immediately
> > >>>>>>>>>>>>>>>> commit
> > >>>>>>>>>>>>>>>>>> the current transaction for EOS since we do not know
> > >>>>> how
> > >>>>>>>> long
> > >>>>>>>>> we
> > >>>>>>>>>>>>> could
> > >>>>>>>>>>>>>>>>>> pause for. Is that right? If yes could you please
> > >>>>> clarify
> > >>>>>>>>> that in
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>> doc
> > >>>>>>>>>>>>>>>>>> as well.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Good point.  My intent is for pause() to wait for the
> > >>>>> next
> > >>>>>>>>>>> iteration
> > >>>>>>>>>>>>>>>>> through `runOnce()` and then only skip over the
> > >>>>> processing
> > >>>>>>>> for
> > >>>>>>>>>>> paused
> > >>>>>>>>>>>>>>>> tasks
> > >>>>>>>>>>>>>>>>> in `taskManager.process(numIterations, time)`.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Do commits live inside that call or do they live
> > >>>>>>>>> across/outside of
> > >>>>>>>>>>>>> it?
> > >>>>>>>>>>>>>>>> In
> > >>>>>>>>>>>>>>>>> the former case, I think there shouldn't be any issues
> > >>>>> with
> > >>>>>>>>> EOS.
> > >>>>>>>>>>>>>>>>> Otherwise, we may need to work through some details to
> > >>>>> get
> > >>>>>>>> EOS
> > >>>>>>>>>>> right.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Once we figure that out, I can update the KIP.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jim
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Guozhang
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Wed, May 4, 2022 at 10:51 AM Jim Hughes
> > >>>>>>>>>>>>>>>> <[email protected]
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> I have written up a KIP for adding the ability to
> > >>>>> pause
> > >>>>>>> and
> > >>>>>>>>>>> resume
> > >>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>> processing of a topology in AK Streams.  The KIP is
> > >>>>> here:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Thanks in advance for your feedback!
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Jim
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>>>>>> -- Guozhang
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> -- Guozhang
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> -- Guozhang
> > >>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-834: Pause / Resume KafkaStreams Topologies

Reply via email to