I agree that there is no strong ordering when there is more than one
socket connection. Currently, we rely on controllerEpoch and leaderEpoch
to ensure that the receiving broker picks up the latest state for each
partition.
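
(For illustration only, a minimal sketch of that kind of epoch check, with
hypothetical names rather than the actual broker code, which lives in Scala:)

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: the broker remembers the latest
    // (controllerEpoch, leaderEpoch) it has applied per partition and ignores
    // state carrying older epochs, so it converges on the latest state no
    // matter which connection delivered the request.
    final class EpochGuard {
        private final Map<String, int[]> appliedEpochs = new HashMap<>();

        synchronized boolean shouldApply(String topicPartition,
                                         int controllerEpoch,
                                         int leaderEpoch) {
            int[] current = appliedEpochs.get(topicPartition);
            if (current != null
                    && (controllerEpoch < current[0] || leaderEpoch <= current[1])) {
                return false; // stale state from an older controller/leader generation
            }
            appliedEpochs.put(topicPartition, new int[] {controllerEpoch, leaderEpoch});
            return true;
        }
    }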

One potential issue with the deque approach is that if the queue is full,
there is no guarantee that the controller requests will be enqueued quickly.

Thanks,

Jun

On Fri, Jul 20, 2018 at 5:25 AM, Mayuresh Gharat <gharatmayures...@gmail.com
> wrote:

> Yea, the correlationId is only set to 0 in the NetworkClient constructor.
> Since we reuse the same NetworkClient between Controller and the broker, a
> disconnection should not cause it to reset to 0, in which case it can be
> used to reject obsolete requests.
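
(For illustration, a minimal sketch of how a broker could use that, with
made-up names; nothing like this exists in the current code:)

    // Hypothetical sketch: since the controller reuses one NetworkClient, its
    // correlation ids keep increasing across reconnects, so the broker could
    // drop any controller request whose correlation id is not larger than the
    // last one it accepted on that controller's channel.
    final class ControllerRequestGate {
        private int lastAcceptedCorrelationId = -1;

        synchronized boolean accept(int correlationId) {
            if (correlationId <= lastAcceptedCorrelationId) {
                return false; // obsolete request, sent before one we already saw
            }
            lastAcceptedCorrelationId = correlationId;
            return true;
        }
    }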
>
> Thanks,
>
> Mayuresh
>
> On Thu, Jul 19, 2018 at 1:52 PM Lucas Wang <lucasatu...@gmail.com> wrote:
>
> > @Dong,
> > Great example and explanation, thanks!
> >
> > @All
> > Regarding the example given by Dong, it seems even if we use a queue,
> and a
> > dedicated controller request handling thread,
> > the same result can still happen because R1_a will be sent on one
> > connection, and R1_b & R2 will be sent on a different connection,
> > and there is no ordering between different connections on the broker
> side.
> > I was discussing with Mayuresh offline, and it seems correlation id
> within
> > the same NetworkClient object is monotonically increasing and never
> reset,
> > hence a broker can leverage that to properly reject obsolete requests.
> > Thoughts?
> >
> > Thanks,
> > Lucas
> >
> > On Thu, Jul 19, 2018 at 12:11 PM, Mayuresh Gharat <
> > gharatmayures...@gmail.com> wrote:
> >
> > > Actually nvm, correlationId is reset in case of connection loss, I
> think.
> > >
> > > Thanks,
> > >
> > > Mayuresh
> > >
> > > On Thu, Jul 19, 2018 at 11:11 AM Mayuresh Gharat <
> > > gharatmayures...@gmail.com>
> > > wrote:
> > >
> > > > I agree with Dong that out-of-order processing can happen with 2
> > > > separate queues as well, and it can even happen today.
> > > > Can we use the correlationId in the request from the controller to
> the
> > > > broker to handle ordering ?
> > > >
> > > > Thanks,
> > > >
> > > > Mayuresh
> > > >
> > > >
> > > > On Thu, Jul 19, 2018 at 6:41 AM Becket Qin <becket....@gmail.com>
> > wrote:
> > > >
> > > >> Good point, Joel. I agree that a dedicated controller request handling
> > > >> thread would provide better isolation. It also solves the reordering
> > > >> issue.
> > > >>
> > > >> On Thu, Jul 19, 2018 at 2:23 PM, Joel Koshy <jjkosh...@gmail.com>
> > > wrote:
> > > >>
> > > >> > Good example. I think this scenario can occur in the current code
> as
> > > >> well
> > > >> > but with even lower probability given that there are other
> > > >> non-controller
> > > >> > requests interleaved. It is still sketchy though and I think a
> safer
> > > >> > approach would be separate queues and pinning controller request
> > > >> handling
> > > >> > to one handler thread.
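
(A rough sketch of that shape, with illustrative names only, just to make the
idea concrete:)

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch: controller requests get their own bounded queue, drained by a
    // single dedicated thread, so they are processed strictly in the order
    // they were enqueued and never compete with data-plane handler threads.
    final class ControllerRequestHandler implements Runnable {
        private final BlockingQueue<Runnable> controllerQueue =
                new LinkedBlockingQueue<>(20); // capacity is an assumed default

        void enqueue(Runnable controllerRequest) throws InterruptedException {
            controllerQueue.put(controllerRequest); // blocks if the queue is full
        }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    controllerQueue.take().run(); // one thread => in-order processing
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Usage: start one instance on its own thread, e.g.
    //   ControllerRequestHandler handler = new ControllerRequestHandler();
    //   new Thread(handler, "controller-request-handler").start();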
> > > >> >
> > > >> > On Wed, Jul 18, 2018 at 11:12 PM, Dong Lin <lindon...@gmail.com>
> > > wrote:
> > > >> >
> > > >> > > Hey Becket,
> > > >> > >
> > > >> > > I think you are right that there may be out-of-order processing.
> > > >> However,
> > > >> > > it seems that out-of-order processing may also happen even if we
> > > use a
> > > >> > > separate queue.
> > > >> > >
> > > >> > > Here is the example:
> > > >> > >
> > > >> > > - Controller sends R1 and gets disconnected before receiving the
> > > >> > > response. Then it reconnects and sends R2. Both requests now stay
> > > >> > > in the controller request queue in the order they are sent.
> > > >> > > - thread1 takes R1_a from the request queue and then thread2 takes
> > > >> > > R2 from the request queue almost at the same time.
> > > >> > > - So R1_a and R2 are processed in parallel. There is a chance that
> > > >> > > R2's processing is completed before R1's.
> > > >> > >
> > > >> > > If out-of-order processing can happen for both approaches with
> > very
> > > >> low
> > > >> > > probability, it may not be worthwhile to add the extra queue.
> What
> > > do
> > > >> you
> > > >> > > think?
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Dong
> > > >> > >
> > > >> > >
> > > >> > > On Wed, Jul 18, 2018 at 6:17 PM, Becket Qin <
> becket....@gmail.com
> > >
> > > >> > wrote:
> > > >> > >
> > > >> > > > Hi Mayuresh/Joel,
> > > >> > > >
> > > >> > > > Using the request channel as a deque was brought up some time ago
> > > >> > > > when we were initially thinking of prioritizing the requests. The
> > > >> > > > concern was that the controller requests are supposed to be
> > > >> > > > processed in order. If we can ensure that there is at most one
> > > >> > > > controller request in the request channel, the order is not a
> > > >> > > > concern. But in cases where more than one controller request is
> > > >> > > > inserted into the queue, the controller request order may change
> > > >> > > > and cause problems. For example, think about the following sequence:
> > > >> > > > 1. Controller successfully sends a request R1 to the broker.
> > > >> > > > 2. Broker receives R1 and puts the request at the head of the
> > > >> > > > request queue.
> > > >> > > > 3. The controller-to-broker connection fails and the controller
> > > >> > > > reconnects to the broker.
> > > >> > > > 4. Controller sends a request R2 to the broker.
> > > >> > > > 5. Broker receives R2 and adds it to the head of the request queue.
> > > >> > > > Now on the broker side, R2 will be processed before R1 is processed,
> > > >> > > > which may cause problems.
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > >
> > > >> > > > Jiangjie (Becket) Qin
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > On Thu, Jul 19, 2018 at 3:23 AM, Joel Koshy <
> > jjkosh...@gmail.com>
> > > >> > wrote:
> > > >> > > >
> > > >> > > > > @Mayuresh - I like your idea. It appears to be a simpler, less
> > > >> > > > > invasive alternative and it should work. Jun/Becket/others, do
> > > >> > > > > you see any pitfalls with this approach?
> > > >> > > > >
> > > >> > > > > On Wed, Jul 18, 2018 at 12:03 PM, Lucas Wang <
> > > >> lucasatu...@gmail.com>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > @Mayuresh,
> > > >> > > > > > That's a very interesting idea that I haven't thought of
> > > >> > > > > > before.
> > > >> > > > > > It seems to solve our problem at hand pretty well, and
> also
> > > >> > > > > > avoids the need to have a new size metric and capacity
> > config
> > > >> > > > > > for the controller request queue. In fact, if we were to
> > adopt
> > > >> > > > > > this design, there is no public interface change, and we
> > > >> > > > > > probably don't need a KIP.
> > > >> > > > > > Also, implementation-wise, it seems the Java class
> > > >> > > > > > LinkedBlockingDeque can readily satisfy the requirement by
> > > >> > > > > > supporting a capacity and also allowing insertion at both ends.
> > > >> > > > > >
> > > >> > > > > > My only concern is that this design is tied to the
> > coincidence
> > > >> that
> > > >> > > > > > we have two request priorities and there are two ends to a
> > > >> deque.
> > > >> > > > > > Hence by using the proposed design, it seems the network
> > layer
> > > >> is
> > > >> > > > > > more tightly coupled with upper layer logic, e.g. if we
> were
> > > to
> > > >> add
> > > >> > > > > > an extra priority level in the future for some reason, we
> > > would
> > > >> > > > probably
> > > >> > > > > > need to go back to the design of separate queues, one for
> > each
> > > >> > > priority
> > > >> > > > > > level.
> > > >> > > > > >
> > > >> > > > > > In summary, I'm ok with both designs and lean toward your
> > > >> suggested
> > > >> > > > > > approach.
> > > >> > > > > > Let's hear what others think.
> > > >> > > > > >
> > > >> > > > > > @Becket,
> > > >> > > > > > In light of Mayuresh's suggested new design, I'm answering
> > > your
> > > >> > > > question
> > > >> > > > > > only in the context
> > > >> > > > > > of the current KIP design: I think your suggestion makes
> > > sense,
> > > >> and
> > > >> > > I'm
> > > >> > > > > ok
> > > >> > > > > > with removing the capacity config and
> > > >> > > > > > just relying on the default value of 20 being sufficient.
> > > >> > > > > >
> > > >> > > > > > Thanks,
> > > >> > > > > > Lucas
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Wed, Jul 18, 2018 at 9:57 AM, Mayuresh Gharat <
> > > >> > > > > > gharatmayures...@gmail.com
> > > >> > > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Hi Lucas,
> > > >> > > > > > >
> > > >> > > > > > > Seems like the main intent here is to prioritize the
> > > >> > > > > > > controller requests over any other requests.
> > > >> > > > > > > In that case, we can change the request queue to a deque,
> > > >> > > > > > > where you always insert the normal requests (produce,
> > > >> > > > > > > consume, etc.) at the end of the deque, but if it's a
> > > >> > > > > > > controller request, you insert it at the head of the queue.
> > > >> > > > > > > This ensures that the controller requests will be given
> > > >> > > > > > > higher priority than other requests.
> > > >> > > > > > >
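
(A minimal sketch of that single-deque idea, with illustrative names rather
than the real RequestChannel code:)

    import java.util.concurrent.LinkedBlockingDeque;

    // Sketch: normal requests are appended at the tail, controller requests
    // jump to the head, and handler threads always take from the head, so a
    // controller request is always the next one picked up.
    final class PrioritizedRequestChannel<T> {
        private final LinkedBlockingDeque<T> deque =
                new LinkedBlockingDeque<>(500); // bounded capacity, placeholder value

        void send(T request, boolean isControllerRequest) throws InterruptedException {
            if (isControllerRequest) {
                deque.putFirst(request);   // controller request goes to the head
            } else {
                deque.putLast(request);    // produce/fetch/etc. go to the tail
            }
        }

        T receive() throws InterruptedException {
            return deque.takeFirst();      // handler threads pop from the head
        }
    }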
> > > >> > > > > > > Also, since we only read one request from the socket and
> > > >> > > > > > > mute the channel, and only unmute it after handling the
> > > >> > > > > > > request, this would ensure that we don't handle controller
> > > >> > > > > > > requests out of order.
> > > >> > > > > > >
> > > >> > > > > > > With this approach we can avoid the second queue and the
> > > >> > additional
> > > >> > > > > > config
> > > >> > > > > > > for the size of the queue.
> > > >> > > > > > >
> > > >> > > > > > > What do you think ?
> > > >> > > > > > >
> > > >> > > > > > > Thanks,
> > > >> > > > > > >
> > > >> > > > > > > Mayuresh
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > On Wed, Jul 18, 2018 at 3:05 AM Becket Qin <
> > > >> becket....@gmail.com
> > > >> > >
> > > >> > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Hey Joel,
> > > >> > > > > > > >
> > > >> > > > > > > > Thanks for the detailed explanation. I agree the current
> > > >> > > > > > > > design makes sense.
> > > >> > > > > > > > My confusion is about whether the new config for the
> > > >> controller
> > > >> > > > queue
> > > >> > > > > > > > capacity is necessary. I cannot think of a case in
> which
> > > >> users
> > > >> > > > would
> > > >> > > > > > > change
> > > >> > > > > > > > it.
> > > >> > > > > > > >
> > > >> > > > > > > > Thanks,
> > > >> > > > > > > >
> > > >> > > > > > > > Jiangjie (Becket) Qin
> > > >> > > > > > > >
> > > >> > > > > > > > On Wed, Jul 18, 2018 at 6:00 PM, Becket Qin <
> > > >> > > becket....@gmail.com>
> > > >> > > > > > > wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > > Hi Lucas,
> > > >> > > > > > > > >
> > > >> > > > > > > > > I guess my question can be rephrased to "do we expect
> > > >> > > > > > > > > users to ever change the controller request queue
> > > >> > > > > > > > > capacity"? If we agree that 20 is already a very generous
> > > >> > > > > > > > > default number and we do not expect users to change it,
> > > >> > > > > > > > > is it still necessary to expose this as a config?
> > > >> > > > > > > > >
> > > >> > > > > > > > > Thanks,
> > > >> > > > > > > > >
> > > >> > > > > > > > > Jiangjie (Becket) Qin
> > > >> > > > > > > > >
> > > >> > > > > > > > > On Wed, Jul 18, 2018 at 2:29 AM, Lucas Wang <
> > > >> > > > lucasatu...@gmail.com
> > > >> > > > > >
> > > >> > > > > > > > wrote:
> > > >> > > > > > > > >
> > > >> > > > > > > > >> @Becket
> > > >> > > > > > > > >> 1. Thanks for the comment. You are right that
> > normally
> > > >> there
> > > >> > > > > should
> > > >> > > > > > be
> > > >> > > > > > > > >> just
> > > >> > > > > > > > >> one controller request because of muting,
> > > >> > > > > > > > >> and I had NOT intended to say there would be many
> > > >> enqueued
> > > >> > > > > > controller
> > > >> > > > > > > > >> requests.
> > > >> > > > > > > > >> I went through the KIP again, and I'm not sure
> which
> > > part
> > > >> > > > conveys
> > > >> > > > > > that
> > > >> > > > > > > > >> info.
> > > >> > > > > > > > >> I'd be happy to revise if you point out the section.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> 2. Though it should not happen in normal
> conditions,
> > > the
> > > >> > > current
> > > >> > > > > > > design
> > > >> > > > > > > > >> does not preclude multiple controllers running
> > > >> > > > > > > > >> at the same time, hence if we don't have the
> > controller
> > > >> > queue
> > > >> > > > > > capacity
> > > >> > > > > > > > >> config and simply make its capacity to be 1,
> > > >> > > > > > > > >> network threads handling requests from different
> > > >> controllers
> > > >> > > > will
> > > >> > > > > be
> > > >> > > > > > > > >> blocked during those troublesome times,
> > > >> > > > > > > > >> which is probably not what we want. On the other
> > hand,
> > > >> > adding
> > > >> > > > the
> > > >> > > > > > > extra
> > > >> > > > > > > > >> config with a default value, say 20, guards us from
> > > >> issues
> > > >> > in
> > > >> > > > > those
> > > >> > > > > > > > >> troublesome times, and IMO there isn't much
> downside
> > of
> > > >> > adding
> > > >> > > > the
> > > >> > > > > > > extra
> > > >> > > > > > > > >> config.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> @Mayuresh
> > > >> > > > > > > > >> Good catch, this sentence is an obsolete statement
> > > based
> > > >> on
> > > >> > a
> > > >> > > > > > previous
> > > >> > > > > > > > >> design. I've revised the wording in the KIP.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> Thanks,
> > > >> > > > > > > > >> Lucas
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> On Tue, Jul 17, 2018 at 10:33 AM, Mayuresh Gharat <
> > > >> > > > > > > > >> gharatmayures...@gmail.com> wrote:
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> > Hi Lucas,
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Thanks for the KIP.
> > > >> > > > > > > > >> > I am trying to understand why you think "The
> memory
> > > >> > > > consumption
> > > >> > > > > > can
> > > >> > > > > > > > rise
> > > >> > > > > > > > >> > given the total number of queued requests can go
> up
> > > to
> > > >> 2x"
> > > >> > > in
> > > >> > > > > the
> > > >> > > > > > > > impact
> > > >> > > > > > > > >> > section. Normally the requests from controller
> to a
> > > >> Broker
> > > >> > > are
> > > >> > > > > not
> > > >> > > > > > > > high
> > > >> > > > > > > > >> > volume, right ?
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Thanks,
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Mayuresh
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > On Tue, Jul 17, 2018 at 5:06 AM Becket Qin <
> > > >> > > > > becket....@gmail.com>
> > > >> > > > > > > > >> wrote:
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > > Thanks for the KIP, Lucas. Separating the
> control
> > > >> plane
> > > >> > > from
> > > >> > > > > the
> > > >> > > > > > > > data
> > > >> > > > > > > > >> > plane
> > > >> > > > > > > > >> > > makes a lot of sense.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > In the KIP you mentioned that the controller
> > > request
> > > >> > queue
> > > >> > > > may
> > > >> > > > > > > have
> > > >> > > > > > > > >> many
> > > >> > > > > > > > >> > > requests in it. Will this be a common case? The
> > > >> > controller
> > > >> > > > > > > > >> > > requests in it. Will this be a common case? The
> > > >> > > > > > > > >> > > controller requests still go through the
> > > >> > > > > > > > >> > > SocketServer. The SocketServer
> > > >> > > > the
> > > >> > > > > > > > channel
> > > >> > > > > > > > >> > once
> > > >> > > > > > > > >> > > a request is read and put into the request
> > channel.
> > > >> So
> > > >> > > > > assuming
> > > >> > > > > > > > there
> > > >> > > > > > > > >> is
> > > >> > > > > > > > >> > > only one connection between controller and each
> > > >> broker,
> > > >> > on
> > > >> > > > the
> > > >> > > > > > > > broker
> > > >> > > > > > > > >> > side,
> > > >> > > > > > > > >> > > there should be only one controller request in
> > the
> > > >> > > > controller
> > > >> > > > > > > > request
> > > >> > > > > > > > >> > queue
> > > >> > > > > > > > >> > > at any given time. If that is the case, do we
> > need
> > > a
> > > >> > > > separate
> > > >> > > > > > > > >> controller
> > > >> > > > > > > > >> > > request queue capacity config? The default
> value
> > 20
> > > >> > means
> > > >> > > > that
> > > >> > > > > > we
> > > >> > > > > > > > >> expect
> > > >> > > > > > > > >> > > there are 20 controller switches to happen in a
> > > short
> > > >> > > period
> > > >> > > > > of
> > > >> > > > > > > > time.
> > > >> > > > > > > > >> I
> > > >> > > > > > > > >> > am
> > > >> > > > > > > > >> > > not sure whether someone should increase the
> > > >> > > > > > > > >> > > controller request queue capacity to handle such a
> > > >> > > > > > > > >> > > case, as that seems to indicate something very wrong
> > > >> > > > > > > > >> > > has happened.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > Thanks,
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > Jiangjie (Becket) Qin
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > On Fri, Jul 13, 2018 at 1:10 PM, Dong Lin <
> > > >> > > > > lindon...@gmail.com>
> > > >> > > > > > > > >> wrote:
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > > Thanks for the update Lucas.
> > > >> > > > > > > > >> > > >
> > > >> > > > > > > > >> > > > I think the motivation section is intuitive.
> It
> > > >> will
> > > >> > be
> > > >> > > > good
> > > >> > > > > > to
> > > >> > > > > > > > >> learn
> > > >> > > > > > > > >> > > more
> > > >> > > > > > > > >> > > > about the comments from other reviewers.
> > > >> > > > > > > > >> > > >
> > > >> > > > > > > > >> > > > On Thu, Jul 12, 2018 at 9:48 PM, Lucas Wang <
> > > >> > > > > > > > lucasatu...@gmail.com>
> > > >> > > > > > > > >> > > wrote:
> > > >> > > > > > > > >> > > >
> > > >> > > > > > > > >> > > > > Hi Dong,
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > I've updated the motivation section of the
> > KIP
> > > by
> > > >> > > > > explaining
> > > >> > > > > > > the
> > > >> > > > > > > > >> > cases
> > > >> > > > > > > > >> > > > that
> > > >> > > > > > > > >> > > > > would have user impacts.
> > > >> > > > > > > > >> > > > > Please take a look at let me know your
> > > comments.
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > Thanks,
> > > >> > > > > > > > >> > > > > Lucas
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > On Mon, Jul 9, 2018 at 5:53 PM, Lucas Wang
> <
> > > >> > > > > > > > lucasatu...@gmail.com
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > > > wrote:
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > > Hi Dong,
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > The simulation of disk being slow is
> merely
> > > >> for me
> > > >> > > to
> > > >> > > > > > easily
> > > >> > > > > > > > >> > > construct
> > > >> > > > > > > > >> > > > a
> > > >> > > > > > > > >> > > > > > testing scenario
> > > >> > > > > > > > >> > > > > > with a backlog of produce requests. In
> > > >> production,
> > > >> > > > other
> > > >> > > > > > > than
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > disk
> > > >> > > > > > > > >> > > > > > being slow, a backlog of
> > > >> > > > > > > > >> > > > > > produce requests may also be caused by
> high
> > > >> > produce
> > > >> > > > QPS.
> > > >> > > > > > > > >> > > > > > In that case, we may not want to kill the
> > > >> broker
> > > >> > and
> > > >> > > > > > that's
> > > >> > > > > > > > when
> > > >> > > > > > > > >> > this
> > > >> > > > > > > > >> > > > KIP
> > > >> > > > > > > > >> > > > > > can be useful, both for JBOD
> > > >> > > > > > > > >> > > > > > and non-JBOD setup.
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > Going back to your previous question
> about
> > > each
> > > >> > > > > > > ProduceRequest
> > > >> > > > > > > > >> > > covering
> > > >> > > > > > > > >> > > > > 20
> > > >> > > > > > > > >> > > > > > partitions that are randomly
> > > >> > > > > > > > >> > > > > > distributed, let's say a LeaderAndIsr
> > request
> > > >> is
> > > >> > > > > enqueued
> > > >> > > > > > > that
> > > >> > > > > > > > >> > tries
> > > >> > > > > > > > >> > > to
> > > >> > > > > > > > >> > > > > > switch the current broker, say broker0,
> > from
> > > >> > leader
> > > >> > > to
> > > >> > > > > > > > follower
> > > >> > > > > > > > >> > > > > > *for one of the partitions*, say
> *test-0*.
> > > For
> > > >> the
> > > >> > > > sake
> > > >> > > > > of
> > > >> > > > > > > > >> > argument,
> > > >> > > > > > > > >> > > > > > let's also assume the other brokers, say
> > > >> broker1,
> > > >> > > have
> > > >> > > > > > > > *stopped*
> > > >> > > > > > > > >> > > > fetching
> > > >> > > > > > > > >> > > > > > from
> > > >> > > > > > > > >> > > > > > the current broker, i.e. broker0.
> > > >> > > > > > > > >> > > > > > 1. If the enqueued produce requests have
> > > acks =
> > > >> > -1
> > > >> > > > > (ALL)
> > > >> > > > > > > > >> > > > > >   1.1 without this KIP, the
> ProduceRequests
> > > >> ahead
> > > >> > of
> > > >> > > > > > > > >> LeaderAndISR
> > > >> > > > > > > > >> > > will
> > > >> > > > > > > > >> > > > be
> > > >> > > > > > > > >> > > > > > put into the purgatory,
> > > >> > > > > > > > >> > > > > >         and since they'll never be
> > replicated
> > > >> to
> > > >> > > other
> > > >> > > > > > > brokers
> > > >> > > > > > > > >> > > (because
> > > >> > > > > > > > >> > > > > of
> > > >> > > > > > > > >> > > > > > the assumption made above), they will
> > > >> > > > > > > > >> > > > > >         be completed either when the
> > > >> LeaderAndISR
> > > >> > > > > request
> > > >> > > > > > is
> > > >> > > > > > > > >> > > processed
> > > >> > > > > > > > >> > > > or
> > > >> > > > > > > > >> > > > > > when the timeout happens.
> > > >> > > > > > > > >> > > > > >   1.2 With this KIP, broker0 will
> > immediately
> > > >> > > > transition
> > > >> > > > > > the
> > > >> > > > > > > > >> > > partition
> > > >> > > > > > > > >> > > > > > test-0 to become a follower,
> > > >> > > > > > > > >> > > > > >         after the current broker sees the
> > > >> > > replication
> > > >> > > > of
> > > >> > > > > > the
> > > >> > > > > > > > >> > > remaining
> > > >> > > > > > > > >> > > > 19
> > > >> > > > > > > > >> > > > > > partitions, it can send a response
> > indicating
> > > >> that
> > > >> > > > > > > > >> > > > > >         it's no longer the leader for the
> > > >> > "test-0".
> > > >> > > > > > > > >> > > > > >   To see the latency difference between
> 1.1
> > > and
> > > >> > 1.2,
> > > >> > > > > let's
> > > >> > > > > > > say
> > > >> > > > > > > > >> > there
> > > >> > > > > > > > >> > > > are
> > > >> > > > > > > > >> > > > > > 24K produce requests ahead of the
> > > LeaderAndISR,
> > > >> > and
> > > >> > > > > there
> > > >> > > > > > > are
> > > >> > > > > > > > 8
> > > >> > > > > > > > >> io
> > > >> > > > > > > > >> > > > > threads,
> > > >> > > > > > > > >> > > > > >   so each io thread will process
> > > approximately
> > > >> > 3000
> > > >> > > > > > produce
> > > >> > > > > > > > >> > requests.
> > > >> > > > > > > > >> > > > Now
> > > >> > > > > > > > >> > > > > > let's investigate the io thread that
> > finally
> > > >> > > processed
> > > >> > > > > the
> > > >> > > > > > > > >> > > > LeaderAndISR.
> > > >> > > > > > > > >> > > > > >   For the 3000 produce requests, if we
> > model
> > > >> the
> > > >> > > time
> > > >> > > > > when
> > > >> > > > > > > > their
> > > >> > > > > > > > >> > > > > remaining
> > > >> > > > > > > > >> > > > > > 19 partitions catch up as t0, t1,
> ...t2999,
> > > and
> > > >> > the
> > > >> > > > > > > > LeaderAndISR
> > > >> > > > > > > > >> > > > request
> > > >> > > > > > > > >> > > > > is
> > > >> > > > > > > > >> > > > > > processed at time t3000.
> > > >> > > > > > > > >> > > > > >   Without this KIP, the 1st produce
> request
> > > >> would
> > > >> > > have
> > > >> > > > > > > waited
> > > >> > > > > > > > an
> > > >> > > > > > > > >> > > extra
> > > >> > > > > > > > >> > > > > > t3000 - t0 time in the purgatory, the 2nd
> > an
> > > >> extra
> > > >> > > > time
> > > >> > > > > of
> > > >> > > > > > > > >> t3000 -
> > > >> > > > > > > > >> > > t1,
> > > >> > > > > > > > >> > > > > etc.
> > > >> > > > > > > > >> > > > > >   Roughly speaking, the latency
> difference
> > is
> > > >> > bigger
> > > >> > > > for
> > > >> > > > > > the
> > > >> > > > > > > > >> > earlier
> > > >> > > > > > > > >> > > > > > produce requests than for the later ones.
> > For
> > > >> the
> > > >> > > same
> > > >> > > > > > > reason,
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > more
> > > >> > > > > > > > >> > > > > > ProduceRequests queued
> > > >> > > > > > > > >> > > > > >   before the LeaderAndISR, the bigger
> > benefit
> > > >> we
> > > >> > get
> > > >> > > > > > (capped
> > > >> > > > > > > > by
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > > > > produce timeout).
> > > >> > > > > > > > >> > > > > > 2. If the enqueued produce requests have
> > > >> acks=0 or
> > > >> > > > > acks=1
> > > >> > > > > > > > >> > > > > >   There will be no latency differences in
> > > this
> > > >> > case,
> > > >> > > > but
> > > >> > > > > > > > >> > > > > >   2.1 without this KIP, the records of
> > > >> partition
> > > >> > > > test-0
> > > >> > > > > in
> > > >> > > > > > > the
> > > >> > > > > > > > >> > > > > > ProduceRequests ahead of the LeaderAndISR
> > > will
> > > >> be
> > > >> > > > > appended
> > > >> > > > > > > to
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > local
> > > >> > > > > > > > >> > > > > log,
> > > >> > > > > > > > >> > > > > >         and eventually be truncated after
> > > >> > processing
> > > >> > > > the
> > > >> > > > > > > > >> > > LeaderAndISR.
> > > >> > > > > > > > >> > > > > > This is what's referred to as
> > > >> > > > > > > > >> > > > > >         "some unofficial definition of
> data
> > > >> loss
> > > >> > in
> > > >> > > > > terms
> > > >> > > > > > of
> > > >> > > > > > > > >> > messages
> > > >> > > > > > > > >> > > > > > beyond the high watermark".
> > > >> > > > > > > > >> > > > > >   2.2 with this KIP, we can mitigate the
> > > effect
> > > >> > > since
> > > >> > > > if
> > > >> > > > > > the
> > > >> > > > > > > > >> > > > LeaderAndISR
> > > >> > > > > > > > >> > > > > > is immediately processed, the response to
> > > >> > producers
> > > >> > > > will
> > > >> > > > > > > have
> > > >> > > > > > > > >> > > > > >         the NotLeaderForPartition error,
> > > >> causing
> > > >> > > > > producers
> > > >> > > > > > > to
> > > >> > > > > > > > >> retry
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > This explanation above is the benefit for
> > > >> reducing
> > > >> > > the
> > > >> > > > > > > latency
> > > >> > > > > > > > >> of a
> > > >> > > > > > > > >> > > > > broker
> > > >> > > > > > > > >> > > > > > becoming the follower,
> > > >> > > > > > > > >> > > > > > closely related is reducing the latency
> of
> > a
> > > >> > broker
> > > >> > > > > > becoming
> > > >> > > > > > > > the
> > > >> > > > > > > > >> > > > leader.
> > > >> > > > > > > > >> > > > > > In this case, the benefit is even more
> > > >> obvious, if
> > > >> > > > other
> > > >> > > > > > > > brokers
> > > >> > > > > > > > >> > have
> > > >> > > > > > > > >> > > > > > resigned leadership, and the
> > > >> > > > > > > > >> > > > > > current broker should take leadership.
> Any
> > > >> delay
> > > >> > in
> > > >> > > > > > > processing
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > > > > LeaderAndISR will be perceived
> > > >> > > > > > > > >> > > > > > by clients as unavailability. In extreme
> > > cases,
> > > >> > this
> > > >> > > > can
> > > >> > > > > > > cause
> > > >> > > > > > > > >> > failed
> > > >> > > > > > > > >> > > > > > produce requests if the retries are
> > > >> > > > > > > > >> > > > > > exhausted.
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > Another two types of controller requests
> > are
> > > >> > > > > > UpdateMetadata
> > > >> > > > > > > > and
> > > >> > > > > > > > >> > > > > > StopReplica, which I'll briefly discuss
> as
> > > >> > follows:
> > > >> > > > > > > > >> > > > > > For UpdateMetadata requests, delayed
> > > processing
> > > >> > > means
> > > >> > > > > > > clients
> > > >> > > > > > > > >> > > receiving
> > > >> > > > > > > > >> > > > > > stale metadata, e.g. with the wrong
> > > leadership
> > > >> > info
> > > >> > > > > > > > >> > > > > > for certain partitions, and the effect is
> > > more
> > > >> > > retries
> > > >> > > > > or
> > > >> > > > > > > even
> > > >> > > > > > > > >> > fatal
> > > >> > > > > > > > >> > > > > > failure if the retries are exhausted.
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > For StopReplica requests, a long queuing
> > time
> > > >> may
> > > >> > > > > degrade
> > > >> > > > > > > the
> > > >> > > > > > > > >> > > > performance
> > > >> > > > > > > > >> > > > > > of topic deletion.
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > Regarding your last question of the delay
> > for
> > > >> > > > > > > > >> > DescribeLogDirsRequest,
> > > >> > > > > > > > >> > > > you
> > > >> > > > > > > > >> > > > > > are right
> > > >> > > > > > > > >> > > > > > that this KIP cannot help with the
> latency
> > in
> > > >> > > getting
> > > >> > > > > the
> > > >> > > > > > > log
> > > >> > > > > > > > >> dirs
> > > >> > > > > > > > >> > > > info,
> > > >> > > > > > > > >> > > > > > and it's only relevant
> > > >> > > > > > > > >> > > > > > when controller requests are involved.
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > Regards,
> > > >> > > > > > > > >> > > > > > Lucas
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > On Tue, Jul 3, 2018 at 5:11 PM, Dong Lin
> <
> > > >> > > > > > > lindon...@gmail.com
> > > >> > > > > > > > >
> > > >> > > > > > > > >> > > wrote:
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > >> Hey Jun,
> > > >> > > > > > > > >> > > > > >>
> > > >> > > > > >> Thanks much for the comments. It is a good point. So the
> > > >> > > > > >> feature may be useful for the JBOD use case. I have one
> > > >> > > > > >> question below.
> > > >> > > > > > > > >> > > > > >>
> > > >> > > > > > > > >> > > > > >> Hey Lucas,
> > > >> > > > > > > > >> > > > > >>
> > > >> > > > > > > > >> > > > > >> Do you think this feature is also useful
> > for
> > > >> > > non-JBOD
> > > >> > > > > > setup
> > > >> > > > > > > > or
> > > >> > > > > > > > >> it
> > > >> > > > > > > > >> > is
> > > >> > > > > > > > >> > > > > only
> > > >> > > > > > > > >> > > > > >> useful for the JBOD setup? It may be
> > useful
> > > to
> > > >> > > > > understand
> > > >> > > > > > > > this.
> > > >> > > > > > > > >> > > > > >>
> > > >> > > > > > > > >> > > > > >> When the broker is setup using JBOD, in
> > > order
> > > >> to
> > > >> > > move
> > > >> > > > > > > leaders
> > > >> > > > > > > > >> on
> > > >> > > > > > > > >> > the
> > > >> > > > > > > > >> > > > > >> failed
> > > >> > > > > > > > >> > > > > >> disk to other disks, the system operator
> > > first
> > > >> > > needs
> > > >> > > > to
> > > >> > > > > > get
> > > >> > > > > > > > the
> > > >> > > > > > > > >> > list
> > > >> > > > > > > > >> > > > of
> > > >> > > > > > > > >> > > > > >> partitions on the failed disk. This is
> > > >> currently
> > > >> > > > > achieved
> > > >> > > > > > > > using
> > > >> > > > > > > > >> > > > > >> AdminClient.describeLogDirs(), which
> sends
> > > >> > > > > > > > >> DescribeLogDirsRequest
> > > >> > > > > > > > >> > to
> > > >> > > > > > > > >> > > > the
> > > >> > > > > > > > >> > > > > >> broker. If we only prioritize the
> > controller
> > > >> > > > requests,
> > > >> > > > > > then
> > > >> > > > > > > > the
> > > >> > > > > > > > >> > > > > >> DescribeLogDirsRequest
> > > >> > > > > > > > >> > > > > >> may still take a long time to be
> processed
> > > by
> > > >> the
> > > >> > > > > broker.
> > > >> > > > > > > So
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > > overall
> > > >> > > > > > > > >> > > > > >> time to move leaders away from the
> failed
> > > disk
> > > >> > may
> > > >> > > > > still
> > > >> > > > > > be
> > > >> > > > > > > > >> long
> > > >> > > > > > > > >> > > even
> > > >> > > > > > > > >> > > > > with
> > > >> > > > > > > > >> > > > > >> this KIP. What do you think?
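
(For reference, a rough sketch of that AdminClient call; the broker id and
bootstrap address are placeholders, and the exact result types differ across
client versions:)

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    // Sketch: ask broker 0 to describe its log dirs; an operator would then
    // pick the directory whose error marks the failed disk and collect the
    // partitions (replicas) reported under it.
    public class ListPartitionsOnFailedDisk {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                admin.describeLogDirs(Collections.singleton(0)).all().get()
                     .forEach((brokerId, dirs) ->
                         dirs.forEach((path, info) ->
                             System.out.println(brokerId + " " + path + " -> " + info)));
            }
        }
    }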
> > > >> > > > > > > > >> > > > > >>
> > > >> > > > > > > > >> > > > > >> Thanks,
> > > >> > > > > > > > >> > > > > >> Dong
> > > >> > > > > > > > >> > > > > >>
> > > >> > > > > > > > >> > > > > >>
> > > >> > > > > > > > >> > > > > >> On Tue, Jul 3, 2018 at 4:38 PM, Lucas
> > Wang <
> > > >> > > > > > > > >> lucasatu...@gmail.com
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > > > wrote:
> > > >> > > > > > > > >> > > > > >>
> > > >> > > > > > > > >> > > > > >> > Thanks for the insightful comment,
> Jun.
> > > >> > > > > > > > >> > > > > >> >
> > > >> > > > > > > > >> > > > > >> > @Dong,
> > > >> > > > > > > > >> > > > > >> > Since both of the two comments in your
> > > >> previous
> > > >> > > > email
> > > >> > > > > > are
> > > >> > > > > > > > >> about
> > > >> > > > > > > > >> > > the
> > > >> > > > > > > > >> > > > > >> > benefits of this KIP and whether it's
> > > >> useful,
> > > >> > > > > > > > >> > > > > >> > in light of Jun's last comment, do you
> > > agree
> > > >> > that
> > > >> > > > > this
> > > >> > > > > > > KIP
> > > >> > > > > > > > >> can
> > > >> > > > > > > > >> > be
> > > >> > > > > > > > >> > > > > >> > beneficial in the case mentioned by
> Jun?
> > > >> > > > > > > > >> > > > > >> > Please let me know, thanks!
> > > >> > > > > > > > >> > > > > >> >
> > > >> > > > > > > > >> > > > > >> > Regards,
> > > >> > > > > > > > >> > > > > >> > Lucas
> > > >> > > > > > > > >> > > > > >> >
> > > >> > > > > > > > >> > > > > >> > On Tue, Jul 3, 2018 at 2:07 PM, Jun
> Rao
> > <
> > > >> > > > > > > j...@confluent.io>
> > > >> > > > > > > > >> > wrote:
> > > >> > > > > > > > >> > > > > >> >
> > > >> > > > > > > > >> > > > > >> > > Hi, Lucas, Dong,
> > > >> > > > > > > > >> > > > > >> > >
> > > >> > > > > > > > >> > > > > >> > > If all disks on a broker are slow,
> one
> > > >> > probably
> > > >> > > > > > should
> > > >> > > > > > > > just
> > > >> > > > > > > > >> > kill
> > > >> > > > > > > > >> > > > the
> > > >> > > > > > > > >> > > > > >> > > broker. In that case, this KIP may
> not
> > > >> help.
> > > >> > If
> > > >> > > > > only
> > > >> > > > > > > one
> > > >> > > > > > > > of
> > > >> > > > > > > > >> > the
> > > >> > > > > > > > >> > > > > disks
> > > >> > > > > > > > >> > > > > >> on
> > > >> > > > > > > > >> > > > > >> > a
> > > >> > > > > > > > >> > > > > >> > > broker is slow, one may want to fail
> > > that
> > > >> > disk
> > > >> > > > and
> > > >> > > > > > move
> > > >> > > > > > > > the
> > > >> > > > > > > > >> > > > leaders
> > > >> > > > > > > > >> > > > > on
> > > >> > > > > > > > >> > > > > >> > that
> > > >> > > > > > > > >> > > > > >> > > disk to other brokers. In that case,
> > > being
> > > >> > able
> > > >> > > > to
> > > >> > > > > > > > process
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > > > >> > LeaderAndIsr
> > > >> > > > > > > > >> > > > > >> > > requests faster will potentially
> help
> > > the
> > > >> > > > producers
> > > >> > > > > > > > recover
> > > >> > > > > > > > >> > > > quicker.
> > > >> > > > > > > > >> > > > > >> > >
> > > >> > > > > > > > >> > > > > >> > > Thanks,
> > > >> > > > > > > > >> > > > > >> > >
> > > >> > > > > > > > >> > > > > >> > > Jun
> > > >> > > > > > > > >> > > > > >> > >
> > > >> > > > > > > > >> > > > > >> > > On Mon, Jul 2, 2018 at 7:56 PM, Dong
> > > Lin <
> > > >> > > > > > > > >> lindon...@gmail.com
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > > > wrote:
> > > >> > > > > > > > >> > > > > >> > >
> > > >> > > > > > > > >> > > > > >> > > > Hey Lucas,
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > > Thanks for the reply. Some follow
> up
> > > >> > > questions
> > > >> > > > > > below.
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > > Regarding 1, if each
> ProduceRequest
> > > >> covers
> > > >> > 20
> > > >> > > > > > > > partitions
> > > >> > > > > > > > >> > that
> > > >> > > > > > > > >> > > > are
> > > >> > > > > > > > >> > > > > >> > > randomly
> > > >> > > > > > > > >> > > > > >> > > > distributed across all partitions,
> > > then
> > > >> > each
> > > >> > > > > > > > >> ProduceRequest
> > > >> > > > > > > > >> > > will
> > > >> > > > > > > > >> > > > > >> likely
> > > >> > > > > > > > >> > > > > >> > > > cover some partitions for which
> the
> > > >> broker
> > > >> > is
> > > >> > > > > still
> > > >> > > > > > > > >> leader
> > > >> > > > > > > > >> > > after
> > > >> > > > > > > > >> > > > > it
> > > >> > > > > > > > >> > > > > >> > > quickly
> > > >> > > > > > > > >> > > > > >> > > > processes the
> > > >> > > > > > > > >> > > > > >> > > > LeaderAndIsrRequest. Then broker
> > will
> > > >> still
> > > >> > > be
> > > >> > > > > slow
> > > >> > > > > > > in
> > > >> > > > > > > > >> > > > processing
> > > >> > > > > > > > >> > > > > >> these
> > > >> > > > > > > > >> > > > > >> > > > ProduceRequest and request will
> > still
> > > be
> > > >> > very
> > > >> > > > > high
> > > >> > > > > > > with
> > > >> > > > > > > > >> this
> > > >> > > > > > > > >> > > > KIP.
> > > >> > > > > > > > >> > > > > It
> > > >> > > > > > > > >> > > > > >> > > seems
> > > >> > > > > > > > >> > > > > >> > > > that most ProduceRequest will
> still
> > > >> timeout
> > > >> > > > after
> > > >> > > > > > 30
> > > >> > > > > > > > >> > seconds.
> > > >> > > > > > > > >> > > Is
> > > >> > > > > > > > >> > > > > >> this
> > > >> > > > > > > > >> > > > > >> > > > understanding correct?
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > > Regarding 2, if most
> ProduceRequest
> > > will
> > > >> > > still
> > > >> > > > > > > timeout
> > > >> > > > > > > > >> after
> > > >> > > > > > > > >> > > 30
> > > >> > > > > > > > >> > > > > >> > seconds,
> > > >> > > > > > > > >> > > > > >> > > > then it is less clear how this KIP
> > > >> reduces
> > > >> > > > > average
> > > >> > > > > > > > >> produce
> > > >> > > > > > > > >> > > > > latency.
> > > >> > > > > > > > >> > > > > >> Can
> > > >> > > > > > > > >> > > > > >> > > you
> > > >> > > > > > > > >> > > > > >> > > > clarify what metrics can be
> improved
> > > by
> > > >> > this
> > > >> > > > KIP?
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > > Not sure why system operator
> > directly
> > > >> cares
> > > >> > > > > number
> > > >> > > > > > of
> > > >> > > > > > > > >> > > truncated
> > > >> > > > > > > > >> > > > > >> > messages.
> > > >> > > > > > > > >> > > > > >> > > > Do you mean this KIP can improve
> > > average
> > > >> > > > > throughput
> > > >> > > > > > > or
> > > >> > > > > > > > >> > reduce
> > > >> > > > > > > > >> > > > > >> message
> > > >> > > > > > > > >> > > > > >> > > > duplication? It will be good to
> > > >> understand
> > > >> > > > this.
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > > Thanks,
> > > >> > > > > > > > >> > > > > >> > > > Dong
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > > On Tue, 3 Jul 2018 at 7:12 AM
> Lucas
> > > >> Wang <
> > > >> > > > > > > > >> > > lucasatu...@gmail.com
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > wrote:
> > > >> > > > > > > > >> > > > > >> > > >
> > > >> > > > > > > > >> > > > > >> > > > > Hi Dong,
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > > Thanks for your valuable
> comments.
> > > >> Please
> > > >> > > see
> > > >> > > > > my
> > > >> > > > > > > > reply
> > > >> > > > > > > > >> > > below.
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > > 1. The Google doc showed only 1
> > > >> > partition.
> > > >> > > > Now
> > > >> > > > > > > let's
> > > >> > > > > > > > >> > > consider
> > > >> > > > > > > > >> > > > a
> > > >> > > > > > > > >> > > > > >> more
> > > >> > > > > > > > >> > > > > >> > > > common
> > > >> > > > > > > > >> > > > > >> > > > > scenario
> > > >> > > > > > > > >> > > > > >> > > > > where broker0 is the leader of
> > many
> > > >> > > > partitions.
> > > >> > > > > > And
> > > >> > > > > > > > >> let's
> > > >> > > > > > > > >> > > say
> > > >> > > > > > > > >> > > > > for
> > > >> > > > > > > > >> > > > > >> > some
> > > >> > > > > > > > >> > > > > >> > > > > reason its IO becomes slow.
> > > >> > > > > > > > >> > > > > >> > > > > The number of leader partitions
> on
> > > >> > broker0
> > > >> > > is
> > > >> > > > > so
> > > >> > > > > > > > large,
> > > >> > > > > > > > >> > say
> > > >> > > > > > > > >> > > > 10K,
> > > >> > > > > > > > >> > > > > >> that
> > > >> > > > > > > > >> > > > > >> > > the
> > > >> > > > > > > > >> > > > > >> > > > > cluster is skewed,
> > > >> > > > > > > > >> > > > > >> > > > > and the operator would like to
> > shift
> > > >> the
> > > >> > > > > > leadership
> > > >> > > > > > > > >> for a
> > > >> > > > > > > > >> > > lot
> > > >> > > > > > > > >> > > > of
> > > >> > > > > > > > >> > > > > >> > > > > partitions, say 9K, to other
> > > brokers,
> > > >> > > > > > > > >> > > > > >> > > > > either manually or through some
> > > >> service
> > > >> > > like
> > > >> > > > > > cruise
> > > >> > > > > > > > >> > control.
> > > >> > > > > > > > >> > > > > >> > > > > With this KIP, not only will the
> > > >> > leadership
> > > >> > > > > > > > transitions
> > > >> > > > > > > > >> > > finish
> > > >> > > > > > > > >> > > > > >> more
> > > >> > > > > > > > >> > > > > >> > > > > quickly, helping the cluster
> > itself
> > > >> > > becoming
> > > >> > > > > more
> > > >> > > > > > > > >> > balanced,
> > > >> > > > > > > > >> > > > > >> > > > > but all existing producers
> > > >> corresponding
> > > >> > to
> > > >> > > > the
> > > >> > > > > > 9K
> > > >> > > > > > > > >> > > partitions
> > > >> > > > > > > > >> > > > > will
> > > >> > > > > > > > >> > > > > >> > get
> > > >> > > > > > > > >> > > > > >> > > > the
> > > >> > > > > > > > >> > > > > >> > > > > errors relatively quickly
> > > >> > > > > > > > >> > > > > >> > > > > rather than relying on their
> > > timeout,
> > > >> > > thanks
> > > >> > > > to
> > > >> > > > > > the
> > > >> > > > > > > > >> > batched
> > > >> > > > > > > > >> > > > > async
> > > >> > > > > > > > >> > > > > >> ZK
> > > >> > > > > > > > >> > > > > >> > > > > operations.
> > > >> > > > > > > > >> > > > > >> > > > > To me it's a useful feature to
> > have
> > > >> > during
> > > >> > > > such
> > > >> > > > > > > > >> > troublesome
> > > >> > > > > > > > >> > > > > times.
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > > 2. The experiments in the Google
> > Doc
> > > >> have
> > > >> > > > shown
> > > >> > > > > > > that
> > > >> > > > > > > > >> with
> > > >> > > > > > > > >> > > this
> > > >> > > > > > > > >> > > > > KIP
> > > >> > > > > > > > >> > > > > >> > many
> > > >> > > > > > > > >> > > > > >> > > > > producers
> > > >> > > > > > > > >> > > > > >> > > > > receive an explicit error
> > > >> > > > > NotLeaderForPartition,
> > > >> > > > > > > > based
> > > >> > > > > > > > >> on
> > > >> > > > > > > > >> > > > which
> > > >> > > > > > > > >> > > > > >> they
> > > >> > > > > > > > >> > > > > >> > > > retry
> > > >> > > > > > > > >> > > > > >> > > > > immediately.
> > > >> > > > > > > > >> > > > > >> > > > > Therefore the latency (~14
> > > >> seconds+quick
> > > >> > > > retry)
> > > >> > > > > > for
> > > >> > > > > > > > >> their
> > > >> > > > > > > > >> > > > single
> > > >> > > > > > > > >> > > > > >> > > message
> > > >> > > > > > > > >> > > > > >> > > > is
> > > >> > > > > > > > >> > > > > >> > > > > much smaller
> > > >> > > > > > > > >> > > > > >> > > > > compared with the case of timing
> > out
> > > >> > > without
> > > >> > > > > the
> > > >> > > > > > > KIP
> > > >> > > > > > > > >> (30
> > > >> > > > > > > > >> > > > seconds
> > > >> > > > > > > > >> > > > > >> for
> > > >> > > > > > > > >> > > > > >> > > > timing
> > > >> > > > > > > > >> > > > > >> > > > > out + quick retry).
> > > >> > > > > > > > >> > > > > >> > > > > One might argue that reducing
> the
> > > >> timing
> > > >> > > out
> > > >> > > > on
> > > >> > > > > > the
> > > >> > > > > > > > >> > producer
> > > >> > > > > > > > >> > > > > side
> > > >> > > > > > > > >> > > > > >> can
> > > >> > > > > > > > >> > > > > >> > > > > achieve the same result,
> > > >> > > > > > > > >> > > > > >> > > > > yet reducing the timeout has its
> > own
> > > >> > > > > > drawbacks[1].
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > > Also *IF* there were a metric to
> > > show
> > > >> the
> > > >> > > > > number
> > > >> > > > > > of
> > > >> > > > > > > > >> > > truncated
> > > >> > > > > > > > >> > > > > >> > messages
> > > >> > > > > > > > >> > > > > >> > > on
> > > >> > > > > > > > >> > > > > >> > > > > brokers,
> > > >> > > > > > > > >> > > > > >> > > > > with the experiments done in the
> > > >> Google
> > > >> > > Doc,
> > > >> > > > it
> > > >> > > > > > > > should
> > > >> > > > > > > > >> be
> > > >> > > > > > > > >> > > easy
> > > >> > > > > > > > >> > > > > to
> > > >> > > > > > > > >> > > > > >> see
> > > >> > > > > > > > >> > > > > >> > > > that
> > > >> > > > > > > > >> > > > > >> > > > > a lot fewer messages need
> > > >> > > > > > > > >> > > > > >> > > > > to be truncated on broker0 since
> > the
> > > >> > > > up-to-date
> > > >> > > > > > > > >> metadata
> > > >> > > > > > > > >> > > > avoids
> > > >> > > > > > > > >> > > > > >> > > appending
> > > >> > > > > > > > >> > > > > >> > > > > of messages
> > > >> > > > > > > > >> > > > > >> > > > > in subsequent PRODUCE requests.
> If
> > > we
> > > >> > talk
> > > >> > > > to a
> > > >> > > > > > > > system
> > > >> > > > > > > > >> > > > operator
> > > >> > > > > > > > >> > > > > >> and
> > > >> > > > > > > > >> > > > > >> > ask
> > > >> > > > > > > > >> > > > > >> > > > > whether
> > > >> > > > > > > > >> > > > > >> > > > > they prefer fewer wasteful IOs,
> I
> > > bet
> > > >> > most
> > > >> > > > > likely
> > > >> > > > > > > the
> > > >> > > > > > > > >> > answer
> > > >> > > > > > > > >> > > > is
> > > >> > > > > > > > >> > > > > >> yes.
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > > 3. To answer your question, I
> > think
> > > it
> > > >> > > might
> > > >> > > > be
> > > >> > > > > > > > >> helpful to
> > > >> > > > > > > > >> > > > > >> construct
> > > >> > > > > > > > >> > > > > >> > > some
> > > >> > > > > > > > >> > > > > >> > > > > formulas.
> > > >> > > > > > > > >> > > > > >> > > > > To simplify the modeling, I'm
> > going
> > > >> back
> > > >> > to
> > > >> > > > the
> > > >> > > > > > > case
> > > >> > > > > > > > >> where
> > > >> > > > > > > > >> > > > there
> > > >> > > > > > > > >> > > > > >> is
> > > >> > > > > > > > >> > > > > >> > > only
> > > >> > > > > > > > >> > > > > >> > > > > ONE partition involved.
> > > >> > > > > > > > >> > > > > >> > > > > Following the experiments in the
> > > >> Google
> > > >> > > Doc,
> > > >> > > > > > let's
> > > >> > > > > > > > say
> > > >> > > > > > > > >> > > broker0
> > > >> > > > > > > > >> > > > > >> > becomes
> > > >> > > > > > > > >> > > > > >> > > > the
> > > >> > > > > > > > >> > > > > >> > > > > follower at time t0,
> > > >> > > > > > > > >> > > > > >> > > > > and after t0 there were still N
> > > >> produce
> > > >> > > > > requests
> > > >> > > > > > in
> > > >> > > > > > > > its
> > > >> > > > > > > > >> > > > request
> > > >> > > > > > > > >> > > > > >> > queue.
> > > >> > > > > > > > >> > > > > >> > > > > With the up-to-date metadata
> > brought
> > > >> by
> > > >> > > this
> > > >> > > > > KIP,
> > > >> > > > > > > > >> broker0
> > > >> > > > > > > > >> > > can
> > > >> > > > > > > > >> > > > > >> reply
> > > >> > > > > > > > >> > > > > >> > > with
> > > >> > > > > > > > >> > > > > >> > > > an
> > > >> > > > > > > > >> > > > > >> > > > > NotLeaderForPartition exception,
> > > >> > > > > > > > >> > > > > >> > > > > let's use M1 to denote the
> average
> > > >> > > processing
> > > >> > > > > > time
> > > >> > > > > > > of
> > > >> > > > > > > > >> > > replying
> > > >> > > > > > > > >> > > > > >> with
> > > >> > > > > > > > >> > > > > >> > > such
> > > >> > > > > > > > >> > > > > >> > > > an
> > > >> > > > > > > > >> > > > > >> > > > > error message.
> > > >> > > > > > > > >> > > > > >> > > > > Without this KIP, the broker
> will
> > > >> need to
> > > >> > > > > append
> > > >> > > > > > > > >> messages
> > > >> > > > > > > > >> > to
> > > >> > > > > > > > >> > > > > >> > segments,
> > > >> > > > > > > > >> > > > > >> > > > > which may trigger a flush to
> disk,
> > > >> > > > > > > > >> > > > > >> > > > > let's use M2 to denote the
> average
> > > >> > > processing
> > > >> > > > > > time
> > > >> > > > > > > > for
> > > >> > > > > > > > >> > such
> > > >> > > > > > > > >> > > > > logic.
> > > >> > > > > > > > >> > > > > >> > > > > Then the average extra latency
> > > >> incurred
> > > >> > > > without
> > > >> > > > > > > this
> > > >> > > > > > > > >> KIP
> > > >> > > > > > > > >> > is
> > > >> > > > > > > > >> > > N
> > > >> > > > > > > > >> > > > *
> > > >> > > > > > > > >> > > > > >> (M2 -
> > > >> > > > > > > > >> > > > > >> > > > M1) /
> > > >> > > > > > > > >> > > > > >> > > > > 2.
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > > In practice, M2 should always be
> > > >> larger
> > > >> > > than
> > > >> > > > > M1,
> > > >> > > > > > > > which
> > > >> > > > > > > > >> > means
> > > >> > > > > > > > >> > > > as
> > > >> > > > > > > > >> > > > > >> long
> > > >> > > > > > > > >> > > > > >> > > as N
> > > >> > > > > > > > >> > > > > >> > > > > is positive,
> > > >> > > > > > > > >> > > > > >> > > > > we would see improvements on the
> > > >> average
> > > >> > > > > latency.
> > > >> > > > > > > > >> > > > > >> > > > > There does not need to be
> > > significant
> > > >> > > backlog
> > > >> > > > > of
> > > >> > > > > > > > >> requests
> > > >> > > > > > > > >> > in
> > > >> > > > > > > > >> > > > the
> > > >> > > > > > > > >> > > > > >> > > request
> > > >> > > > > > > > >> > > > > >> > > > > queue,
> > > >> > > > > > > > >> > > > > >> > > > > or severe degradation of disk
> > > >> performance
> > > >> > > to
> > > >> > > > > have
> > > >> > > > > > > the
> > > >> > > > > > > > >> > > > > improvement.
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > > Regards,
> > > >> > > > > > > > >> > > > > >> > > > > Lucas
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > > [1] For instance, reducing the
> > > >> timeout on
> > > >> > > the
> > > >> > > > > > > > producer
> > > >> > > > > > > > >> > side
> > > >> > > > > > > > >> > > > can
> > > >> > > > > > > > >> > > > > >> > trigger
> > > >> > > > > > > > >> > > > > >> > > > > unnecessary duplicate requests
> > > >> > > > > > > > >> > > > > >> > > > > when the corresponding leader
> > broker
> > > >> is
> > > >> > > > > > overloaded,
> > > >> > > > > > > > >> > > > exacerbating
> > > >> > > > > > > > >> > > > > >> the
> > > >> > > > > > > > >> > > > > >> > > > > situation.
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > > On Sun, Jul 1, 2018 at 9:18 PM,
> > Dong
> > > >> Lin
> > > >> > <
> > > >> > > > > > > > >> > > lindon...@gmail.com
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > wrote:
> > > >> > > > > > > > >> > > > > >> > > > >
> > > >> > > > > > > > >> > > > > >> > > > > > Hey Lucas,
> > > >> > > > > > > > >> > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > >> > > > > > Thanks much for the detailed
> > > >> > > documentation
> > > >> > > > of
> > > >> > > > > > the
> > > >> > > > > > > > >> > > > experiment.
> > > >> > > > > > > > >> > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > >> > > > > > Initially I also think having
> a
> > > >> > separate
> > > >> > > > > queue
> > > >> > > > > > > for
> > > >> > > > > > > > >> > > > controller
> > > >> > > > > > > > >> > > > > >> > > requests
> > > >> > > > > > > > >> > > > > >> > > > is
> > > >> > > > > > > > >> > > > > >> > > > > > useful because, as you
> mentioned
> > > in
> > > >> the
> > > >> > > > > summary
> > > >> > > > > > > > >> section
> > > >> > > > > > > > >> > of
> > > >> > > > > > > > >> > > > the
> > > >> > > > > > > > >> > > > > >> > Google
> > > >> > > > > > > > >> > > > > >> > > > > doc,
> > > >> > > > > > > > >> > > > > >> > > > > > controller requests are
> > generally
> > > >> more
> > > >> > >
>
>
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>
