Just my take, since Jun and Ben originally wanted to pursue a more
general approach and I talked them out of it :)

When we first add the feature, safety is probably the most important
factor in getting people to adopt it - I wanted to make the feature very
safe by never throttling something admins don't want to throttle. So we
figured the manual approach, while more challenging to configure, is the
safest. Admins usually know which replicas are "at risk" of taking
over and can choose to throttle them accordingly; they can also build
their own integrations with monitoring tools, etc.

It feels like any "smarts" we try to build into Kafka can be done
better with external tools that can watch both Kafka traffic (with the
new metrics) and things like network and CPU monitors.

We are open to a smarter approach in Kafka, but perhaps plan it for a
follow-up KIP? Maybe even after we have some experience with the
manual approach and how best to make throttling decisions.
This is similar to what we do when choosing partitions to move around - we
started manually, admins are gaining experience with how they like to
choose replicas, and then we can bake their expertise into the product.

Gwen

On Thu, Aug 18, 2016 at 10:29 AM, Jun Rao <j...@confluent.io> wrote:
> Joel,
>
> Yes, for your second comment. The tricky thing is still to figure out which
> replicas to throttle and by how much since in general, admins probably
> don't want already in-sync or close to in-sync replicas to be throttled. It
> would be great to get Todd's opinion on this. Could you ping him?
>
> Yes, we'd be happy to discuss auto-detection of effect traffic more offline.
>
> Thanks,
>
> Jun
>
> On Thu, Aug 18, 2016 at 10:21 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
>
>> > For your first comment. We thought about determining "effect" replicas
>> > automatically as well. First, there is some tricky stuff that one has to
>> >
>>
>> Auto-detection of effect traffic: I'm fairly certain it's doable but
>> definitely tricky. I'm also not sure it is something worth tackling at the
>> outset. If we want to spend more time thinking it over, even if it's just an
>> academic exercise, I would be happy to brainstorm offline.
>>
>>
>> > For your second comment, we discussed that in the client quotas design. A
>> > down side of that for client quotas is that a client may be surprised
>> that
>> > its traffic is not throttled at one time, but throttled at another with
>> the
>> > same quota (basically, less predictability). You can imagine setting a
>> quota
>> > for all replication traffic and only slow down the "effect" replicas if
>> > needed. The thought is more or less the same as the above. It requires
>> more
>> >
>>
>> For clients, this is true. I think this is much less of an issue for
>> server-side replication since the "users" here are the Kafka SREs who
>> generally know these internal details.
>>
>> I think it would be valuable to get some feedback from SREs on the proposal
>> before proceeding to a vote. (ping Todd)
>>
>> Joel
>>
>>
>> >
>> > On Thu, Aug 18, 2016 at 9:37 AM, Ben Stopford <b...@confluent.io> wrote:
>> >
>> > > Hi Joel
>> > >
>> > > Ha! Yes, we had some similar thoughts, on both counts. Both are actually
>> > > good approaches, but come with some extra complexity.
>> > >
>> > > Segregating the replication type is tempting as it creates a more
>> general
>> > > solution. One issue is you need to draw a line between lagging and not
>> > > lagging. The ISR ‘limit' is a tempting divider, but has the side effect
>> > > that, once you drop out, you get immediately throttled. Adding a
>> > > configurable divider is another option, but difficult for admins to
>> set,
>> > > and always a little arbitrary. A better idea is to prioritise, in
>> reverse
>> > > order to lag. But that also comes with additional complexity of its
>> own.
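One reading of the "prioritise in reverse order to lag" idea above, as a rough sketch, assuming "reverse order to lag" means least-lagging replicas are served first (all names here are illustrative, not actual Kafka code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: order partitions so the least-lagging
// replicas are fetched first, so replicas close to rejoining the ISR
// are not starved by heavily lagging ones.
class LagPriority {
    static List<String> fetchOrder(Map<String, Long> lagBytesByPartition) {
        List<String> order = new ArrayList<>(lagBytesByPartition.keySet());
        // Smallest lag first == highest priority.
        order.sort(Comparator.comparingLong(lagBytesByPartition::get));
        return order;
    }
}
```

Whether priority should be measured in bytes of lag or time behind the leader is exactly the kind of somewhat arbitrary choice the paragraph above alludes to.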
>> > >
>> > > Under-throttling is also a tempting addition. That’s to say, if there’s
>> > > idle bandwidth lying around, not being used, why not use it to let
>> > lagging
>> > > brokers catch up? This involves some comparison to the maximum
>> bandwidth,
>> > > which could be configurable, or could be derived, with pros and cons
>> for
>> > > each.
>> > >
>> > > But the more general problem is actually quite hard to reason about, so
>> > > after some discussion we decided to settle on something simple, that we
>> > > felt we could get working, and extend to add these additional features
>> as
>> > > subsequent KIPs.
>> > >
>> > > I hope that seems reasonable. Jun may wish to add to this.
>> > >
>> > > B
>> > >
>> > >
>> > > > On 18 Aug 2016, at 06:56, Joel Koshy <jjkosh...@gmail.com> wrote:
>> > > >
>> > > > On Wed, Aug 17, 2016 at 9:13 PM, Ben Stopford <b...@confluent.io>
>> > wrote:
>> > > >
>> > > >>
>> > > >> Let us know if you have any further thoughts on KIP-73, else we'll
>> > > kick
>> > > >> off a vote.
>> > > >>
>> > > >
>> > > > I think the mechanism for throttling replicas looks good. Just had a
>> > few
>> > > > more thoughts on the configuration section. What you have looks
>> > > reasonable,
>> > > > but I was wondering if it could be made simpler. You probably thought
>> > > > through these, so I'm curious to know your take.
>> > > >
>> > > > My guess is that most of the time, users would want to throttle all
>> > > effect
>> > > > replication - due to partition reassignments, adding brokers or a
>> > broker
>> > > > coming back online after an extended period of time. In all these
>> > > scenarios
>> > > > it may be possible to distinguish bootstrap (effect) vs normal
>> > > replication
>> > > > - based on how far the replica has to catch up. I'm wondering if it
>> is
>> > > > enough to just set an umbrella "effect" replication quota with
>> perhaps
>> > > > per-topic overrides (say if some topics are more important than
>> others)
>> > > as
>> > > > opposed to designating throttled replicas.
>> > > >
>> > > > Also, IIRC during client-side quota discussions we had considered the
>> > > > possibility of allowing clients to go above their quotas when
>> resources
>> > > are
>> > > > available. We ended up not doing that, but for replication throttling
>> > it
>> > > > may make sense - i.e., to treat the quota as a soft limit. Another
>> way
>> > to
>> > > > look at it is instead of ensuring "effect replication traffic does
>> not
>> > > flow
>> > > > faster than X bytes/sec" it may be useful to instead ensure that
>> > "effect
>> > > > replication traffic only flows as slowly as necessary (so as not to
>> > > > adversely affect normal replication traffic)."
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Joel
>> > > >
>> > > >>>
>> > > >>>> On Thu, Aug 11, 2016 at 2:43 PM, Jun Rao <j...@confluent.io
>> > > >>>> wrote:
>> > > >>>>
>> > > >>>>> Hi, Joel,
>> > > >>>>>
>> > > >>>>> Yes, the response size includes both throttled and unthrottled
>> > > >>> replicas.
>> > > >>>>> However, the response is only delayed up to max.wait if the
>> > response
>> > > >>> size
>> > > >>>>> is less than min.bytes, which matches the current behavior. So,
>> > there
>> > > >>> is
>> > > >>>> no
>> > > >>>>> extra delay due to throttling, right? For replica fetchers, the
>> > > >> default
>> > > >>>>> min.bytes is 1. So, the response is only delayed if there is no byte
>> byte
>> > > >> in
>> > > >>>> the
>> > > >>>>> response, which is what we want.
>> > > >>>>>
>> > > >>>>> Thanks,
>> > > >>>>>
>> > > >>>>> Jun
>> > > >>>>>
>> > > >>>>> On Thu, Aug 11, 2016 at 11:53 AM, Joel Koshy <
>> jjkosh...@gmail.com
>> > > >>>>
>> > > >>>> wrote:
>> > > >>>>>
>> > > >>>>>> Hi Jun,
>> > > >>>>>>
>> > > >>>>>> I'm not sure that would work unless we have separate replica
>> > > >>> fetchers,
>> > > >>>>>> since this would cause all replicas (including ones that are not
>> > > >>>>> throttled)
>> > > >>>>>> to get delayed. Instead, we could just have the leader populate
>> > the
>> > > >>>>>> throttle-time field of the response as a hint to the follower as
>> > to
>> > > >>> how
>> > > >>>>>> long it should wait before it adds those replicas back to its
>> > > >>>> subsequent
>> > > >>>>>> replica fetch requests.
>> > > >>>>>>
>> > > >>>>>> Thanks,
>> > > >>>>>>
>> > > >>>>>> Joel
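Joel's throttle-time hint could look something like the following sketch (hypothetical names and structure; not Kafka's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the suggestion above: the leader fills in the response's
// throttle-time field as a hint, and the follower withholds those
// partitions from fetch requests until the hinted backoff elapses.
// Names are illustrative assumptions, not Kafka's actual code.
class ThrottleHintBackoff {
    private final Map<String, Long> resumeAtMs = new HashMap<>();

    // Called when a fetch response carries a throttle-time hint for a partition.
    void onResponse(String partition, long throttleTimeMs, long nowMs) {
        if (throttleTimeMs > 0)
            resumeAtMs.put(partition, nowMs + throttleTimeMs);
    }

    // Decides whether the partition goes into the next fetch request.
    boolean includeInNextFetch(String partition, long nowMs) {
        Long resumeAt = resumeAtMs.get(partition);
        return resumeAt == null || nowMs >= resumeAt;
    }
}
```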
>> > > >>>>>>
>> > > >>>>>> On Thu, Aug 11, 2016 at 9:50 AM, Jun Rao <j...@confluent.io
>> > > >>>> wrote:
>> > > >>>>>>
>> > > >>>>>>> Mayuresh,
>> > > >>>>>>>
>> > > >>>>>>> That's a good question. I think if the response size (after
>> > > >> leader
>> > > >>>>>>> throttling) is smaller than min.bytes, we will just delay the
>> > > >>> sending
>> > > >>>>> of
>> > > >>>>>>> the response up to max.wait as we do now. This should prevent
>> > > >>>> frequent
>> > > >>>>>>> empty responses to the follower.
>> > > >>>>>>>
>> > > >>>>>>> Thanks,
>> > > >>>>>>>
>> > > >>>>>>> Jun
>> > > >>>>>>>
>> > > >>>>>>> On Wed, Aug 10, 2016 at 9:17 PM, Mayuresh Gharat <
>> > > >>>>>>> gharatmayures...@gmail.com
>> > > >>>>>>>> wrote:
>> > > >>>>>>>
>> > > >>>>>>>> This might have been answered before.
>> > > >>>>>>>> I was wondering when the leader quota is reached and it sends
>> > > >>> empty
>> > > >>>>>>>> response (If the inclusion of a partition, listed in the
>> > > >>> leader's
>> > > >>>>>>>> throttled-replicas list, causes the LeaderQuotaRate to be
>> > > >>> exceeded,
>> > > >>>>>> that
>> > > >>>>>>>> partition is omitted from the response (aka returns 0
>> bytes).).
>> > > >>> At
>> > > >>>>> this
>> > > >>>>>>>> point the follower quota is NOT reached and the follower is
>> > > >> still
>> > > >>>>> going
>> > > >>>>>>> to
>> > > >>>>>>>> ask for that partition in the next fetch request. Would it
>> > > >> be
>> > > >>>>> fair
>> > > >>>>>> to
>> > > >>>>>>>> add some logic there so that the follower backs off (for some
>> > > >>>>>>> configurable
>> > > >>>>>>>> time) from including those partitions in the next fetch
>> > > >> request?
>> > > >>>>>>>>
>> > > >>>>>>>> Thanks,
>> > > >>>>>>>>
>> > > >>>>>>>> Mayuresh
>> > > >>>>>>>>
>> > > >>>>>>>> On Wed, Aug 10, 2016 at 8:06 AM, Ben Stopford <
>> > > >> b...@confluent.io
>> > > >>>>
>> > > >>>>>> wrote:
>> > > >>>>>>>>
>> > > >>>>>>>>> Thanks again for the responses everyone. I’ve removed the
>> > > >>>> extra
>> > > >>>>>>>>> fetcher threads from the proposal, switching to the
>> > > >>>> inclusion-based
>> > > >>>>>>>>> approach. The relevant section is:
>> > > >>>>>>>>>
>> > > >>>>>>>>> The follower makes a request, using the fixed size of
>> > > >>>>>>>>> replica.fetch.response.max.bytes as per KIP-74 <
>> > > >>>>>>>> https://cwiki.apache.org/
>> > > >>>>>>>>> confluence/display/KAFKA/KIP-74%3A+Add+Fetch+Response+Size+
>> > > >>>>>>>> Limit+in+Bytes>.
>> > > >>>>>>>>> The order of the partitions in the fetch request is
>> > > >> randomised
>> > > >>>> to
>> > > >>>>>>> ensure
>> > > >>>>>>>>> fairness.
>> > > >>>>>>>>> When the leader receives the fetch request it processes the
>> > > >>>>>> partitions
>> > > >>>>>>> in
>> > > >>>>>>>>> the defined order, up to the response's size limit. If the
>> > > >>>>> inclusion
>> > > >>>>>>> of a
>> > > >>>>>>>>> partition, listed in the leader's throttled-replicas list,
>> > > >>> causes
>> > > >>>>> the
>> > > >>>>>>>>> LeaderQuotaRate to be exceeded, that partition is omitted
>> > > >> from
>> > > >>>> the
>> > > >>>>>>>> response
>> > > >>>>>>>>> (aka returns 0 bytes). Logically, this is of the form:
>> > > >>>>>>>>> var bytesAllowedForThrottledPartition =
>> > > >>>>> quota.recordAndMaybeAdjust(
>> > > >>>>>>>>> bytesRequestedForPartition)
>> > > >>>>>>>>> When the follower receives the fetch response, if it includes
>> > > >>>>>>> partitions
>> > > >>>>>>>>> in its throttled-partitions list, it increments the
>> > > >>>>>> FollowerQuotaRate:
>> > > >>>>>>>>> var includeThrottledPartitionsInNextRequest: Boolean =
>> > > >>>>>>>>> quota.recordAndEvaluate(previousResponseThrottledBytes)
>> > > >>>>>>>>> If the quota is exceeded, no throttled partitions will be
>> > > >>>> included
>> > > >>>>> in
>> > > >>>>>>> the
>> > > >>>>>>>>> next fetch request emitted by this replica fetcher thread.
>> > > >>>>>>>>>
>> > > >>>>>>>>> B
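For illustration, the two quota calls in Ben's description above might behave like this sketch, reduced to a single fixed window (Kafka's real quota manager uses sampled rate windows; the names and semantics here are simplified assumptions, not the actual implementation):

```java
// Illustrative sketch only, not Kafka's actual quota classes.
class ByteRateQuota {
    private final long allowedBytesPerWindow;
    private long recordedBytes = 0;

    ByteRateQuota(long allowedBytesPerWindow) {
        this.allowedBytesPerWindow = allowedBytesPerWindow;
    }

    // Leader side: a throttled partition is either included in full or,
    // if including it would breach the quota, omitted (0 bytes).
    long recordAndMaybeAdjust(long bytesRequestedForPartition) {
        if (recordedBytes + bytesRequestedForPartition > allowedBytesPerWindow)
            return 0;
        recordedBytes += bytesRequestedForPartition;
        return bytesRequestedForPartition;
    }

    // Follower side: record the throttled bytes from the previous response;
    // returns whether throttled partitions may appear in the next request.
    boolean recordAndEvaluate(long previousResponseThrottledBytes) {
        recordedBytes += previousResponseThrottledBytes;
        return recordedBytes <= allowedBytesPerWindow;
    }
}
```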
>> > > >>>>>>>>>
>> > > >>>>>>>>>> On 9 Aug 2016, at 23:34, Jun Rao <j...@confluent.io
>> > > >>>> wrote:
>> > > >>>>>>>>>>
>> > > >>>>>>>>>> When there are several unthrottled replicas, we could also
>> > > >>> just
>> > > >>>>> do
>> > > >>>>>>>> what's
>> > > >>>>>>>>>> suggested in KIP-74. The client is responsible for
>> > > >> reordering
>> > > >>>> the
>> > > >>>>>>>>>> partitions and the leader fills in the bytes to those
>> > > >>>> partitions
>> > > >>>>> in
>> > > >>>>>>>>> order,
>> > > >>>>>>>>>> up to the quota limit.
>> > > >>>>>>>>>>
>> > > >>>>>>>>>> We could also do what you suggested. If quota is exceeded,
>> > > >>>>> include
>> > > >>>>>>>> empty
>> > > >>>>>>>>>> data in the response for throttled replicas. Keep doing
>> > > >> that
>> > > >>>>> until
>> > > >>>>>>>> enough
>> > > >>>>>>>>>> time has passed so that the quota is no longer exceeded.
>> > > >> This
>> > > >>>>>>>> potentially
>> > > >>>>>>>>>> allows better batching per partition. Not sure if the two
>> > > >>>> make a
>> > > >>>>>> big
>> > > >>>>>>>>>> difference in practice though.
>> > > >>>>>>>>>>
>> > > >>>>>>>>>> Thanks,
>> > > >>>>>>>>>>
>> > > >>>>>>>>>> Jun
>> > > >>>>>>>>>>
>> > > >>>>>>>>>>
>> > > >>>>>>>>>> On Tue, Aug 9, 2016 at 2:31 PM, Joel Koshy <
>> > > >>>> jjkosh...@gmail.com>
>> > > >>>>>>>> wrote:
>> > > >>>>>>>>>>
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>> On the leader side, one challenge is related to the
>> > > >>> fairness
>> > > >>>>>> issue
>> > > >>>>>>>> that
>> > > >>>>>>>>>>> Ben
>> > > >>>>>>>>>>>> brought up. The question is what if the fetch response
>> > > >>> limit
>> > > >>>> is
>> > > >>>>>>>> filled
>> > > >>>>>>>>> up
>> > > >>>>>>>>>>>> by the throttled replicas? If this happens constantly, we
>> > > >>>> will
>> > > >>>>>>> delay
>> > > >>>>>>>>> the
>> > > >>>>>>>>>>>> progress of those un-throttled replicas. However, I think
>> > > >>> we
>> > > >>>>> can
>> > > >>>>>>>>> address
>> > > >>>>>>>>>>>> this issue by trying to fill up the unthrottled replicas
>> > > >> in
>> > > >>>> the
>> > > >>>>>>>>> response
>> > > >>>>>>>>>>>> first. So, the algorithm would be: fill up unthrottled
>> > > >>>> replicas
>> > > >>>>>> up
>> > > >>>>>>> to
>> > > >>>>>>>>> the
>> > > >>>>>>>>>>>> fetch response limit. If there is space left, fill up
>> > > >>>> throttled
>> > > >>>>>>>>> replicas.
>> > > >>>>>>>>>>>> If quota is exceeded for the throttled replicas, reduce
>> > > >> the
>> > > >>>>> bytes
>> > > >>>>>>> in
>> > > >>>>>>>>> the
>> > > >>>>>>>>>>>> throttled replicas in the response accordingly.
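Jun's fill order above could be sketched as follows (illustrative names and a single-window simplification of what the leader would actually do):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the fill order described above: unthrottled partitions are
// served first up to the response size limit; throttled partitions then
// take whatever space remains, reduced so the quota is not exceeded.
// Names are illustrative, not Kafka's actual code.
class ResponseFiller {
    static Map<String, Integer> fill(Map<String, Integer> availableBytes, // partition -> bytes ready
                                     Set<String> throttled,
                                     int responseLimit,
                                     int throttleAllowance) {
        Map<String, Integer> response = new LinkedHashMap<>();
        int spaceLeft = responseLimit;
        // Pass 1: unthrottled partitions, bounded only by the response limit.
        for (Map.Entry<String, Integer> e : availableBytes.entrySet()) {
            if (throttled.contains(e.getKey())) continue;
            int take = Math.min(e.getValue(), spaceLeft);
            if (take > 0) response.put(e.getKey(), take);
            spaceLeft -= take;
        }
        // Pass 2: throttled partitions get the leftover space, capped by the
        // remaining quota allowance.
        int quotaLeft = throttleAllowance;
        for (Map.Entry<String, Integer> e : availableBytes.entrySet()) {
            if (!throttled.contains(e.getKey())) continue;
            int take = Math.min(e.getValue(), Math.min(spaceLeft, quotaLeft));
            if (take > 0) response.put(e.getKey(), take);
            spaceLeft -= take;
            quotaLeft -= take;
        }
        return response;
    }
}
```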
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>
>> > > >>>>>>>>>>> Right - that's what I was trying to convey by truncation
>> > > >> (vs
>> > > >>>>>> empty).
>> > > >>>>>>>> So
>> > > >>>>>>>>> we
>> > > >>>>>>>>>>> would attempt to fill the response for throttled
>> > > >> partitions
>> > > >>> as
>> > > >>>>>> much
>> > > >>>>>>> as
>> > > >>>>>>>>> we
>> > > >>>>>>>>>>> can before hitting the quota limit. There is one more
>> > > >> detail
>> > > >>>> to
>> > > >>>>>>> handle
>> > > >>>>>>>>> in
>> > > >>>>>>>>>>> this: if there are several throttled partitions and not
>> > > >>> enough
>> > > >>>>>>>> remaining
>> > > >>>>>>>>>>> allowance in the fetch response to include all the
>> > > >> throttled
>> > > >>>>>>> replicas
>> > > >>>>>>>>> then
>> > > >>>>>>>>>>> we would need to decide which of those partitions get a
>> > > >>> share;
>> > > >>>>>> which
>> > > >>>>>>>> is
>> > > >>>>>>>>> why
>> > > >>>>>>>>>>> I'm wondering if it is easier to return empty for those
>> > > >>>>> partitions
>> > > >>>>>>>>> entirely
>> > > >>>>>>>>>>> in the fetch response - they will make progress in the
>> > > >>>>> subsequent
>> > > >>>>>>>>> fetch. If
>> > > >>>>>>>>>>> they don't make fast enough progress then that would be a
>> > > >>> case
>> > > >>>>> for
>> > > >>>>>>>>> raising
>> > > >>>>>>>>>>> the threshold or letting it complete at an off-peak time.
>> > > >>>>>>>>>>>
>> > > >>>>>>>>>>>
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>> With this approach, we need some new logic to handle
>> > > >>>> throttling
>> > > >>>>>> on
>> > > >>>>>>>> the
>> > > >>>>>>>>>>>> leader, but we can leave the replica threading model
>> > > >>>> unchanged.
>> > > >>>>>> So,
>> > > >>>>>>>>>>>> overall, this still seems to be a simpler approach.
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>> Thanks,
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>> Jun
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>> On Tue, Aug 9, 2016 at 11:57 AM, Mayuresh Gharat <
>> > > >>>>>>>>>>>> gharatmayures...@gmail.com
>> > > >>>>>>>>>>>>> wrote:
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>>> Nice write up Ben.
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>> I agree with Joel for keeping this simple by excluding
>> > > >> the
>> > > >>>>>>>> partitions
>> > > >>>>>>>>>>>> from
>> > > >>>>>>>>>>>>> the fetch request/response when the quota is violated at
>> > > >>> the
>> > > >>>>>>>> follower
>> > > >>>>>>>>>>> or
>> > > >>>>>>>>>>>>> leader instead of having a separate set of threads for
>> > > >>>>> handling
>> > > >>>>>>> the
>> > > >>>>>>>>>>> quota
>> > > >>>>>>>>>>>>> and non-quota cases. Even though it's different from the
>> > > >>>>> current
>> > > >>>>>>>> quota
>> > > >>>>>>>>>>>>> implementation, it should be OK since it's internal to
>> > > >>> brokers
>> > > >>>>> and
>> > > >>>>>>> can
>> > > >>>>>>>>> be
>> > > >>>>>>>>>>>>> handled by tuning the quota configs for it appropriately
>> > > >>> by
>> > > >>>>> the
>> > > >>>>>>>>> admins.
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>> Also can you elaborate with an example how this would be
>> > > >>>>>> handled :
>> > > >>>>>>>>>>>>> *guaranteeing
>> > > >>>>>>>>>>>>> ordering of updates when replicas shift threads*
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>> Thanks,
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>> Mayuresh
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>> On Tue, Aug 9, 2016 at 10:49 AM, Joel Koshy <
>> > > >>>>>> jjkosh...@gmail.com>
>> > > >>>>>>>>>>> wrote:
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>> On the need for both leader/follower throttling: that
>> > > >>> makes
>> > > >>>>>>> sense -
>> > > >>>>>>>>>>>>> thanks
>> > > >>>>>>>>>>>>>> for clarifying. For completeness, can we add this
>> > > >> detail
>> > > >>> to
>> > > >>>>> the
>> > > >>>>>>>> doc -
>> > > >>>>>>>>>>>>> say,
>> > > >>>>>>>>>>>>>> after the quote that I pasted earlier?
>> > > >>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>> From an implementation perspective though: I’m still
>> > > >>>>> interested
>> > > >>>>>>> in
>> > > >>>>>>>>>>> the
>> > > >>>>>>>>>>>>>> simplicity of not having to add separate replica
>> > > >>> fetchers,
>> > > >>>>>> delay
>> > > >>>>>>>>>>> queue
>> > > >>>>>>>>>>>> on
>> > > >>>>>>>>>>>>>> the leader, and “move” partitions from the throttled
>> > > >>>> replica
>> > > >>>>>>>> fetchers
>> > > >>>>>>>>>>>> to
>> > > >>>>>>>>>>>>>> the regular replica fetchers once caught up.
>> > > >>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>> Instead, I think it would work and be simpler to
>> > > >> include
>> > > >>> or
>> > > >>>>>>> exclude
>> > > >>>>>>>>>>> the
>> > > >>>>>>>>>>>>>> partitions in the fetch request from the follower and
>> > > >>> fetch
>> > > >>>>>>>> response
>> > > >>>>>>>>>>>> from
>> > > >>>>>>>>>>>>>> the leader when the quota is violated. The issue of
>> > > >>>> fairness
>> > > >>>>>> that
>> > > >>>>>>>> Ben
>> > > >>>>>>>>>>>>> noted
>> > > >>>>>>>>>>>>>> may be a wash between the two options (that Ben wrote
>> > > >> in
>> > > >>>> his
>> > > >>>>>>>> email).
>> > > >>>>>>>>>>>> With
>> > > >>>>>>>>>>>>>> the default quota delay mechanism, partitions get
>> > > >> delayed
>> > > >>>>>>>> essentially
>> > > >>>>>>>>>>>> at
>> > > >>>>>>>>>>>>>> random - i.e., whoever fetches at the time of quota
>> > > >>>> violation
>> > > >>>>>>> gets
>> > > >>>>>>>>>>>>> delayed
>> > > >>>>>>>>>>>>>> at the leader. So we can adopt a similar policy in
>> > > >>> choosing
>> > > >>>>> to
>> > > >>>>>>>>>>> truncate
>> > > >>>>>>>>>>>>>> partitions in fetch responses. i.e., if at the time of
>> > > >>>>> handling
>> > > >>>>>>> the
>> > > >>>>>>>>>>>> fetch
>> > > >>>>>>>>>>>>>> the “effect” replication rate exceeds the quota then
>> > > >>> either
>> > > >>>>>> empty
>> > > >>>>>>>> or
>> > > >>>>>>>>>>>>>> truncate those partitions from the response. (BTW
>> > > >> effect
>> > > >>>>>>>> replication
>> > > >>>>>>>>>>> is
>> > > >>>>>>>>>>>>>> your terminology in the wiki - i.e., replication due to
>> > > >>>>>> partition
>> > > >>>>>>>>>>>>>> reassignment, adding brokers, etc.)
>> > > >>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>> While this may be slightly different from the existing
>> > > >>>> quota
>> > > >>>>>>>>>>> mechanism
>> > > >>>>>>>>>>>> I
>> > > >>>>>>>>>>>>>> think the difference is small (since we would reuse the
>> > > >>>> quota
>> > > >>>>>>>> manager
>> > > >>>>>>>>>>>> at
>> > > >>>>>>>>>>>>>> worst with some refactoring) and will be internal to
>> > > >> the
>> > > >>>>>> broker.
>> > > >>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>> So I guess the question is if this alternative is
>> > > >> simpler
>> > > >>>>>> enough
>> > > >>>>>>>> and
>> > > >>>>>>>>>>>>>> equally functional to not go with dedicated throttled
>> > > >>>> replica
>> > > >>>>>>>>>>> fetchers.
>> > > >>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>> On Tue, Aug 9, 2016 at 9:44 AM, Jun Rao <
>> > > >>> j...@confluent.io>
>> > > >>>>>>> wrote:
>> > > >>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>> Just to elaborate on what Ben said about why we need
>> > > >>> throttling
>> > > >>>> on
>> > > >>>>>>> both
>> > > >>>>>>>>>>> the
>> > > >>>>>>>>>>>>>>> leader and the follower side.
>> > > >>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>> If we only have throttling on the follower side,
>> > > >>> consider
>> > > >>>> a
>> > > >>>>>> case
>> > > >>>>>>>>>>> that
>> > > >>>>>>>>>>>>> we
>> > > >>>>>>>>>>>>>>> add 5 more new brokers and want to move some replicas
>> > > >>> from
>> > > >>>>>>>> existing
>> > > >>>>>>>>>>>>>> brokers
>> > > >>>>>>>>>>>>>>> over to those 5 brokers. Each of those brokers is going
>> > > >>> to
>> > > >>>>>> fetch
>> > > >>>>>>>>>>> data
>> > > >>>>>>>>>>>>> from
>> > > >>>>>>>>>>>>>>> all existing brokers. Then, it's possible that the
>> > > >>>>> aggregated
>> > > >>>>>>>> fetch
>> > > >>>>>>>>>>>>> load
>> > > >>>>>>>>>>>>>>> from those 5 brokers on a particular existing broker
>> > > >>>> exceeds
>> > > >>>>>> its
>> > > >>>>>>>>>>>>> outgoing
>> > > >>>>>>>>>>>>>>> network bandwidth, even though the inbound traffic
>> > > >> on
>> > > >>>>> each
>> > > >>>>>> of
>> > > >>>>>>>>>>>> those
>> > > >>>>>>>>>>>>> 5
>> > > >>>>>>>>>>>>>>> brokers is bounded.
>> > > >>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>> If we only have throttling on the leader side,
>> > > >> consider
>> > > >>>> the
>> > > >>>>>> same
>> > > >>>>>>>>>>>>> example
>> > > >>>>>>>>>>>>>>> above. It's possible for the incoming traffic to each
>> > > >> of
>> > > >>>>>> those 5
>> > > >>>>>>>>>>>>> brokers
>> > > >>>>>>>>>>>>>> to
>> > > >>>>>>>>>>>>>>> exceed its network bandwidth since it is fetching data
>> > > >>>> from
>> > > >>>>>> all
>> > > >>>>>>>>>>>>> existing
>> > > >>>>>>>>>>>>>>> brokers.
>> > > >>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>> So, being able to set a quota on both the follower and
>> > > >>> the
>> > > >>>>>>> leader
>> > > >>>>>>>>>>>> side
>> > > >>>>>>>>>>>>>>> protects against both cases.
>> > > >>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>> Thanks,
>> > > >>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>> Jun
>> > > >>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>> On Tue, Aug 9, 2016 at 4:43 AM, Ben Stopford <
>> > > >>>>>> b...@confluent.io>
>> > > >>>>>>>>>>>> wrote:
>> > > >>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>> Hi Joel
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>> Thanks for taking the time to look at this.
>> > > >>> Appreciated.
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>> Regarding throttling on both leader and follower,
>> > > >> this
>> > > >>>>>> proposal
>> > > >>>>>>>>>>>>> covers
>> > > >>>>>>>>>>>>>> a
>> > > >>>>>>>>>>>>>>>> more general solution which can guarantee a quota,
>> > > >> even
>> > > >>>>> when
>> > > >>>>>> a
>> > > >>>>>>>>>>>>>> rebalance
>> > > >>>>>>>>>>>>>>>> operation produces an asymmetric profile of load.
>> > > >> This
>> > > >>>>> means
>> > > >>>>>>>>>>>>>>> administrators
>> > > >>>>>>>>>>>>>>>> don’t need to calculate the impact that a
>> > > >> follower-only
>> > > >>>>> quota
>> > > >>>>>>>>>>> will
>> > > >>>>>>>>>>>>> have
>> > > >>>>>>>>>>>>>>> on
>> > > >>>>>>>>>>>>>>>> the leaders they are fetching from. So for example
>> > > >>> where
>> > > >>>>>>> replica
>> > > >>>>>>>>>>>>> sizes
>> > > >>>>>>>>>>>>>>> are
>> > > >>>>>>>>>>>>>>>> skewed or where a partial rebalance is required.
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>> Having said that, even with both leader and follower
>> > > >>>>> quotas,
>> > > >>>>>>> the
>> > > >>>>>>>>>>>> use
>> > > >>>>>>>>>>>>> of
>> > > >>>>>>>>>>>>>>>> additional threads is actually optional. There appear
>> > > >>> to
>> > > >>>> be
>> > > >>>>>> two
>> > > >>>>>>>>>>>>> general
>> > > >>>>>>>>>>>>>>>> approaches (1) omit partitions from fetch requests
>> > > >>>>>> (follower) /
>> > > >>>>>>>>>>>> fetch
>> > > >>>>>>>>>>>>>>>> responses (leader) when they exceed their quota, or (2)
>> > > >>> delay
>> > > >>>>>> them,
>> > > >>>>>>>>>>> as
>> > > >>>>>>>>>>>>> the
>> > > >>>>>>>>>>>>>>>> existing quota mechanism does, using separate
>> > > >> fetchers.
>> > > >>>>> Both
>> > > >>>>>>>>>>> appear
>> > > >>>>>>>>>>>>>>> valid,
>> > > >>>>>>>>>>>>>>>> but with slightly different design tradeoffs.
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>> The issue with approach (1) is that it departs
>> > > >> somewhat
>> > > >>>>> from
>> > > >>>>>>> the
>> > > >>>>>>>>>>>>>> existing
>> > > >>>>>>>>>>>>>>>> quotas implementation, and must include a notion of
>> > > >>>>> fairness
>> > > >>>>>>>>>>>> within
>> > > >>>>>>>>>>>>>> the
>> > > >>>>>>>>>>>>>>>> now size-bounded request and response. The issue
>> > > >> with
>> > > >>>> (2)
>> > > >>>>> is
>> > > >>>>>>>>>>>>>>> guaranteeing
>> > > >>>>>>>>>>>>>>>> ordering of updates when replicas shift threads, but
>> > > >>> this
>> > > >>>>> is
>> > > >>>>>>>>>>>> handled,
>> > > >>>>>>>>>>>>>> for
>> > > >>>>>>>>>>>>>>>> the most part, in the code today.
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>> I’ve updated the rejected alternatives section to
>> > > >> make
>> > > >>>>> this a
>> > > >>>>>>>>>>>> little
>> > > >>>>>>>>>>>>>>>> clearer.
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>> B
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>> On 8 Aug 2016, at 20:38, Joel Koshy <
>> > > >>>> jjkosh...@gmail.com>
>> > > >>>>>>>>>>> wrote:
>> > > >>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>> Hi Ben,
>> > > >>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>> Thanks for the detailed write-up. So the proposal
>> > > >>>> involves
>> > > >>>>>>>>>>>>>>>> self-throttling
>> > > >>>>>>>>>>>>>>>>> on the fetcher side and throttling at the leader.
>> > > >> Can
>> > > >>>> you
>> > > >>>>>>>>>>>> elaborate
>> > > >>>>>>>>>>>>>> on
>> > > >>>>>>>>>>>>>>>> the
>> > > >>>>>>>>>>>>>>>>> reasoning that is given on the wiki: *“The throttle
>> > > >> is
>> > > >>>>>> applied
>> > > >>>>>>>>>>> to
>> > > >>>>>>>>>>>>>> both
>> > > >>>>>>>>>>>>>>>>> leaders and followers. This allows the admin to
>> > > >> exert
>> > > >>>>> strong
>> > > >>>>>>>>>>>>>> guarantees
>> > > >>>>>>>>>>>>>>>> on
>> > > >>>>>>>>>>>>>>>>> the throttle limit".* Is there any reason why one or
>> > > >>> the
>> > > >>>>>> other
>> > > >>>>>>>>>>>>>> wouldn't
>> > > >>>>>>>>>>>>>>>> be
>> > > >>>>>>>>>>>>>>>>> sufficient?
>> > > >>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>> Specifically, if we were to only do self-throttling
>> > > >> on
>> > > >>>> the
>> > > >>>>>>>>>>>>> fetchers,
>> > > >>>>>>>>>>>>>> we
>> > > >>>>>>>>>>>>>>>>> could potentially avoid the additional replica
>> > > >>> fetchers
>> > > >>>>>> right?
>> > > >>>>>>>>>>>>> i.e.,
>> > > >>>>>>>>>>>>>>> the
>> > > >>>>>>>>>>>>>>>>> replica fetchers would maintain their quota metrics as
>> > > >>> you
>> > > >>>>>>>>>>> proposed
>> > > >>>>>>>>>>>>> and
>> > > >>>>>>>>>>>>>>>> each
>> > > >>>>>>>>>>>>>>>>> (normal) replica fetch presents an opportunity to
>> > > >> make
>> > > >>>>>>> progress
>> > > >>>>>>>>>>>> for
>> > > >>>>>>>>>>>>>> the
>> > > >>>>>>>>>>>>>>>>> throttled partitions as long as their effective
>> > > >>>>> consumption
>> > > >>>>>>>>>>> rate
>> > > >>>>>>>>>>>> is
>> > > >>>>>>>>>>>>>>> below
>> > > >>>>>>>>>>>>>>>>> the quota limit. If the consumption rate exceeds the quota
>> > > >>> then
>> > > >>>>>> don’t
>> > > >>>>>>>>>>>>>> include
>> > > >>>>>>>>>>>>>>>> the
>> > > >>>>>>>>>>>>>>>>> throttled partitions in the subsequent fetch
>> > > >> requests
>> > > >>>>> until
>> > > >>>>>>> the
>> > > >>>>>>>>>>>>>>> effective
>> > > >>>>>>>>>>>>>>>>> consumption rate for those partitions returns to
>> > > >>> within
>> > > >>>>> the
>> > > >>>>>>>>>>> quota
>> > > >>>>>>>>>>>>>>>> threshold.
>> > > >>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>> I have more questions on the proposal, but was more
>> > > >>>>>> interested
>> > > >>>>>>>>>>> in
>> > > >>>>>>>>>>>>> the
>> > > >>>>>>>>>>>>>>>> above
>> > > >>>>>>>>>>>>>>>>> to see if it could simplify things a bit.
>> > > >>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>> Also, can you open up access to the google-doc that
>> > > >>> you
>> > > >>>>> link
>> > > >>>>>>>>>>> to?
>> > > >>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>> Thanks,
>> > > >>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>> Joel
>> > > >>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>> On Mon, Aug 8, 2016 at 5:54 AM, Ben Stopford <
>> > > >>>>>>> b...@confluent.io
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>>>> wrote:
>> > > >>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>>> We’ve created KIP-73: Replication Quotas
>> > > >>>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>>> The idea is to allow an admin to throttle moving
>> > > >>>>> replicas.
>> > > >>>>>>>>>>> Full
>> > > >>>>>>>>>>>>>>> details
>> > > >>>>>>>>>>>>>>>>>> are here:
>> > > >>>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>>> https://cwiki.apache.org/
>> > > >>> confluence/display/KAFKA/KIP-
>> > > >>>>> 73+
>> > > >>>>>>>>>>>>>>>>>> Replication+Quotas <https://cwiki.apache.org/conf
>> > > >>>>>>>>>>>>>>>>>> luence/display/KAFKA/KIP-73+Replication+Quotas>
>> > > >>>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>>> Please take a look and let us know your thoughts.
>> > > >>>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>>> Thanks
>> > > >>>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>>> B
>> > > >>>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>> --
>> > > >>>>>>>>>>>>> -Regards,
>> > > >>>>>>>>>>>>> Mayuresh R. Gharat
>> > > >>>>>>>>>>>>> (862) 250-7125
>> > > >>>>>>>>>>>>>
>> > > >>>>>>>>>>>>
>> > > >>>>>>>>>>>
>> > > >>>>>>>>>
>> > > >>>>>>>>>
>> > > >>>>>>>>
>> > > >>>>>>>>
>> > > >>>>>>>> --
>> > > >>>>>>>> -Regards,
>> > > >>>>>>>> Mayuresh R. Gharat
>> > > >>>>>>>> (862) 250-7125
>> > > >>>>>>>>
>> > > >>>>>>>
>> > > >>>>>>
>> > > >>>>>
>> > > >>>>
>> > > >>>>
>> > > >>>>
>> > > >>>> --
>> > > >>>> -Regards,
>> > > >>>> Mayuresh R. Gharat
>> > > >>>> (862) 250-7125
>> > > >>>>
>> > > >>>
>> > > >>
>> > > >>
>> > > >> --
>> > > >> Ben Stopford
>> > > >>
>> > >
>> > >
>> >
>>



-- 
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog
