Mayuresh,

I was thinking of the following.

If P1 has data and P2 is throttled, we will return empty data for P2 and
send the response back immediately. The follower will issue the next fetch
request immediately, but the leader won't return any data for P2 until the
quota is no longer exceeded. We are not delaying the fetch requests here.
However, there is no additional overhead compared with the no-throttling
case, since P1 always has data.

If P1 has no data and P2 is throttled, the leader will return empty data
for both P1 and P2 after waiting in the Purgatory up to max.wait. This
prevents the follower from getting empty responses too frequently.
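
To make this concrete, here is a rough Scala sketch of the leader-side
decision (all names are illustrative, not the actual broker code):

  case class PartitionReadResult(partition: String, bytes: Long, throttled: Boolean)

  // Throttled partitions contribute zero bytes while their quota is
  // exceeded, so only unthrottled data counts toward min.bytes.
  def respondImmediately(results: Seq[PartitionReadResult], minBytes: Long): Boolean =
    results.filterNot(_.throttled).map(_.bytes).sum >= minBytes

  // P1 has data, P2 throttled -> true: respond now, with P2 empty.
  respondImmediately(Seq(
    PartitionReadResult("P1", 4096L, throttled = false),
    PartitionReadResult("P2", 0L, throttled = true)), minBytes = 1)

  // P1 empty, P2 throttled -> false: park in Purgatory up to max.wait.
  respondImmediately(Seq(
    PartitionReadResult("P1", 0L, throttled = false),
    PartitionReadResult("P2", 0L, throttled = true)), minBytes = 1)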

Thanks,

Jun

On Thu, Aug 11, 2016 at 5:33 PM, Mayuresh Gharat <gharatmayures...@gmail.com
> wrote:

> Hi Jun,
>
> Correct me if I am wrong.
> If the response size includes throttled and unthrottled replicas, I am
> wondering if this is possible:
> The leader broker B1 receives a fetch request for partitions P1 and P2 of a
> topic from replica broker B2. Let's say that only P2 is throttled on the
> leader and P1 is not. In that case we will add the data for P1 to the
> response, the min.bytes threshold will be crossed, and the response will be
> returned right away, right?
> If we say that with this KIP we will throttle this fetch request entirely,
> then we are essentially delaying the response for partition P1, which is
> not the throttled partition.
>
> Is it fair to say we can indicate to the follower, in the fetch response,
> how long it should wait before it adds partition P2 back to its fetch
> requests?
>
> Thanks,
>
> Mayuresh
>
> On Thu, Aug 11, 2016 at 2:43 PM, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Joel,
> >
> > Yes, the response size includes both throttled and unthrottled replicas.
> > However, the response is only delayed up to max.wait if the response size
> > is less than min.bytes, which matches the current behavior. So, there is
> > no extra delay due to throttling, right? For replica fetchers, the
> > default min.bytes is 1. So, the response is only delayed if there are no
> > bytes in the response, which is what we want.
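> >
> > For reference, the broker-side defaults that drive this behavior (both
> > are standard broker configs, current as of this discussion) are:
> >
> >   replica.fetch.min.bytes=1      # min.bytes: a single byte completes the fetch
> >   replica.fetch.wait.max.ms=500  # max.wait: longest an empty fetch is parked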
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Aug 11, 2016 at 11:53 AM, Joel Koshy <jjkosh...@gmail.com>
> wrote:
> >
> > > Hi Jun,
> > >
> > > I'm not sure that would work unless we have separate replica fetchers,
> > > since this would cause all replicas (including ones that are not
> > > throttled) to get delayed. Instead, we could just have the leader
> > > populate the throttle-time field of the response as a hint to the
> > > follower as to how long it should wait before it adds those replicas
> > > back to its subsequent replica fetch requests.
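> > >
> > > A rough sketch of how the follower could act on such a hint (the
> > > per-partition throttleTimeMs field and all names here are hypothetical,
> > > not an existing protocol field):
> > >
> > >   case class PartitionResponse(partition: String, throttleTimeMs: Long)
> > >
> > >   // Remember a per-partition backoff deadline taken from the hint.
> > >   val backoffUntilMs = scala.collection.mutable.Map[String, Long]()
> > >
> > >   def onFetchResponse(parts: Seq[PartitionResponse], nowMs: Long): Unit =
> > >     for (p <- parts if p.throttleTimeMs > 0)
> > >       backoffUntilMs(p.partition) = nowMs + p.throttleTimeMs
> > >
> > >   // Re-include a throttled partition only after its deadline passes.
> > >   def partitionsForNextFetch(all: Seq[String], nowMs: Long): Seq[String] =
> > >     all.filter(p => nowMs >= backoffUntilMs.getOrElse(p, 0L))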
> > >
> > > Thanks,
> > >
> > > Joel
> > >
> > > On Thu, Aug 11, 2016 at 9:50 AM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Mayuresh,
> > > >
> > > > That's a good question. I think if the response size (after leader
> > > > throttling) is smaller than min.bytes, we will just delay the sending
> > > > of the response up to max.wait as we do now. This should prevent
> > > > frequent empty responses to the follower.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Wed, Aug 10, 2016 at 9:17 PM, Mayuresh Gharat <
> > > > gharatmayures...@gmail.com
> > > > > wrote:
> > > >
> > > > > This might have been answered before.
> > > > > I was wondering about when the leader quota is reached and it sends an
> > > > > empty response ("If the inclusion of a partition, listed in the
> > > > > leader's throttled-replicas list, causes the LeaderQuotaRate to be
> > > > > exceeded, that partition is omitted from the response (aka returns 0
> > > > > bytes)."). At this point the follower quota is NOT reached and the
> > > > > follower is still going to ask for that partition in the next fetch
> > > > > request. Would it be fair to add some logic there so that the follower
> > > > > backs off (for some configurable time) from including those partitions
> > > > > in the next fetch request?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Mayuresh
> > > > >
> > > > > On Wed, Aug 10, 2016 at 8:06 AM, Ben Stopford <b...@confluent.io>
> > > wrote:
> > > > >
> > > > > > Thanks again for the responses everyone. I’ve removed the extra
> > > > > > fetcher threads from the proposal, switching to the inclusion-based
> > > > > > approach. The relevant section is:
> > > > > >
> > > > > > The follower makes a request, using the fixed size of
> > > > > > replica.fetch.response.max.bytes as per KIP-74
> > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-74%3A+Add+Fetch+Response+Size+Limit+in+Bytes>.
> > > > > > The order of the partitions in the fetch request is randomised to
> > > > > > ensure fairness.
> > > > > > When the leader receives the fetch request it processes the
> > > > > > partitions in the defined order, up to the response's size limit. If
> > > > > > the inclusion of a partition, listed in the leader's
> > > > > > throttled-replicas list, causes the LeaderQuotaRate to be exceeded,
> > > > > > that partition is omitted from the response (aka returns 0 bytes).
> > > > > > Logically, this is of the form:
> > > > > >
> > > > > >   var bytesAllowedForThrottledPartition =
> > > > > >     quota.recordAndMaybeAdjust(bytesRequestedForPartition)
> > > > > >
> > > > > > When the follower receives the fetch response, if it includes
> > > > > > partitions in its throttled-partitions list, it increments the
> > > > > > FollowerQuotaRate:
> > > > > >
> > > > > >   var includeThrottledPartitionsInNextRequest: Boolean =
> > > > > >     quota.recordAndEvaluate(previousResponseThrottledBytes)
> > > > > >
> > > > > > If the quota is exceeded, no throttled partitions will be included in
> > > > > > the next fetch request emitted by this replica fetcher thread.
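> > > > > >
> > > > > > As a rough, self-contained sketch of that follower-side check (this
> > > > > > Quota class is illustrative; the real broker tracks sampled rate
> > > > > > metrics rather than a single fixed window):
> > > > > >
> > > > > >   // Hypothetical fixed-window rate check.
> > > > > >   class Quota(bytesPerSecLimit: Long, windowMs: Long = 1000L) {
> > > > > >     private var windowStart = System.currentTimeMillis()
> > > > > >     private var bytesThisWindow = 0L
> > > > > >     // Record throttled bytes just received; true means the follower
> > > > > >     // may include throttled partitions in its next fetch request.
> > > > > >     def recordAndEvaluate(bytes: Long): Boolean = {
> > > > > >       val now = System.currentTimeMillis()
> > > > > >       if (now - windowStart >= windowMs) {
> > > > > >         windowStart = now
> > > > > >         bytesThisWindow = 0L
> > > > > >       }
> > > > > >       bytesThisWindow += bytes
> > > > > >       bytesThisWindow <= bytesPerSecLimit * windowMs / 1000
> > > > > >     }
> > > > > >   }
> > > > > >
> > > > > >   val followerQuota = new Quota(bytesPerSecLimit = 10L * 1024 * 1024)
> > > > > >   val previousResponseThrottledBytes = 512L * 1024  // from last response
> > > > > >   val includeThrottledPartitionsInNextRequest =
> > > > > >     followerQuota.recordAndEvaluate(previousResponseThrottledBytes)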
> > > > > >
> > > > > > B
> > > > > >
> > > > > > > On 9 Aug 2016, at 23:34, Jun Rao <j...@confluent.io> wrote:
> > > > > > >
> > > > > > > When there are several unthrottled replicas, we could also just do
> > > > > > > what's suggested in KIP-74. The client is responsible for
> > > > > > > reordering the partitions and the leader fills in the bytes to
> > > > > > > those partitions in order, up to the quota limit.
> > > > > > >
> > > > > > > We could also do what you suggested. If the quota is exceeded,
> > > > > > > include empty data in the response for throttled replicas. Keep
> > > > > > > doing that until enough time has passed so that the quota is no
> > > > > > > longer exceeded. This potentially allows better batching per
> > > > > > > partition. Not sure if the two make a big difference in practice
> > > > > > > though.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Aug 9, 2016 at 2:31 PM, Joel Koshy <
> jjkosh...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > >>>
> > > > > > >>> On the leader side, one challenge is related to the fairness
> > > > > > >>> issue that Ben brought up. The question is what if the fetch
> > > > > > >>> response limit is filled up by the throttled replicas? If this
> > > > > > >>> happens constantly, we will delay the progress of those
> > > > > > >>> un-throttled replicas. However, I think we can address this
> > > > > > >>> issue by trying to fill up the unthrottled replicas in the
> > > > > > >>> response first. So, the algorithm would be: fill up unthrottled
> > > > > > >>> replicas up to the fetch response limit. If there is space left,
> > > > > > >>> fill up throttled replicas. If the quota is exceeded for the
> > > > > > >>> throttled replicas, reduce the bytes in the throttled replicas
> > > > > > >>> in the response accordingly.
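> > > > > > >>>
> > > > > > >>> A rough sketch of that fill order (names are illustrative, not
> > > > > > >>> the actual broker code):
> > > > > > >>>
> > > > > > >>>   case class Fetchable(partition: String, available: Long, throttled: Boolean)
> > > > > > >>>
> > > > > > >>>   def fillResponse(parts: Seq[Fetchable], responseLimit: Long,
> > > > > > >>>                    throttledAllowance: Long): Map[String, Long] = {
> > > > > > >>>     var remaining = responseLimit      // space left in the response
> > > > > > >>>     var quotaLeft = throttledAllowance // bytes the quota still permits
> > > > > > >>>     val out = scala.collection.mutable.Map[String, Long]()
> > > > > > >>>     // Pass 1: unthrottled replicas first, up to the response limit.
> > > > > > >>>     for (p <- parts if !p.throttled) {
> > > > > > >>>       val take = math.min(p.available, remaining)
> > > > > > >>>       out(p.partition) = take
> > > > > > >>>       remaining -= take
> > > > > > >>>     }
> > > > > > >>>     // Pass 2: throttled replicas get the leftover space, capped
> > > > > > >>>     // by whatever the throttle quota still allows.
> > > > > > >>>     for (p <- parts if p.throttled) {
> > > > > > >>>       val take = math.min(math.min(p.available, remaining), quotaLeft)
> > > > > > >>>       out(p.partition) = take
> > > > > > >>>       remaining -= take
> > > > > > >>>       quotaLeft -= take
> > > > > > >>>     }
> > > > > > >>>     out.toMap
> > > > > > >>>   }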
> > > > > > >>>
> > > > > > >>
> > > > > > >> Right - that's what I was trying to convey by truncation (vs
> > > > > > >> empty). So we would attempt to fill the response for throttled
> > > > > > >> partitions as much as we can before hitting the quota limit.
> > > > > > >> There is one more detail to handle in this: if there are several
> > > > > > >> throttled partitions and not enough remaining allowance in the
> > > > > > >> fetch response to include all the throttled replicas, then we
> > > > > > >> would need to decide which of those partitions get a share; which
> > > > > > >> is why I'm wondering if it is easier to return empty for those
> > > > > > >> partitions entirely in the fetch response - they will make
> > > > > > >> progress in the subsequent fetch. If they don't make fast enough
> > > > > > >> progress then that would be a case for raising the threshold or
> > > > > > >> letting it complete at an off-peak time.
> > > > > > >>
> > > > > > >>
> > > > > > >>>
> > > > > > >>> With this approach, we need some new logic to handle throttling
> > > > > > >>> on the leader, but we can leave the replica threading model
> > > > > > >>> unchanged. So, overall, this still seems to be a simpler
> > > > > > >>> approach.
> > > > > > >>>
> > > > > > >>> Thanks,
> > > > > > >>>
> > > > > > >>> Jun
> > > > > > >>>
> > > > > > >>> On Tue, Aug 9, 2016 at 11:57 AM, Mayuresh Gharat <
> > > > > > >>> gharatmayures...@gmail.com
> > > > > > >>>> wrote:
> > > > > > >>>
> > > > > > >>>> Nice write-up, Ben.
> > > > > > >>>>
> > > > > > >>>> I agree with Joel on keeping this simple by excluding the
> > > > > > >>>> partitions from the fetch request/response when the quota is
> > > > > > >>>> violated at the follower or leader, instead of having a separate
> > > > > > >>>> set of threads for handling the quota and non-quota cases. Even
> > > > > > >>>> though it's different from the current quota implementation it
> > > > > > >>>> should be OK, since it's internal to the brokers and can be
> > > > > > >>>> handled by the admins tuning the quota configs appropriately.
> > > > > > >>>>
> > > > > > >>>> Also, can you elaborate with an example how this would be
> > > > > > >>>> handled: *guaranteeing ordering of updates when replicas shift
> > > > > > >>>> threads*
> > > > > > >>>>
> > > > > > >>>> Thanks,
> > > > > > >>>>
> > > > > > >>>> Mayuresh
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Tue, Aug 9, 2016 at 10:49 AM, Joel Koshy <
> > > jjkosh...@gmail.com>
> > > > > > >> wrote:
> > > > > > >>>>
> > > > > > >>>>> On the need for both leader/follower throttling: that makes
> > > > > > >>>>> sense - thanks for clarifying. For completeness, can we add
> > > > > > >>>>> this detail to the doc - say, after the quote that I pasted
> > > > > > >>>>> earlier?
> > > > > > >>>>>
> > > > > > >>>>> From an implementation perspective though: I’m still
> > > > > > >>>>> interested in the simplicity of not having to add separate
> > > > > > >>>>> replica fetchers, a delay queue on the leader, and “move”
> > > > > > >>>>> partitions from the throttled replica fetchers to the regular
> > > > > > >>>>> replica fetchers once caught up.
> > > > > > >>>>>
> > > > > > >>>>> Instead, I think it would work and be simpler to include or
> > > > > > >>>>> exclude the partitions in the fetch request from the follower
> > > > > > >>>>> and fetch response from the leader when the quota is violated.
> > > > > > >>>>> The issue of fairness that Ben noted may be a wash between the
> > > > > > >>>>> two options (that Ben wrote in his email). With the default
> > > > > > >>>>> quota delay mechanism, partitions get delayed essentially at
> > > > > > >>>>> random - i.e., whoever fetches at the time of quota violation
> > > > > > >>>>> gets delayed at the leader. So we can adopt a similar policy in
> > > > > > >>>>> choosing to truncate partitions in fetch responses, i.e., if at
> > > > > > >>>>> the time of handling the fetch the “effect” replication rate
> > > > > > >>>>> exceeds the quota, then either empty or truncate those
> > > > > > >>>>> partitions from the response. (BTW effect replication is your
> > > > > > >>>>> terminology in the wiki - i.e., replication due to partition
> > > > > > >>>>> reassignment, adding brokers, etc.)
> > > > > > >>>>>
> > > > > > >>>>> While this may be slightly different from the existing quota
> > > > > > >>>>> mechanism, I think the difference is small (since we would
> > > > > > >>>>> reuse the quota manager, at worst with some refactoring) and
> > > > > > >>>>> will be internal to the broker.
> > > > > > >>>>>
> > > > > > >>>>> So I guess the question is if this alternative is simple
> > > > > > >>>>> enough and equally functional to not go with dedicated
> > > > > > >>>>> throttled replica fetchers.
> > > > > > >>>>>
> > > > > > >>>>> On Tue, Aug 9, 2016 at 9:44 AM, Jun Rao <j...@confluent.io>
> > > > wrote:
> > > > > > >>>>>
> > > > > > >>>>>> Just to elaborate on what Ben said about why we need
> > > > > > >>>>>> throttling on both the leader and the follower side.
> > > > > > >>>>>>
> > > > > > >>>>>> If we only have throttling on the follower side, consider a
> > > > > > >>>>>> case where we add 5 new brokers and want to move some replicas
> > > > > > >>>>>> from existing brokers over to those 5 brokers. Each of those
> > > > > > >>>>>> brokers is going to fetch data from all existing brokers.
> > > > > > >>>>>> Then, it's possible that the aggregated fetch load from those
> > > > > > >>>>>> 5 brokers on a particular existing broker exceeds its outgoing
> > > > > > >>>>>> network bandwidth, even though the inbound traffic on each of
> > > > > > >>>>>> those 5 brokers is bounded.
> > > > > > >>>>>>
> > > > > > >>>>>> If we only have throttling on the leader side, consider the
> > > > > > >>>>>> same example above. It's possible for the incoming traffic to
> > > > > > >>>>>> each of those 5 brokers to exceed its network bandwidth, since
> > > > > > >>>>>> it is fetching data from all existing brokers.
> > > > > > >>>>>>
> > > > > > >>>>>> So, being able to set a quota on both the follower and the
> > > > > > >>>>>> leader side protects against both cases.
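> > > > > > >>>>>>
> > > > > > >>>>>> (A hypothetical illustration of the asymmetry: with a follower
> > > > > > >>>>>> quota of, say, 20 MB/s on each of the 5 new brokers, one
> > > > > > >>>>>> existing leader serving all 5 could still see up to
> > > > > > >>>>>> 5 x 20 = 100 MB/s outbound; conversely, with only a 20 MB/s
> > > > > > >>>>>> leader quota on each of 10 existing brokers, a single new
> > > > > > >>>>>> broker fetching from all of them could see up to 200 MB/s
> > > > > > >>>>>> inbound.)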
> > > > > > >>>>>>
> > > > > > >>>>>> Thanks,
> > > > > > >>>>>>
> > > > > > >>>>>> Jun
> > > > > > >>>>>>
> > > > > > >>>>>> On Tue, Aug 9, 2016 at 4:43 AM, Ben Stopford <
> > > b...@confluent.io>
> > > > > > >>> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>>> Hi Joel
> > > > > > >>>>>>>
> > > > > > >>>>>>> Thanks for taking the time to look at this. Appreciated.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Regarding throttling on both leader and follower, this
> > > > > > >>>>>>> proposal covers a more general solution which can guarantee a
> > > > > > >>>>>>> quota, even when a rebalance operation produces an asymmetric
> > > > > > >>>>>>> profile of load. This means administrators don’t need to
> > > > > > >>>>>>> calculate the impact that a follower-only quota will have on
> > > > > > >>>>>>> the leaders they are fetching from - for example, where
> > > > > > >>>>>>> replica sizes are skewed or where a partial rebalance is
> > > > > > >>>>>>> required.
> > > > > > >>>>>>>
> > > > > > >>>>>>> Having said that, even with both leader and follower quotas,
> > > > > > >>>>>>> the use of additional threads is actually optional. There
> > > > > > >>>>>>> appear to be two general approaches: (1) omit partitions from
> > > > > > >>>>>>> fetch requests (follower) / fetch responses (leader) when
> > > > > > >>>>>>> they exceed their quota; (2) delay them, as the existing
> > > > > > >>>>>>> quota mechanism does, using separate fetchers. Both appear
> > > > > > >>>>>>> valid, but with slightly different design tradeoffs.
> > > > > > >>>>>>>
> > > > > > >>>>>>> The issue with approach (1) is that it departs somewhat from
> > > > > > >>>>>>> the existing quotas implementation, and must include a notion
> > > > > > >>>>>>> of fairness within the now size-bounded request and response.
> > > > > > >>>>>>> The issue with (2) is guaranteeing ordering of updates when
> > > > > > >>>>>>> replicas shift threads, but this is handled, for the most
> > > > > > >>>>>>> part, in the code today.
> > > > > > >>>>>>>
> > > > > > >>>>>>> I’ve updated the rejected alternatives section to make this a
> > > > > > >>>>>>> little clearer.
> > > > > > >>>>>>>
> > > > > > >>>>>>> B
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>>> On 8 Aug 2016, at 20:38, Joel Koshy <
> jjkosh...@gmail.com>
> > > > > > >> wrote:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Hi Ben,
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Thanks for the detailed write-up. So the proposal involves
> > > > > > >>>>>>>> self-throttling on the fetcher side and throttling at the
> > > > > > >>>>>>>> leader. Can you elaborate on the reasoning that is given on
> > > > > > >>>>>>>> the wiki: *“The throttle is applied to both leaders and
> > > > > > >>>>>>>> followers. This allows the admin to exert strong guarantees
> > > > > > >>>>>>>> on the throttle limit”.* Is there any reason why one or the
> > > > > > >>>>>>>> other wouldn't be sufficient?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Specifically, if we were to only do self-throttling on the
> > > > > > >>>>>>>> fetchers, we could potentially avoid the additional replica
> > > > > > >>>>>>>> fetchers, right? i.e., the replica fetchers would maintain
> > > > > > >>>>>>>> their quota metrics as you proposed, and each (normal)
> > > > > > >>>>>>>> replica fetch presents an opportunity to make progress for
> > > > > > >>>>>>>> the throttled partitions as long as their effective
> > > > > > >>>>>>>> consumption rate is below the quota limit. If the
> > > > > > >>>>>>>> consumption rate exceeds the quota, then don’t include the
> > > > > > >>>>>>>> throttled partitions in the subsequent fetch requests until
> > > > > > >>>>>>>> the effective consumption rate for those partitions returns
> > > > > > >>>>>>>> to within the quota threshold.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I have more questions on the proposal, but was more
> > > > > > >>>>>>>> interested in the above to see if it could simplify things a
> > > > > > >>>>>>>> bit.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Also, can you open up access to the google-doc that you link
> > > > > > >>>>>>>> to?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Thanks,
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Joel
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> On Mon, Aug 8, 2016 at 5:54 AM, Ben Stopford <
> > > > b...@confluent.io
> > > > > > >>>
> > > > > > >>>>> wrote:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>> We’ve created KIP-73: Replication Quotas
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> The idea is to allow an admin to throttle moving replicas.
> > > > > > >>>>>>>>> Full details are here:
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Please take a look and let us know your thoughts.
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> Thanks
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>> B
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> --
> > > > > > >>>> -Regards,
> > > > > > >>>> Mayuresh R. Gharat
> > > > > > >>>> (862) 250-7125
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -Regards,
> > > > > Mayuresh R. Gharat
> > > > > (862) 250-7125
> > > > >
> > > >
> > >
> >
>
>
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>
