Hi Mayuresh

That’s a good question and something that should be covered. 

I think if the leader throttles partitions so that the response becomes too
small, the request should be automatically delayed in Purgatory. Likewise, on
the follower, if the fetch request contains no partitions, it will again be
delayed automatically. There is some code in the fetchers that does this,
based on the setting replica.fetch.backoff.ms.
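
Just to illustrate the follower side (the names below are made up for the
sketch, they're not the actual fetcher code), the back-off amounts to
something like:

    if (fetchRequest.partitions.isEmpty) {
      // nothing left to fetch because everything was throttled away; back off
      // for replica.fetch.backoff.ms rather than spinning with empty requests
      // (the real fetcher would use an interruptible wait rather than a sleep)
      Thread.sleep(replicaFetchBackoffMs)
    }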

B


> On 11 Aug 2016, at 05:17, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
> 
> This might have been answered before.
> I was wondering: when the leader quota is reached, the leader sends an empty
> response (If the inclusion of a partition, listed in the leader's
> throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that
> partition is omitted from the response (aka returns 0 bytes).). At this
> point the follower quota is NOT reached and the follower is still going to
> ask for that partition in the next fetch request. Would it be fair to
> add some logic there so that the follower backs off (for some configurable
> time) from including those partitions in the next fetch request?
> 
> Thanks,
> 
> Mayuresh
> 
> On Wed, Aug 10, 2016 at 8:06 AM, Ben Stopford <b...@confluent.io> wrote:
> 
>> Thanks again for the responses everyone. I’ve removed the extra
>> fetcher threads from the proposal, switching to the inclusion-based
>> approach. The relevant section is:
>> 
>> The follower makes a request, using the fixed size of
>> replica.fetch.response.max.bytes as per KIP-74 <https://cwiki.apache.org/
>> confluence/display/KAFKA/KIP-74%3A+Add+Fetch+Response+Size+Limit+in+Bytes>.
>> The order of the partitions in the fetch request is randomised to ensure
>> fairness.
>> When the leader receives the fetch request it processes the partitions in
>> the defined order, up to the response's size limit. If the inclusion of a
>> partition, listed in the leader's throttled-replicas list, causes the
>> LeaderQuotaRate to be exceeded, that partition is omitted from the response
>> (aka returns 0 bytes). Logically, this is of the form:
>> var bytesAllowedForThrottledPartition = quota.recordAndMaybeAdjust(bytesRequestedForPartition)
>> When the follower receives the fetch response, if it includes partitions
>> in its throttled-partitions list, it increments the FollowerQuotaRate:
>> var includeThrottledPartitionsInNextRequest: Boolean =
>> quota.recordAndEvaluate(previousResponseThrottledBytes)
>> If the quota is exceeded, no throttled partitions will be included in the
>> next fetch request emitted by this replica fetcher thread.
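>> 
>> As a rough, end-to-end sketch (the helper names here are illustrative only,
>> not the final implementation), this amounts to:
>> 
>>   // leader: for each partition, in the randomised request order, until the
>>   // response size limit is reached
>>   val allowed =
>>     if (leaderThrottledReplicas.contains(partition))
>>       quota.recordAndMaybeAdjust(bytesRequestedForPartition) // 0 if LeaderQuotaRate would be exceeded
>>     else
>>       bytesRequestedForPartition
>>   response.add(partition, read(partition, allowed)) // 0 bytes => partition omitted
>> 
>>   // follower: on receiving the response, record the throttled bytes and
>>   // decide whether the next fetch may include throttled partitions
>>   val includeThrottledPartitionsInNextRequest =
>>     quota.recordAndEvaluate(previousResponseThrottledBytes)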
>> 
>> B
>> 
>>> On 9 Aug 2016, at 23:34, Jun Rao <j...@confluent.io> wrote:
>>> 
>>> When there are several unthrottled replicas, we could also just do what's
>>> suggested in KIP-74. The client is responsible for reordering the
>>> partitions and the leader fills in the bytes to those partitions in
>> order,
>>> up to the quota limit.
>>> 
>>> We could also do what you suggested. If quota is exceeded, include empty
>>> data in the response for throttled replicas. Keep doing that until enough
>>> time has passed so that the quota is no longer exceeded. This potentially
>>> allows better batching per partition. Not sure if the two make a big
>>> difference in practice though.
>>> 
>>> Thanks,
>>> 
>>> Jun
>>> 
>>> 
>>> On Tue, Aug 9, 2016 at 2:31 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On the leader side, one challenge is related to the fairness issue that
>>>> Ben
>>>>> brought up. The question is what if the fetch response limit is filled
>> up
>>>>> by the throttled replicas? If this happens constantly, we will delay
>> the
>>>>> progress of those un-throttled replicas. However, I think we can
>> address
>>>>> this issue by trying to fill up the unthrottled replicas in the
>> response
>>>>> first. So, the algorithm would be. Fill up unthrottled replicas up to
>> the
>>>>> fetch response limit. If there is space left, fill up throttled
>> replicas.
>>>>> If quota is exceeded for the throttled replicas, reduce the bytes in
>> the
>>>>> throttled replicas in the response accordingly.
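>>>>> 
>>>>> A rough sketch of that ordering (illustrative names only):
>>>>> 
>>>>>   val (unthrottled, throttled) = partitions.partition(p => !throttledReplicas.contains(p))
>>>>>   var remaining = fetchResponseLimit
>>>>>   for (p <- unthrottled ++ throttled) {          // unthrottled replicas fill first
>>>>>     var bytes = math.min(remaining, bytesAvailable(p))
>>>>>     if (throttledReplicas.contains(p))
>>>>>       bytes = quota.recordAndMaybeAdjust(bytes)  // shrink if the quota would be exceeded
>>>>>     remaining -= bytes
>>>>>     response.add(p, read(p, bytes))
>>>>>   }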
>>>>> 
>>>> 
>>>> Right - that's what I was trying to convey by truncation (vs empty). So
>> we
>>>> would attempt to fill the response for throttled partitions as much as
>> we
>>>> can before hitting the quota limit. There is one more detail to handle
>> in
>>>> this: if there are several throttled partitions and not enough remaining
>>>> allowance in the fetch response to include all the throttled replicas
>> then
>>>> we would need to decide which of those partitions get a share; which is
>> why
>>>> I'm wondering if it is easier to return empty for those partitions
>> entirely
>>>> in the fetch response - they will make progress in the subsequent
>> fetch. If
>>>> they don't make fast enough progress then that would be a case for
>> raising
>>>> the threshold or letting it complete at an off-peak time.
>>>> 
>>>> 
>>>>> 
>>>>> With this approach, we need some new logic to handle throttling on the
>>>>> leader, but we can leave the replica threading model unchanged. So,
>>>>> overall, this still seems to be a simpler approach.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Jun
>>>>> 
>>>>> On Tue, Aug 9, 2016 at 11:57 AM, Mayuresh Gharat <
>>>>> gharatmayures...@gmail.com
>>>>>> wrote:
>>>>> 
>>>>>> Nice write up Ben.
>>>>>> 
>>>>>> I agree with Joel for keeping this simple by excluding the partitions
>>>>> from
>>>>>> the fetch request/response when the quota is violated at the follower
>>>> or
>>>>>> leader instead of having a separate set of threads for handling the
>>>> quota
>>>>>> and non-quota cases. Even though it's different from the current quota
>>>>>> implementation, it should be OK since it's internal to brokers and can
>> be
>>>>>> handled by tuning the quota configs for it appropriately by the
>> admins.
>>>>>> 
>>>>>> Also, can you elaborate with an example on how this would be handled:
>>>>>> *guaranteeing
>>>>>> ordering of updates when replicas shift threads*
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Mayuresh
>>>>>> 
>>>>>> 
>>>>>> On Tue, Aug 9, 2016 at 10:49 AM, Joel Koshy <jjkosh...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>>> On the need for both leader/follower throttling: that makes sense -
>>>>>> thanks
>>>>>>> for clarifying. For completeness, can we add this detail to the doc -
>>>>>> say,
>>>>>>> after the quote that I pasted earlier?
>>>>>>> 
>>>>>>> From an implementation perspective though: I’m still interested in
>>>> the
>>>>>>> simplicity of not having to add separate replica fetchers, delay
>>>> queue
>>>>> on
>>>>>>> the leader, and “move” partitions from the throttled replica fetchers
>>>>> to
>>>>>>> the regular replica fetchers once caught up.
>>>>>>> 
>>>>>>> Instead, I think it would work and be simpler to include or exclude
>>>> the
>>>>>>> partitions in the fetch request from the follower and fetch response
>>>>> from
>>>>>>> the leader when the quota is violated. The issue of fairness that Ben
>>>>>> noted
>>>>>>> may be a wash between the two options (that Ben wrote in his email).
>>>>> With
>>>>>>> the default quota delay mechanism, partitions get delayed essentially
>>>>> at
>>>>>>> random - i.e., whoever fetches at the time of quota violation gets
>>>>>> delayed
>>>>>>> at the leader. So we can adopt a similar policy in choosing to
>>>> truncate
>>>>>>> partitions in fetch responses. i.e., if at the time of handling the
>>>>> fetch
>>>>>>> the “effect” replication rate exceeds the quota then either empty or
>>>>>>> truncate those partitions from the response. (BTW effect replication
>>>> is
>>>>>>> your terminology in the wiki - i.e., replication due to partition
>>>>>>> reassignment, adding brokers, etc.)
>>>>>>> 
>>>>>>> While this may be slightly different from the existing quota
>>>> mechanism
>>>>> I
>>>>>>> think the difference is small (since we would reuse the quota manager
>>>>> at
>>>>>>> worst with some refactoring) and will be internal to the broker.
>>>>>>> 
>>>>>>> So I guess the question is if this alternative is simpler enough and
>>>>>>> equally functional to not go with dedicated throttled replica
>>>> fetchers.
>>>>>>> 
>>>>>>> On Tue, Aug 9, 2016 at 9:44 AM, Jun Rao <j...@confluent.io> wrote:
>>>>>>> 
>>>>>>>> Just to elaborate on what Ben said why we need throttling on both
>>>> the
>>>>>>>> leader and the follower side.
>>>>>>>> 
>>>>>>>> If we only have throttling on the follower side, consider a case
>>>> that
>>>>>> we
>>>>>>>> add 5 more new brokers and want to move some replicas from existing
>>>>>>> brokers
>>>>>>>> over to those 5 brokers. Each of those brokers is going to fetch
>>>> data
>>>>>> from
>>>>>>>> all existing brokers. Then, it's possible that the aggregated fetch
>>>>>> load
>>>>>>>> from those 5 brokers on a particular existing broker exceeds its
>>>>>> outgoing
>>>>>>>> network bandwidth, even though the inbound traffic on each of
>>>>> those
>>>>>> 5
>>>>>>>> brokers is bounded.
>>>>>>>> 
>>>>>>>> If we only have throttling on the leader side, consider the same
>>>>>> example
>>>>>>>> above. It's possible for the incoming traffic to each of those 5
>>>>>> brokers
>>>>>>> to
>>>>>>>> exceed its network bandwidth since it is fetching data from all
>>>>>> existing
>>>>>>>> brokers.
>>>>>>>> 
>>>>>>>> So, being able to set a quota on both the follower and the leader
>>>>> side
>>>>>>>> protects both cases.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Jun
>>>>>>>> 
>>>>>>>> On Tue, Aug 9, 2016 at 4:43 AM, Ben Stopford <b...@confluent.io>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi Joel
>>>>>>>>> 
>>>>>>>>> Thanks for taking the time to look at this. Appreciated.
>>>>>>>>> 
>>>>>>>>> Regarding throttling on both leader and follower, this proposal
>>>>>> covers
>>>>>>> a
>>>>>>>>> more general solution which can guarantee a quota, even when a
>>>>>>> rebalance
>>>>>>>>> operation produces an asymmetric profile of load. This means
>>>>>>>> administrators
>>>>>>>>> don’t need to calculate the impact that a follower-only quota
>>>> will
>>>>>> have
>>>>>>>> on
>>>>>>>>> the leaders they are fetching from. So for example where replica
>>>>>> sizes
>>>>>>>> are
>>>>>>>>> skewed or where a partial rebalance is required.
>>>>>>>>> 
>>>>>>>>> Having said that, even with both leader and follower quotas, the
>>>>> use
>>>>>> of
>>>>>>>>> additional threads is actually optional. There appear to be two
>>>>>> general
>>>>>>>>> approaches (1) omit partitions from fetch requests (follower) /
>>>>> fetch
>>>>>>>>> responses (leader) when they exceed their quota (2) delay them,
>>>> as
>>>>>> the
>>>>>>>>> existing quota mechanism does, using separate fetchers. Both
>>>> appear
>>>>>>>> valid,
>>>>>>>>> but with slightly different design tradeoffs.
>>>>>>>>> 
>>>>>>>>> The issue with approach (1) is that it departs somewhat from the
>>>>>>> existing
>>>>>>>>> quotas implementation, and must include a notion of fairness within
>>>>>>>>> the now size-bounded request and response. The issue with (2) is
>>>>>>>> guaranteeing
>>>>>>>>> ordering of updates when replicas shift threads, but this is
>>>>> handled,
>>>>>>> for
>>>>>>>>> the most part, in the code today.
>>>>>>>>> 
>>>>>>>>> I’ve updated the rejected alternatives section to make this a
>>>>> little
>>>>>>>>> clearer.
>>>>>>>>> 
>>>>>>>>> B
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 8 Aug 2016, at 20:38, Joel Koshy <jjkosh...@gmail.com>
>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Ben,
>>>>>>>>>> 
>>>>>>>>>> Thanks for the detailed write-up. So the proposal involves
>>>>>>>>> self-throttling
>>>>>>>>>> on the fetcher side and throttling at the leader. Can you
>>>>> elaborate
>>>>>>> on
>>>>>>>>> the
>>>>>>>>>> reasoning that is given on the wiki: *“The throttle is applied
>>>> to
>>>>>>> both
>>>>>>>>>> leaders and followers. This allows the admin to exert strong
>>>>>>> guarantees
>>>>>>>>> on
>>>>>>>>>> the throttle limit".* Is there any reason why one or the other
>>>>>>> wouldn't
>>>>>>>>> be
>>>>>>>>>> sufficient.
>>>>>>>>>> 
>>>>>>>>>> Specifically, if we were to only do self-throttling on the
>>>>>> fetchers,
>>>>>>> we
>>>>>>>>>> could potentially avoid the additional replica fetchers right?
>>>>>> i.e.,
>>>>>>>> the
>>>>>>>>>> replica fetchers would maintain their quota metrics as you
>>>> proposed
>>>>>> and
>>>>>>>>> each
>>>>>>>>>> (normal) replica fetch presents an opportunity to make progress
>>>>> for
>>>>>>> the
>>>>>>>>>> throttled partitions as long as their effective consumption
>>>> rate
>>>>> is
>>>>>>>> below
>>>>>>>>>> the quota limit. If it exceeds the consumption rate then don’t
>>>>>>> include
>>>>>>>>> the
>>>>>>>>>> throttled partitions in the subsequent fetch requests until the
>>>>>>>> effective
>>>>>>>>>> consumption rate for those partitions returns to within the
>>>> quota
>>>>>>>>> threshold.
>>>>>>>>>> 
>>>>>>>>>> I have more questions on the proposal, but was more interested
>>>> in
>>>>>> the
>>>>>>>>> above
>>>>>>>>>> to see if it could simplify things a bit.
>>>>>>>>>> 
>>>>>>>>>> Also, can you open up access to the google-doc that you link
>>>> to?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> 
>>>>>>>>>> Joel
>>>>>>>>>> 
>>>>>>>>>> On Mon, Aug 8, 2016 at 5:54 AM, Ben Stopford <b...@confluent.io
>>>>> 
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> We’ve created KIP-73: Replication Quotas
>>>>>>>>>>> 
>>>>>>>>>>> The idea is to allow an admin to throttle moving replicas.
>>>> Full
>>>>>>>> details
>>>>>>>>>>> are here:
>>>>>>>>>>> 
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+
>>>>>>>>>>> Replication+Quotas <https://cwiki.apache.org/conf
>>>>>>>>>>> luence/display/KAFKA/KIP-73+Replication+Quotas>
>>>>>>>>>>> 
>>>>>>>>>>> Please take a look and let us know your thoughts.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> 
>>>>>>>>>>> B
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> -Regards,
>>>>>> Mayuresh R. Gharat
>>>>>> (862) 250-7125
>>>>>> 
>>>>> 
>>>> 
>> 
>> 
> 
> 
> -- 
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
