Hi Harry, Thanks for the updates!
Yes, the proposed metric looks good. If the user runs the
kafka-reassign-partitions script with a throttle set, then the static
throttle gets overwritten until the reassignment completes. Can you
clarify this in the KIP?

--
Kamal

On Sun, Jul 14, 2024 at 9:59 PM Harry Fallows
<harryfall...@protonmail.com.invalid> wrote:

> Hi Kamal,
>
> Thank you for reading KIP-1051!
>
> Yes, it's true that it can impact regular replication traffic. However,
> network throughput is bounded, so regardless of whether we allow it as a
> config in Kafka or not, there is always a chance that replication traffic
> will get throttled. Having it as a config will at least ensure that the
> entire bandwidth is not taken up by replication traffic.
>
> I agree that the nature of the leader replication throttling is dependent
> on how many followers there are. However, I don't think it's dependent on
> the partition assignment strategy or the number of brokers; it should only
> be dependent on the replication factor. I think it's key to point out here
> that these configurations do not need to be "optimised" for use cases with
> different replication factors, they just need to be set to match the
> infrastructure that they are deployed in. For example, if you have a
> maximum network bandwidth of 200MB/s and a replication factor of 3, you
> may set follower.replication.throttled.rate to 150MB/s, to reserve some
> bandwidth for other traffic (e.g. producing and consuming). In this case,
> if you start with all replicas in sync, I don't think it's possible for the
> follower throttling to be the sole cause of a replica falling out of sync.
> It may be the case that it takes longer for an out-of-sync replica to
> become in sync, but in that case the replication throttling just serves to
> keep other traffic from getting throttled (e.g. producer traffic to a
> different partition).
> Even so, it is possible that misconfiguring these values could cause
> issues, so the potential consequences should be clearly documented.
>
> I think the concern about producing spikes causing ISR issues is only an
> issue if these values are poorly configured. I think in general if these
> values are always configured as >=
> (replicationFactor/(replicationFactor+1))*maxBandwidth (e.g. like the above
> example: 3/(3+1) * 200 = 150), then even if 100% of the non-replication
> traffic is producer traffic, all followers should be able to stay in sync.
>
> I like the idea of emitting a metric for when a quota is breached. What do
> you think about having it as a gauge for the number of partitions that are
> currently leader or follower throttled (similar to the URP metric)?
>
> Kind regards,
> Harry
>
> On Thursday, 11 July 2024 at 19:02, Kamal Chandraprakash
> <kamal.chandraprak...@gmail.com> wrote:
>
> > Hi Harry Fallows,
> >
> > Thanks for the KIP!
> >
> > I went over both KIP-1051 and KIP-1009. Assuming that
> > leader.replication.throttled.replicas and
> > follower.replication.throttled.replicas are set to the wildcard (*) to
> > apply to all the partitions in the broker, if we set a static value for
> > the leader and follower replication throttled rate, then it might impact
> > the normal replication traffic.
> >
> > The throttling rate depends on the number of brokers in the cluster. If
> > the cluster contains 100+ brokers, then the
> > leader.replication.throttled.rate is shared across all the followers.
> > The number of followers reading data from the leader depends on the
> > partition assignment strategy. If the leader replication throttle is
> > breached, then the follower might fail to catch up with the leader.
> >
> > If there are sudden spikes in a specific set of topics/partitions in the
> > cluster, then the replicas might fail to join the ISR, which can impact
> > the cluster reliability.
> > If we are going with this proposal, then we may also have to emit a
> > metric to inform the administrator that the leader/follower replication
> > quota is breached.
> >
> > --
> > Kamal
> >
> > On Thu, Jul 4, 2024 at 8:10 PM Harry Fallows
> > harryfall...@protonmail.com.invalid wrote:
> >
> > > Hi everyone,
> > >
> > > Bumping this one last time before I call a vote. Please take a look if
> > > you're interested in replication throttling and/or static/dynamic
> > > config.
> > >
> > > Kind regards,
> > > Harry
> > >
> > > On Thursday, 13 June 2024 at 19:39, Harry Fallows
> > > <harryfall...@protonmail.com.INVALID> wrote:
> > >
> > > > Hi Hector,
> > > >
> > > > I did see your colleague's KIP, and I actually mentioned it in the
> > > > KIP that I have written. As I see it, both of these KIPs move towards
> > > > more easily configurable replication throttling and both should be
> > > > implemented. KIP-1009 makes it easier to enable throttling and
> > > > KIP-1051 makes it easier to apply a throttle rate. I did try to look
> > > > at supporting KIP-1009 in the discussion thread; however, I only
> > > > subscribed to the mailing list after it was published and I couldn't
> > > > figure out how to respond to it in Pony Mail. I would definitely be
> > > > interested in partnering up to get both changes across the line,
> > > > whether that be by combining them or supporting both individually
> > > > (I'm not sure which is best, this is my first contribution!).
> > > >
> > > > I also see that KAFKA-10190 is mentioned in KIP-1009 as a related
> > > > ticket. Coincidentally, I raised a PR to address this bug a couple of
> > > > days ago (https://github.com/apache/kafka/pull/16280). I think this
> > > > is also a change that will move towards more easily configurable
> > > > replication throttling, as it allows configuring the throttle rate
> > > > across the whole cluster via a default value.
> > > > As far as I understand, this change does not need a KIP though,
> > > > because it is a bugfix (the current behaviour of ignoring the default
> > > > is unintentional).
> > > >
> > > > Let me know what you think.
> > > >
> > > > Kind regards,
> > > > Harry
> > > >
> > > > -------- Original Message --------
> > > > On 6/13/24 19:08, Hector Geraldino (BLOOMBERG/ 919 3RD A)
> > > > hgerald...@bloomberg.net wrote:
> > > >
> > > > > Hi Harry,
> > > > >
> > > > > A colleague of mine opened KIP-1009: Add Broker-level Throttle
> > > > > Configurations, which aims to achieve the same goal (although from
> > > > > a different angle).
> > > > >
> > > > > Can you please take a look and see if this would work for the
> > > > > things you have in mind? Maybe we can partner and coalesce around
> > > > > either KIP and try to push it over the finish line.
> > > > >
> > > > > KIP:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1009%3A+Add+Broker-level+Throttle+Configurations
> > > > >
> > > > > From: dev@kafka.apache.org At: 06/13/24 09:22:40 UTC-4:00
> > > > > To: dev@kafka.apache.org
> > > > > Subject: Re: [DISCUSS] KIP-1051 Statically configured log
> > > > > replication throttling
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > Bumping this thread, as I haven't yet had any replies.
> > > > >
> > > > > Kind regards,
> > > > > Harry
> > > > >
> > > > > On Thursday, 6 June 2024 at 17:59, Harry Fallows
> > > > > harryfall...@protonmail.com.INVALID wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I would like to propose a change to allow the static
> > > > > > configuration of leader and follower replication throttling
> > > > > > rates.
> > > > > >
> > > > > > These configurations are very useful for preventing client
> > > > > > traffic from getting throttled by replication traffic during
> > > > > > events that cause a spike in replication.
> > > > > > Currently they are only configurable dynamically, which means
> > > > > > they are only really useful for throttling replication traffic
> > > > > > during planned events. By allowing these configurations to be
> > > > > > set statically, they can be used to prevent client traffic
> > > > > > throttling during unplanned events.
> > > > > >
> > > > > > KIP:
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1051%3A+Statically+configured+log+replication+throttling
> > > > > >
> > > > > > Best regards,
> > > > > > Harry Fallows
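
For reference, the sizing rule of thumb proposed earlier in this thread, (replicationFactor/(replicationFactor+1))*maxBandwidth, can be sketched roughly as below. This is only an illustration of the arithmetic: the helper name is hypothetical and not part of Kafka or the KIP, and the conversion to bytes per second assumes 1 MB = 1,000,000 bytes (Kafka's existing throttle rate configs are documented in bytes per second).

```python
def min_throttle_mb_per_s(replication_factor: int,
                          max_bandwidth_mb_per_s: float) -> float:
    """Lower bound suggested in the thread: (RF / (RF + 1)) * maxBandwidth.

    Hypothetical helper for illustration only; not part of Kafka or the KIP.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    if max_bandwidth_mb_per_s <= 0:
        raise ValueError("max_bandwidth_mb_per_s must be positive")
    return replication_factor / (replication_factor + 1) * max_bandwidth_mb_per_s


# The example from the thread: replication factor 3 and 200 MB/s of bandwidth.
throttle_mb = min_throttle_mb_per_s(3, 200)

# Assumption: 1 MB = 1,000,000 bytes when expressing the rate in bytes/sec.
throttle_bytes = int(throttle_mb * 1_000_000)

print(throttle_mb, throttle_bytes)
```

With the thread's numbers this reproduces the 150 MB/s figure. Whether that bound actually keeps every follower in sync depends on the assumptions discussed above, so treat the result as a starting point rather than a guarantee.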