[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494596#comment-15494596 ]

Jiangjie Qin commented on KAFKA-1464:
-------------------------------------

It seems the PR title did not start with "KAFKA-1464" so the PR link is not updated. Anyway, the PR link is https://github.com/apache/kafka/pull/1776

> Add a throttling option to the Kafka replication tool
> -----------------------------------------------------
>
>                 Key: KAFKA-1464
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1464
>             Project: Kafka
>          Issue Type: New Feature
>          Components: replication
>    Affects Versions: 0.8.0
>            Reporter: mjuarez
>            Assignee: Ben Stopford
>            Priority: Minor
>              Labels: replication, replication-tools
>             Fix For: 0.10.1.0
>
>
> When performing replication on new nodes of a Kafka cluster, the replication
> process will use all available resources to replicate as fast as possible.
> This causes performance issues (mostly disk IO and sometimes network
> bandwidth) when doing this in a production environment, in which you're
> trying to serve downstream applications, at the same time you're performing
> maintenance on the Kafka cluster.
> An option to throttle the replication to a specific rate (in either MB/s or
> activities/second) would help production systems to better handle maintenance
> tasks while still serving downstream applications.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15372870#comment-15372870 ]

Ralph Weires commented on KAFKA-1464:
-------------------------------------

Another related idea then, since the consumer rebalancing issues that we hit during maintenance drove me up the wall yesterday... I'm just desperately looking for a way to get this stabilized (on our v0.8.2.1) ;)

Wouldn't a (manual and temporary) modification of the partition assignment also be a viable option to prevent a given node from becoming leader for any partitions? I mean, could I run kafka-reassign-partitions.sh with a customized partition assignment that wouldn't actually reassign any partitions to different brokers, but would merely change the replica *order* for several of the partitions, such that the node in question is no longer the first replica for any partition? If I understand it right, the controller will always prefer the first replica as leader when balancing, so I'd just need to make sure that my node isn't the first replica for anything. All this temporarily of course, so after the maintenance I'd restore the original partition assignment.

Should this work, or would you expect specific problems with this workaround...?

Also: let me know if this rather belongs on the mailing list, since admittedly it isn't really related to throttling... But as a side remark in this regard, I also tried throttling outside Kafka (i.e. on the network side, via wondershaper) in our problem case, but that didn't help. I'd agree this would need to be within Kafka, i.e. to be able to separate out-of-sync replica recovery traffic from the rest.
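The replica-reordering idea above can be sketched as a small script that rewrites a reassignment plan. This is only an illustration under stated assumptions: the topic name `events`, the broker ids, and the helper `demote_broker` are hypothetical, and the plan uses the JSON layout (`version`, `partitions`, `topic`, `partition`, `replicas`) that kafka-reassign-partitions.sh accepts. Whether the controller then behaves as hoped is exactly the open question in the comment.

```python
import json


def demote_broker(assignment, broker_id):
    """Return a copy of a reassignment plan in which broker_id is never the
    first (preferred) replica. The set of replicas per partition is unchanged."""
    partitions = []
    for p in assignment["partitions"]:
        replicas = list(p["replicas"])
        if len(replicas) > 1 and replicas[0] == broker_id:
            # Move the broker to the end of the list; only the order changes.
            replicas = replicas[1:] + [broker_id]
        partitions.append(
            {"topic": p["topic"], "partition": p["partition"], "replicas": replicas}
        )
    return {"version": assignment.get("version", 1), "partitions": partitions}


# Hypothetical current assignment; broker 3 is the node under maintenance.
current = {
    "version": 1,
    "partitions": [
        {"topic": "events", "partition": 0, "replicas": [3, 1, 2]},
        {"topic": "events", "partition": 1, "replicas": [1, 3, 2]},
    ],
}

plan = demote_broker(current, broker_id=3)
print(json.dumps(plan, indent=2))
```

The original plan would be kept around so that it can be re-applied after the maintenance window.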
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15372039#comment-15372039 ]

Jun Rao commented on KAFKA-1464:
--------------------------------

The controller does leader balancing. So, auto.leader.rebalance.enable needs to be set on the controller. However, the controller can move.
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371776#comment-15371776 ]

Ralph Weires commented on KAFKA-1464:
-------------------------------------

Thanks a lot for the input. So if I understand this right, the config setting James proposed would not work for me if I only set it on a single node (i.e. the node under maintenance) before starting it up again, correct? Otherwise, that would have been the perfect solution for me. I wouldn't mind running the node with the custom setting during recovery, and just restarting it once more at the end without the setting. If this won't work, what would even happen if this setting is defined differently on various nodes in the cluster?

Anyhow, alternatively I'd still consider using that option along with a full cluster restart before (and disabling it with another cluster restart afterwards), since a maintenance scenario as described happens every now and then for us, and currently causes us major hassle for many hours, every time.

Jun - I'm also not sure whether disabling leader balancing during catch-up would necessarily be a good idea in general, but having the possibility to configure this in some way would be a nice option to have IMO.
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371583#comment-15371583 ]

Jun Rao commented on KAFKA-1464:
--------------------------------

Currently, our leader balancing logic happens automatically on a per-partition basis. Turning this off requires a restart of all brokers. I am not sure if we always want to disable leader balancing during catch-up though. Balancing the leaders as the replicas catch up allows us to spread the client traffic across more brokers. Doing this may slow down the catch-up traffic a bit. However, this is probably fine if we do the throttling properly.
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371578#comment-15371578 ]

James Cheng commented on KAFKA-1464:
------------------------------------

[~r.weires], you might be able to control this a little by setting auto.leader.rebalance.enable=false. If you set it to false, then the broker would come up but would not assume leadership for any partitions at all, unless manually told to. You would then have to use the kafka-preferred-replica-election.sh tool [https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-1.PreferredReplicaLeaderElectionTool] to allow it to assume leadership.

This would mean that you wouldn't have the problem you described. But the downside is that you are now in charge of handling rebalancing on your own.

The auto.leader.rebalance.enable flag is not changeable at runtime, though. I think it is only read at startup time.
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371402#comment-15371402 ]

K Zakee commented on KAFKA-1464:
--------------------------------

I agree with Ralph. Let's say we have a high produce rate and a system failure (lasting as long as the Kafka retention period itself): there is a lot of data to catch up on, and it needs to happen as fast as possible. Throttling the catch-up of out-of-sync replicas in this case may become a "chase-your-own-tail" thing, and these replicas may never be able to catch up with their leader, or may take days, depending on the produce rate and the throttle limit.

Preventing new replicas from taking leadership until they have fully caught up sounds like a better idea.
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15327561#comment-15327561 ]

Ralph Weires commented on KAFKA-1464:
-------------------------------------

We have problems similar to those described by Jason above, in our case usually when taking a broker offline due to hardware failure (a broken HD, with each broker being equipped with 2 HDs / log directories in our case). If the broker gets back online with one fresh disk and correspondingly missing data (i.e. half of the partitions of that broker missing), its network link is saturated for some time by inbound traffic to catch up with replication. While the broker is re-streaming all the missing data, we additionally experience problems with consumers. After the broker has caught up with its missing data, the situation normalizes again quickly.

To me it seems as if the partitions for which the broker catches up soon after restart (esp. the ones from the non-broken HD, which had only little data missing) cause issues if the broker becomes leader for them while it is still clogging its incoming link with replication of the remaining data.

In this scenario, I would actually prefer to just let the broker catch up with any replication it still needs to do, without it becoming leader for any partition it has. Isn't there a way to achieve this? I.e. just keeping a broker online with replication and all, but not having it take over any partition leadership (at least so long as there are other candidates available for leadership). Being able to toggle that behavior at run-time would be ideal, so that we would just explicitly activate it again after the maintenance interval, once the node has caught up the bulk of the necessary replication. This could IMO be an alternative to any throttling approach.
> Add a throttling option to the Kafka replication tool > - > > Key: KAFKA-1464 > URL: https://issues.apache.org/jira/browse/KAFKA-1464 > Project: Kafka > Issue Type: New Feature > Components: replication >Affects Versions: 0.8.0 >Reporter: mjuarez >Assignee: Ismael Juma >Priority: Minor > Labels: replication, replication-tools > Fix For: 0.10.1.0 > > > When performing replication on new nodes of a Kafka cluster, the replication > process will use all available resources to replicate as fast as possible. > This causes performance issues (mostly disk IO and sometimes network > bandwidth) when doing this in a production environment, in which you're > trying to serve downstream applications, at the same time you're performing > maintenance on the Kafka cluster. > An option to throttle the replication to a specific rate (in either MB/s or > activities/second) would help production systems to better handle maintenance > tasks while still serving downstream applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241774#comment-15241774 ]

Jason Ruckman commented on KAFKA-1464:
--------------------------------------

Hello Neha,

One problem we've run into is that we run a system where we sometimes replace brokers completely, in an automated fashion, and rebalance leadership and replicas across them. When we bring a new broker online, we move some partitions to it. What we see is something like this:

Consider topics A, B, C with replication factors of 3
Consider brokers 1,2,3 as serving topics A,B,C
A new broker 4 is replacing 1 (maybe the machine died, or whatever)
A and B are relatively small, but C is large

1. Move some leaders and replicas to 4 for A and B from 2 and 3. Everything is good up until now
2. Move some leaders and replicas to 4 for C from 2 and 3.

At this point, broker 4 is pegged, since it's trying to pull in data from 2 and 3 (the other two replicas) to catch up, so it causes timeouts for partitions it is the leader for. Brokers 2 and 3 are ok because 4 can only use 1/2 of their bandwidth to replicate, so they still have some bandwidth available to serve requests.
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146250#comment-15146250 ]

Neha Narkhede commented on KAFKA-1464:
--------------------------------------

The most useful resource to throttle is the network bandwidth used by replication, as measured by the rate of total outgoing replication data on every leader. Adding the ability on every leader to cap the data transferred under an upper limit is what we are looking for. This can be a config option similar to the one we have for the log cleaner.

It seems to me that it is better to have the leader send less instead of having the replica fetch less, as the leader has a holistic view of the total amount of data being transferred out. Data transferred from a leader includes:

1. Fetch requests from an in-sync replica
2. Fetch requests from an out-of-sync replica of a partition being reassigned
3. Fetch requests from an out-of-sync replica of a partition not being reassigned

Data transferred across 1+2+3 should stay roughly within the configured upper limit. If the limit is crossed, we want to start throttling requests, all except the ones that fall under #1. The leader can divide the remaining available bandwidth amongst partitions that fall under #2 and #3, allowing more bandwidth to #3, since presumably it is fine to let partitions being reassigned catch up slower than the rest. Throttling could involve returning fewer bytes, as determined by this computation, for each such partition, as Jay suggests.
> Add a throttling option to the Kafka replication tool > - > > Key: KAFKA-1464 > URL: https://issues.apache.org/jira/browse/KAFKA-1464 > Project: Kafka > Issue Type: New Feature > Components: replication >Affects Versions: 0.8.0 >Reporter: mjuarez >Assignee: Ismael Juma >Priority: Minor > Labels: replication, replication-tools > Fix For: 0.9.1.0 > > > When performing replication on new nodes of a Kafka cluster, the replication > process will use all available resources to replicate as fast as possible. > This causes performance issues (mostly disk IO and sometimes network > bandwidth) when doing this in a production environment, in which you're > trying to serve downstream applications, at the same time you're performing > maintenance on the Kafka cluster. > An option to throttle the replication to a specific rate (in either MB/s or > activities/second) would help production systems to better handle maintenance > tasks while still serving downstream applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
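The bandwidth split described in this comment can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the function name, the `lagging_weight` parameter and its default of 2.0 are assumptions introduced here, standing in for "allowing more bandwidth to #3". All rates are in the same unit (for example MB/s).

```python
def allocate_replication_budget(limit, in_sync, reassigning, lagging, lagging_weight=2.0):
    """Split a leader's outbound replication budget across the three classes.

    in_sync     -- demand from in-sync replicas (#1), never throttled
    reassigning -- demand from out-of-sync replicas of partitions being reassigned (#2)
    lagging     -- demand from out-of-sync replicas of other partitions (#3)
    """
    # Class #1 is served in full; whatever is left is shared by #2 and #3.
    remaining = max(0.0, limit - in_sync)
    if reassigning + lagging <= remaining:
        return {"in_sync": in_sync, "reassigning": reassigning, "lagging": lagging}

    # Weighted split favouring #3, capped by what each class actually asks for.
    lagging_share = min(lagging, remaining * lagging_weight / (1.0 + lagging_weight))
    reassigning_share = min(reassigning, remaining - lagging_share)
    # Hand any bandwidth that #2 did not need back to #3.
    lagging_share = min(lagging, remaining - reassigning_share)
    return {"in_sync": in_sync, "reassigning": reassigning_share, "lagging": lagging_share}


shares = allocate_replication_budget(limit=100.0, in_sync=40.0, reassigning=80.0, lagging=80.0)
print(shares)
```

With a limit of 100 and in-sync demand of 40, the remaining 60 is split two-to-one in favour of the non-reassigned partitions in this sketch.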
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123618#comment-15123618 ]

Ismael Juma commented on KAFKA-1464:
------------------------------------

Thanks for your input [~jkreps]. With regard to the issue where a replica may never catch up, it is a good point that came up previously. One option may be to disable throttling (or increase the catch-up rate) in the case where the replica is falling further behind.

One important question is whether users have enough information to be able to configure an appropriate throttling/catch-up rate that takes into account both disk IO and network bandwidth while keeping resource utilisation at an appropriate level. Thoughts? (The log cleaner has a similar config, `log.cleaner.io.max.bytes.per.second`, although it seems simpler to figure out.)
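The option of raising the catch-up rate when a replica falls further behind could look something like the sketch below. The function name, the multiplicative `step` and the optional `max_rate` cap are assumptions made for illustration; nothing in the thread fixes how the rate would be adjusted.

```python
def next_catch_up_rate(current_rate, lag_before, lag_after, step=1.5, max_rate=None):
    """Return the catch-up limit for the next interval.

    If the replica's lag grew during the last interval, the limit is raised by
    the (assumed) factor `step`, optionally capped at `max_rate`. Otherwise the
    current limit is kept.
    """
    if lag_after > lag_before:
        raised = current_rate * step
        return raised if max_rate is None else min(raised, max_rate)
    return current_rate


# Lag grew from 100 to 120 units, so the limit is raised.
print(next_catch_up_rate(10.0, lag_before=100, lag_after=120))
# Lag shrank, so the limit stays where it is.
print(next_catch_up_rate(10.0, lag_before=120, lag_after=100))
```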
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15123956#comment-15123956 ]

Jiangjie Qin commented on KAFKA-1464:
-------------------------------------

It looks like our purpose is to minimize user impact while replicas are catching up. From the broker's point of view, as long as client request latency is acceptable, we should fully utilize the bandwidth we have to let replicas keep up. We should be able to measure the user experience by checking the queuing time of requests from and responses to clients.

If that is the case, maybe we can let the user set an SLA for latency. We would then not throttle replication as long as the user ProduceRequest / FetchRequest queuing time stays within that SLA. Otherwise, we would throttle the fetching from out-of-sync replicas (we probably don't want to throttle in-sync replicas).
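The latency-SLA idea can be sketched as a simple decision function. The name `should_throttle_catch_up`, the use of a percentile, and its default of 0.99 are assumptions for illustration only; the comment does not say how queuing time would be aggregated.

```python
def should_throttle_catch_up(queue_times_ms, sla_ms, percentile=0.99):
    """Decide whether out-of-sync replica fetches should be throttled.

    queue_times_ms -- recent client request queuing times, in milliseconds
    sla_ms         -- the user-configured latency SLA
    Returns True when the queuing time at the given percentile exceeds the SLA.
    """
    if not queue_times_ms:
        # No client traffic observed, so replication may use the bandwidth freely.
        return False
    ordered = sorted(queue_times_ms)
    index = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[index] > sla_ms


print(should_throttle_catch_up([1, 2, 3, 50], sla_ms=10))
print(should_throttle_catch_up([1, 2, 3, 4], sla_ms=10))
```

In-sync replica fetches are deliberately left out of this check, since the comment suggests they should not be throttled.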
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121706#comment-15121706 ]

Jay Kreps commented on KAFKA-1464:
----------------------------------

Another issue this raises is that a partition might have a natural rate of new data coming in that is higher than the catch-up rate in which case if it ever falls out of sync it can never catch up. This is possible today to some extent but not a common problem since the followers are, if anything, a bit faster than the leader and have no throttle.
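The arithmetic behind this concern is short enough to write down. In the sketch below the function name and the example figures are made up for illustration; the point is only that the backlog shrinks at the difference between the catch-up rate and the incoming rate.

```python
def catch_up_seconds(backlog_bytes, produce_rate, catch_up_rate):
    """Time for a follower to close its backlog, or None if it never can.

    The backlog shrinks at (catch_up_rate - produce_rate) bytes per second,
    because new data keeps arriving at the leader while the follower fetches.
    """
    net_rate = catch_up_rate - produce_rate
    if net_rate <= 0:
        return None  # the follower falls further behind, or stays level
    return backlog_bytes / net_rate


# 600 MB behind, 10 MB/s of new data, throttle at 30 MB/s: 30 seconds.
print(catch_up_seconds(600e6, produce_rate=10e6, catch_up_rate=30e6))
# Throttle at or below the incoming rate: the replica never catches up.
print(catch_up_seconds(600e6, produce_rate=10e6, catch_up_rate=10e6))
```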
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121691#comment-15121691 ]

Jay Kreps commented on KAFKA-1464:
----------------------------------

I agree that the key difference is in-sync vs out-of-sync replicas. In-sync replicas add to the commit time, so they are really the highest priority and generally shouldn't add much load anyway. Out-of-sync replicas are the catch-up case that adds load.

Blindly reducing the fetch size for out-of-sync partitions probably would make things worse though. A large fetch size is actually good for efficiency, and shrinking it will add overhead (more physical I/O, more FS reads, more requests overall, etc).

However, it should be possible to throttle dynamically at the partition level for out-of-sync partitions. This could be done by dynamically omitting partitions that have exceeded their throttle rate from either the fetch request that the follower sends or the fetch response the leader constructs. For example, when handling follower fetch requests the leader could check the observed fetch rate for that follower and whether it is in sync or not; if the rate exceeds the configured maximum for catch-up traffic, the leader would ignore that partition and only answer for other partitions (if there are no other partitions, the purgatory time would need to be calculated to be no greater than the time in which the fetch rate might come down below the throttle). This would allow for dynamically throttling down the catch-up traffic without reducing efficiency.
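The leader-side variant of this idea, leaving throttled partitions out of the fetch response, can be sketched as below. The `PartitionFetch` type, its field names and the partition names are hypothetical; the purgatory-time calculation mentioned in the comment is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class PartitionFetch:
    name: str             # hypothetical partition label, e.g. "a-0"
    in_sync: bool         # is the requesting follower in sync for this partition?
    observed_rate: float  # bytes/s recently served to that follower for this partition


def partitions_to_answer(fetches, max_catch_up_rate):
    """Return the partitions the leader would include in its fetch response.

    In-sync partitions are always answered. Out-of-sync partitions are skipped
    for this round once their observed rate exceeds the catch-up limit, so the
    fetch size for the partitions that are answered stays large.
    """
    return [
        f.name
        for f in fetches
        if f.in_sync or f.observed_rate <= max_catch_up_rate
    ]


request = [
    PartitionFetch("a-0", in_sync=True, observed_rate=50e6),
    PartitionFetch("b-0", in_sync=False, observed_rate=40e6),
    PartitionFetch("c-0", in_sync=False, observed_rate=5e6),
]
answered = partitions_to_answer(request, max_catch_up_rate=10e6)
print(answered)
```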
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15105639#comment-15105639 ]

Eno Thereska commented on KAFKA-1464:
-------------------------------------

An alternative to throttling background maintenance traffic is to use priority levels (just two: foreground and background). This has the advantage of being fairly simple, and it allows important replication work to proceed fast if there is little or no foreground traffic. If most of the contention happens at the disk (as [~mjuarez] seems to indicate), then priorities implemented as two queues at the receiving end could be sufficient. However, if the network is a problem as well, then throttling would probably work best since it limits background traffic at the source.
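The two-queue scheme can be illustrated with a few lines of code. The class name and the request labels are made up for this sketch, and it uses strict priority, which is one possible reading of "priorities implemented as two queues".

```python
from collections import deque


class TwoLevelQueue:
    """Strict two-level priority: background work runs only when no foreground work waits."""

    def __init__(self):
        self.foreground = deque()
        self.background = deque()

    def put(self, request, background=False):
        # Replication catch-up would be enqueued with background=True.
        (self.background if background else self.foreground).append(request)

    def get(self):
        """Return the next request to serve, or None if both queues are empty."""
        if self.foreground:
            return self.foreground.popleft()
        if self.background:
            return self.background.popleft()
        return None


queue = TwoLevelQueue()
queue.put("replica-fetch", background=True)
queue.put("produce")
first, second, third = queue.get(), queue.get(), queue.get()
print(first, second, third)
```

Strict priority lets background replication use all idle capacity but can starve it under sustained foreground load, which is where a throttle or a weighted scheme would differ.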
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654015#comment-14654015 ] Ismael Juma commented on KAFKA-1464: I'd like to take a look at this. In a separate conversation, [~junrao] suggested that the throttling should perhaps only happen for out of sync replicas. Add a throttling option to the Kafka replication tool - Key: KAFKA-1464 URL: https://issues.apache.org/jira/browse/KAFKA-1464 Project: Kafka Issue Type: New Feature Components: replication Affects Versions: 0.8.0 Reporter: mjuarez Assignee: Ismael Juma Priority: Minor Labels: replication, replication-tools When performing replication on new nodes of a Kafka cluster, the replication process will use all available resources to replicate as fast as possible. This causes performance issues (mostly disk IO and sometimes network bandwidth) when doing this in a production environment, in which you're trying to serve downstream applications, at the same time you're performing maintenance on the Kafka cluster. An option to throttle the replication to a specific rate (in either MB/s or activities/second) would help production systems to better handle maintenance tasks while still serving downstream applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654052#comment-14654052 ]

Jun Rao commented on KAFKA-1464:
--------------------------------

Another thing that we need to be a bit careful about is that throttling typically just slows down a request. However, in our case, a single replica fetch request may cover multiple partitions, and we don't want to slow down the in-sync replicas. Perhaps we should always respond as soon as possible but just give back less data for out-of-sync replicas.

Add a throttling option to the Kafka replication tool
-----------------------------------------------------

Key: KAFKA-1464
URL: https://issues.apache.org/jira/browse/KAFKA-1464
Project: Kafka
Issue Type: New Feature
Components: replication
Affects Versions: 0.8.0
Reporter: mjuarez
Assignee: Ismael Juma
Priority: Minor
Labels: replication, replication-tools

When performing replication on new nodes of a Kafka cluster, the replication process will use all available resources to replicate as fast as possible. This causes performance issues (mostly disk IO and sometimes network bandwidth) when doing this in a production environment, in which you're trying to serve downstream applications, at the same time you're performing maintenance on the Kafka cluster.
An option to throttle the replication to a specific rate (in either MB/s or activities/second) would help production systems to better handle maintenance tasks while still serving downstream applications.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
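The "respond immediately, but return less data for out-of-sync replicas" idea above might look roughly like the following sketch. It is an illustration only, not how Kafka implements it: the function `plan_fetch_response`, its parameters, and the partition names are hypothetical, and the sketch assumes a single per-response byte budget shared by all out-of-sync partitions.

```python
def plan_fetch_response(available_bytes, in_sync, catch_up_budget):
    """Decide how many bytes to return per partition in one fetch response.

    available_bytes: dict of partition -> bytes ready to send.
    in_sync:         set of partitions whose follower is in sync.
    catch_up_budget: total bytes allowed for out-of-sync partitions
                     in this response (hypothetical throttle knob).
    """
    plan = {}
    remaining = catch_up_budget
    for partition, nbytes in available_bytes.items():
        if partition in in_sync:
            # In-sync partitions are never throttled.
            plan[partition] = nbytes
        else:
            # Out-of-sync partitions share whatever budget is left.
            granted = min(nbytes, remaining)
            plan[partition] = granted
            remaining -= granted
    return plan


available = {"t-0": 500, "t-1": 4000, "t-2": 3000}
plan = plan_fetch_response(available, in_sync={"t-0"}, catch_up_budget=5000)
print(plan)
# {'t-0': 500, 't-1': 4000, 't-2': 1000}
```

The response is built without any added delay, so the in-sync partition `t-0` is unaffected, while the two catching-up partitions together receive at most the budget. A real implementation would also have to turn a rate (bytes per second) into a per-response budget, which this sketch leaves out.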
[jira] [Commented] (KAFKA-1464) Add a throttling option to the Kafka replication tool
[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005052#comment-14005052 ]

Jon Bringhurst commented on KAFKA-1464:
---------------------------------------

Although this would be nice to have, something similar already exists as part of Linux. You can accomplish the same type of thing by first adding the process to a net_cls cgroup. Then you can use the tc command to classify the marked packets into an htb qdisc (possibly with an sfq further down the tree to completely prevent starvation) to throttle the packets coming from Kafka.

* https://www.kernel.org/doc/Documentation/cgroups/net_cls.txt
* http://www.tldp.org/
* http://linux.die.net/man/8/tc

The blkio cgroup works in a similar way to throttle disk I/O.

Add a throttling option to the Kafka replication tool
-----------------------------------------------------

Key: KAFKA-1464
URL: https://issues.apache.org/jira/browse/KAFKA-1464
Project: Kafka
Issue Type: New Feature
Components: replication
Affects Versions: 0.8.0
Reporter: Marcos Juarez
Assignee: Neha Narkhede
Priority: Minor
Labels: replication, replication-tools

When performing replication on new nodes of a Kafka cluster, the replication process will use all available resources to replicate as fast as possible. This causes performance issues (mostly disk IO and sometimes network bandwidth) when doing this in a production environment, in which you're trying to serve downstream applications, at the same time you're performing maintenance on the Kafka cluster.
An option to throttle the replication to a specific rate (in either MB/s or activities/second) would help production systems to better handle maintenance tasks while still serving downstream applications.

--
This message was sent by Atlassian JIRA (v6.2#6252)
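For the net_cls and tc approach described above, the following sketch only builds the command strings and does not run them. The command shapes follow the example in the kernel's net_cls.txt; the helper names, the device `eth0`, the rate `40mbit`, and the cgroup path are assumptions for illustration. It assumes cgroup v1 with net_cls already mounted and the cgroup directory already created; the broker's PID would still need to be added to that cgroup.

```python
def net_cls_classid(major, minor):
    """Encode a tc handle major:minor as a net_cls.classid value.

    The kernel expects the form 0xAAAABBBB, where AAAA is the major
    number and BBBB the minor number of the traffic class.
    """
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError("major and minor must each fit in 16 bits")
    return hex((major << 16) | minor)


def throttle_commands(dev, major, minor, rate, cgroup_dir):
    """Return shell commands that shape traffic from one net_cls cgroup.

    All arguments are illustrative; nothing is executed here.
    """
    handle = f"{major:x}:"            # tc handles are written in hex
    classid = f"{major:x}:{minor:x}"
    return [
        # Tag packets from processes in the cgroup with the class id.
        f"echo {net_cls_classid(major, minor)} > {cgroup_dir}/net_cls.classid",
        # Root htb qdisc on the outgoing device.
        f"tc qdisc add dev {dev} root handle {handle} htb",
        # Rate-limited class that the tagged packets will fall into.
        f"tc class add dev {dev} parent {handle} classid {classid} htb rate {rate}",
        # cgroup filter: classify packets by their net_cls tag.
        f"tc filter add dev {dev} parent {handle} protocol ip prio 10 handle 1: cgroup",
    ]


for command in throttle_commands(
    "eth0", 0x10, 0x1, "40mbit", "/sys/fs/cgroup/net_cls/kafka"
):
    print(command)
```

This shapes all outgoing traffic of the processes in the cgroup, so it cannot tell replication traffic apart from client traffic served by the same broker.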