Thanks for the KIP, Gokul! I like the overall premise - I think it's more user-friendly to have configs for this than to have users implement their own config policy - so unless it's very complex to implement, it seems worth it. I agree that having the topic policy on the CreatePartitions path makes sense as well.
Multi-tenancy was a good point. It would be interesting to see how easy it is to extend the max partition limit to a per-user basis. Perhaps this can be done in a follow-up KIP, as a natural extension of the feature.

I'm wondering whether there's a need to enforce this on internal topics, though. Given they're internal and critical to the function of Kafka, I believe we'd rather always ensure they're created, regardless of whether that exceeds some user-set limit. It brings up the question of forward compatibility - what happens if a user's cluster is at the maximum partition capacity, yet a new release of Kafka introduces a new topic (e.g. KIP-500)?

Best,
Stanislav

On Fri, Apr 24, 2020 at 2:39 PM Gokul Ramanan Subramanian <gokul24...@gmail.com> wrote:

> Hi Tom.
>
> With KIP-578, we are not trying to model the load on each partition and come up with an exact limit on what the cluster or broker can handle in terms of number of partitions. We understand that not all partitions are equal, and that the actual load per partition varies based on the message size, the throughput, whether the broker is a leader for that partition or not, etc.
>
> What we are trying to achieve with KIP-578 is to disallow a pathological number of partitions that will surely put the cluster in bad shape. For example, in KIP-578's appendix, we have described a case where we could not delete a topic with 30k partitions, because the brokers could not handle all the work that needed to be done. We have also described how a producer performance test with 10k partitions observed essentially zero throughput. In these cases, having a limit on the number of partitions would allow the cluster to produce a graceful error message at topic creation time, and prevent the cluster from entering a pathological state. These are not just hypotheticals; we definitely see many of these pathological cases happen in production, and we would like to avoid them.
>
> The actual limit on the number of partitions is something we do not want to suggest in the KIP. The limit will depend on various tests that cluster owners will have to perform, including perf tests, measuring topic creation / deletion times, etc. For example, the tests we did for the KIP-578 appendix were enough to convince us that we should not have anywhere close to 10k partitions on the setup we describe there.
>
> What we want to do with KIP-578 is provide the flexibility to set a limit on the number of partitions based on tests cluster owners choose to perform. Cluster owners can do the tests however often they wish and dynamically adjust the limit on the number of partitions (see the sketch below). For example, we found in our production environment that we don't want more than 1k partitions on an m5.large EC2 broker instance, or more than 300 partitions on a t3.medium EC2 broker, for typical produce / consume use cases.
>
> Cluster owners are free to not configure the limit on the number of partitions if they don't want to spend the time coming up with a limit. The limit defaults to INT32_MAX, which is effectively infinite in this context, so the default should be practically backwards compatible with current behavior.
>
> Further, the limit on the number of partitions should not come in the way of rebalancing tools under normal operation. For example, if the partition limit per broker is set to 1k, then unless the number of partitions comes close to 1k, there should be no impact on rebalancing tools.
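As a sketch of what such a dynamic adjustment could look like through the Admin API, assuming the max.broker.partitions name proposed in KIP-578 and assuming it ships as a dynamic cluster-wide config (it does not exist in released Kafka), something like the following would update the limit without restarting brokers:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class AdjustPartitionLimit {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // An empty resource name targets the cluster-wide dynamic default
                // for all brokers, so the change applies without a restart.
                ConfigResource cluster = new ConfigResource(ConfigResource.Type.BROKER, "");
                AlterConfigOp setLimit = new AlterConfigOp(
                    // max.broker.partitions is the config proposed in KIP-578;
                    // it is an assumption of this sketch, not a released config.
                    new ConfigEntry("max.broker.partitions", "1000"),
                    AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Map.of(cluster, List.of(setLimit))).all().get();
            }
        }
    }

The same change could be made with kafka-configs.sh using --entity-type brokers --entity-default, which is the usual route for dynamic cluster-wide defaults.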
> Only when the number of partitions comes close to 1k will rebalancing tools be impacted, and at that point the cluster is already at its limit of functioning (per whatever definition was used to set the limit in the first place).
>
> Finally, I want to end this long email by noting that the partition assignment algorithm itself does not consider the load on various partitions before assigning partitions to brokers. In other words, it treats all partitions as equal. The idea of having a limit on the number of partitions is not misaligned with this tenet.
>
> Thanks.
>
> On Tue, Apr 21, 2020 at 9:39 AM Tom Bentley <tbent...@redhat.com> wrote:
>
> > Hi Gokul,
> >
> > > the partition assignment algorithm needs to be aware of the partition limits.
> >
> > I agree, if you have limits then anything doing reassignment would need some way of knowing what they were. But the thing is that I'm not really sure how you would decide what the limits ought to be.
> >
> > > To illustrate this, imagine that you have 3 brokers (1, 2 and 3), with 10, 20 and 30 partitions each respectively, and a limit of 40 partitions on each broker enforced via a configurable policy class (the one you recommended). While the policy class may accept a topic creation request for 11 partitions with a replication factor of 2 each (because it is satisfiable), the non-pluggable partition assignment algorithm (in AdminUtils.assignReplicasToBrokers and a few other places) has to know not to assign the 11th partition to broker 3 because it would run out of partition capacity otherwise.
> >
> > I know this is only a toy example, but I think it also serves to illustrate my point above. How has a limit of 40 partitions been arrived at? In real life, different partitions will impart a different load on a broker, depending on all sorts of factors (which topics they're for, the throughput and message size for those topics, etc.). By saying that a broker should not have more than 40 partitions assigned, I think you're making a big assumption that all partitions have the same weight. You're also limiting the search space for finding an acceptable assignment. Cluster balancers usually use some kind of heuristic optimisation algorithm for figuring out assignments of partitions to brokers, and it could be that the best (or at least a good enough) solution requires assigning the 41 least loaded partitions to one broker.
> >
> > The point I'm trying to make here is that whatever limit is chosen, it's probably been chosen fairly arbitrarily. Even if it's been chosen based on some empirical evidence of how a particular cluster behaves, it's likely that that evidence will become obsolete as the cluster evolves to serve the needs of the business running it (e.g. some hot topic gets repartitioned, messages get compressed with some new algorithm, some new topics need to be created). For this reason I think the problem you're trying to solve via policy (whether that was implemented in a pluggable way or not) is really better solved by automating the cluster balancing and having that cluster balancer be able to reason about when the cluster has too few brokers for the number of partitions, rather than placing some limit on the sizing and shape of the cluster up front and then hobbling the cluster balancer to work within that.
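To make the toy example above concrete, here is a minimal sketch of the capacity check Gokul argues must live inside the assignment logic itself. This is an illustrative round-robin placement, not Kafka's actual AdminUtils.assignReplicasToBrokers, and it ignores rack awareness:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class CapacityAwareAssignment {
        // Round-robin replica placement that refuses to exceed a per-broker
        // replica limit. Illustrative only; not Kafka's real assignment code.
        public static Map<Integer, List<Integer>> assign(Map<Integer, Integer> replicaCounts,
                                                         int maxBrokerPartitions,
                                                         int numPartitions,
                                                         int replicationFactor) {
            List<Integer> brokers = new ArrayList<>(replicaCounts.keySet());
            Map<Integer, List<Integer>> assignment = new HashMap<>();
            int cursor = 0;
            for (int partition = 0; partition < numPartitions; partition++) {
                List<Integer> replicas = new ArrayList<>();
                int attempts = 0;
                while (replicas.size() < replicationFactor) {
                    // Give up once we have cycled all brokers without progress.
                    if (attempts++ > 2 * brokers.size()) {
                        throw new IllegalStateException("not enough partition capacity left");
                    }
                    int broker = brokers.get(cursor++ % brokers.size());
                    // The capacity check that a limit-aware assignment needs:
                    if (!replicas.contains(broker) && replicaCounts.get(broker) < maxBrokerPartitions) {
                        replicas.add(broker);
                        replicaCounts.merge(broker, 1, Integer::sum);
                    }
                }
                assignment.put(partition, replicas);
            }
            return assignment;
        }
    }

With brokers 1, 2 and 3 holding 10, 20 and 30 replicas and a limit of 40, assign(counts, 40, 11, 2) places all 22 new replicas and leaves every broker under the cap; asked for more than the remaining capacity, it fails fast with a clear error instead of silently overloading a broker.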
> > I think it might be useful to describe in the KIP how users would be expected to arrive at values for these configs (both on day 1 and in an evolving production cluster), when this solution might be better than using a cluster balancer, and/or why cluster balancers can't be trusted to avoid overloading brokers.
> >
> > Kind regards,
> >
> > Tom
> >
> > On Mon, Apr 20, 2020 at 7:22 PM Gokul Ramanan Subramanian <gokul24...@gmail.com> wrote:
> >
> > > This is a good reference, Tom. I did not consider this approach at all. I am happy to learn about it now.
> > >
> > > However, I think that partition limits are not "yet another" policy configuration. Instead, they are fundamental to partition assignment, i.e. the partition assignment algorithm needs to be aware of the partition limits. To illustrate this, imagine that you have 3 brokers (1, 2 and 3), with 10, 20 and 30 partitions each respectively, and a limit of 40 partitions on each broker enforced via a configurable policy class (the one you recommended). While the policy class may accept a topic creation request for 11 partitions with a replication factor of 2 each (because it is satisfiable), the non-pluggable partition assignment algorithm (in AdminUtils.assignReplicasToBrokers and a few other places) has to know not to assign the 11th partition to broker 3 because it would run out of partition capacity otherwise.
> > >
> > > To achieve the ideal end that you are imagining (and I can totally understand where you are coming from vis-a-vis the extensibility of your solution compared to the one in the KIP), we would need to extract the partition assignment logic itself into a pluggable class, for which we could then provide custom implementations. I am afraid that would add complexity that I am not sure we want to undertake.
> > >
> > > Do you see sense in what I am saying?
> > >
> > > Thanks.
> > >
> > > On Mon, Apr 20, 2020 at 3:59 PM Tom Bentley <tbent...@redhat.com> wrote:
> > >
> > > > Hi Gokul,
> > > >
> > > > Leaving aside the question of how Kafka scales, I think the proposed solution, limiting the number of partitions in a cluster or per-broker, is a policy which ought to be addressable via the pluggable policies (e.g. create.topic.policy.class.name). Unfortunately, although there's a policy for topic creation, it's currently not possible to enforce a policy on partition increase. It would be more flexible to be able to enforce this kind of thing via a pluggable policy, and it would also avoid the situation where different people each want a config which addresses some specific use case or problem that they're experiencing.
> > > >
> > > > Quite a while ago I proposed KIP-201 to solve this issue with policies being easily circumvented, but it didn't really make any progress. I've looked at it again in some detail more recently, and I think something might be possible following the work to make all ZK writes happen on the controller.
> > > >
> > > > Of course, this is just my take on it.
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
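For concreteness, here is a minimal sketch of the pluggable-policy route Tom describes, using the existing org.apache.kafka.server.policy.CreateTopicPolicy interface; the config key and the limit value are made up for illustration:

    import java.util.Map;
    import org.apache.kafka.common.errors.PolicyViolationException;
    import org.apache.kafka.server.policy.CreateTopicPolicy;

    // Illustrative policy: reject topics that request too many partitions.
    // Enabled by setting create.topic.policy.class.name to this class.
    public class MaxPartitionsCreatePolicy implements CreateTopicPolicy {
        private int maxPartitionsPerTopic = Integer.MAX_VALUE;

        @Override
        public void configure(Map<String, ?> configs) {
            // "policy.max.partitions.per.topic" is a made-up key for this sketch.
            Object limit = configs.get("policy.max.partitions.per.topic");
            if (limit != null) {
                maxPartitionsPerTopic = Integer.parseInt(limit.toString());
            }
        }

        @Override
        public void validate(RequestMetadata request) throws PolicyViolationException {
            // numPartitions() may be null when explicit replica assignments are given,
            // in which case the assignment map's size is the partition count.
            Integer requested = request.numPartitions() != null
                    ? request.numPartitions()
                    : (request.replicasAssignments() != null
                            ? request.replicasAssignments().size() : null);
            if (requested != null && requested > maxPartitionsPerTopic) {
                throw new PolicyViolationException("Topic " + request.topic() + " requests "
                        + requested + " partitions; the limit is " + maxPartitionsPerTopic);
            }
        }

        @Override
        public void close() {}
    }

One caveat that supports Gokul's point: validate() only sees the single request, so out of the box a policy has no view of existing cluster-wide or per-broker replica counts; enforcing that kind of limit from here would require the policy to fetch that state itself.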
> > > > On Thu, Apr 16, 2020 at 11:47 AM Gokul Ramanan Subramanian <gokul24...@gmail.com> wrote:
> > > >
> > > > > Hi.
> > > > >
> > > > > For the sake of expediting the discussion, I have created a prototype PR: https://github.com/apache/kafka/pull/8499. Eventually, (if and) when the KIP is accepted, I'll modify this to add the full implementation and tests etc. in there.
> > > > >
> > > > > I would appreciate it if a Kafka committer could share their thoughts, so that I can more confidently start the voting thread.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On Thu, Apr 16, 2020 at 11:30 AM Gokul Ramanan Subramanian <gokul24...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for your comments, Alex.
> > > > > >
> > > > > > The KIP proposes using two configurations, max.partitions and max.broker.partitions. It does not enforce their use. The default values are pretty large (INT32_MAX), and should therefore be non-intrusive.
> > > > > >
> > > > > > In multi-tenant environments, and for partition assignment and rebalancing, the admin could (a) use the default values, which would yield similar behavior to now, (b) set very high values that they know are sufficient, or (c) dynamically re-adjust the values should the business requirements change. Note that the two configurations are cluster-wide, so they can be updated without restarting the brokers.
> > > > > >
> > > > > > The quota system in Kafka seems to be geared towards limiting traffic for specific clients or users, or in the case of replication, for leaders and followers. The quota configuration itself is very similar to the one introduced in this KIP, i.e. just a few configuration options to specify the quota. The main difference is that the quota system is far more heavy-weight, because it needs to be applied to traffic that is flowing in and out constantly, whereas in this KIP we want to limit the number of partition replicas, which by comparison changes rarely in a typical cluster.
> > > > > >
> > > > > > Hope this addresses your comments.
> > > > > >
> > > > > > On Thu, Apr 9, 2020 at 12:53 PM Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:
> > > > > >
> > > > > >> Hi Gokul,
> > > > > >>
> > > > > >> Thanks for the KIP.
> > > > > >>
> > > > > >> From what I understand, the objective of the new configuration is to protect a cluster from an overload driven by an excessive number of partitions, independently of the load handled on the partitions themselves. As such, the approach decouples the data-path load from the number of units of distribution of throughput, and intends to avoid the degradation of performance exhibited in the test results provided with the KIP by setting an upper bound on that number.
> > > > > >>
> > > > > >> Couple of comments:
> > > > > >>
> > > > > >> 900. Multi-tenancy - one concern I would have with a cluster- and broker-level configuration is that it is possible for a user to consume a large proportion of the allocatable partitions within the configured limit, leaving other users with not enough partitions to satisfy their requirements.
> > > > > >> 901. Quotas - an approach in Apache Kafka to set up an upper bound on resource consumption is via client/user quotas. Could this framework be leveraged to add this limit?
> > > > > >>
> > > > > >> 902. Partition assignment - one potential problem with the new repartitioning scheme is that if a subset of brokers has reached its number of assignable partitions, yet their data path is under-loaded, new topics and/or partitions will be assigned exclusively to other brokers, which could increase the likelihood of data-path load imbalance. Fundamentally, the isolation of the constraint on the number of partitions from the data-path throughput can have conflicting requirements.
> > > > > >>
> > > > > >> 903. Rebalancing - as a corollary to 902, external tools used to balance ingress throughput may adopt an incremental approach to partition re-assignment to redistribute load, and could hit the limit on the number of partitions on a broker when a (too) conservative limit is used, thereby over-constraining the objective function and reducing the migration path.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Alexandre
> > > > > >>
> > > > > >> On Thu, Apr 9, 2020 at 00:19, Gokul Ramanan Subramanian <gokul24...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > Hi. Requesting you to take a look at this KIP and provide feedback.
> > > > > >> >
> > > > > >> > Thanks. Regards.
> > > > > >> >
> > > > > >> > On Wed, Apr 1, 2020 at 4:28 PM Gokul Ramanan Subramanian <gokul24...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > > Hi.
> > > > > >> > >
> > > > > >> > > I have opened KIP-578, intended to provide a mechanism to limit the number of partitions in a Kafka cluster. Kindly provide feedback on the KIP, which you can find at
> > > > > >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-578%3A+Add+configuration+to+limit+number+of+partitions
> > > > > >> > >
> > > > > >> > > I want to especially thank Stanislav Kozlovski, who helped in formulating some aspects of the KIP.
> > > > > >> > >
> > > > > >> > > Many thanks,
> > > > > >> > >
> > > > > >> > > Gokul.

--
Best,
Stanislav
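A closing practical note: until a built-in limit like the one proposed in KIP-578 exists, cluster owners can audit per-broker replica counts against a limit of their own choosing with the Admin API. A minimal sketch follows; the 1000 cap is purely illustrative:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class PartitionAudit {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            int illustrativeLimit = 1000; // per-broker replica limit to audit against
            try (Admin admin = Admin.create(props)) {
                Set<String> topics = admin.listTopics().names().get();
                Map<Integer, Integer> replicasPerBroker = new HashMap<>();
                // Count every partition replica hosted by each broker.
                for (TopicDescription td : admin.describeTopics(topics).all().get().values()) {
                    td.partitions().forEach(p ->
                        p.replicas().forEach(node ->
                            replicasPerBroker.merge(node.id(), 1, Integer::sum)));
                }
                replicasPerBroker.forEach((broker, count) ->
                    System.out.printf("broker %d: %d replicas%s%n", broker, count,
                        count > illustrativeLimit ? " (over limit)" : ""));
            }
        }
    }

Unlike the limits proposed in the KIP, such an audit is advisory only; it detects an overloaded broker after the fact rather than rejecting the topic creation that caused it.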