rondagostino opened a new pull request, #13280:
URL: https://github.com/apache/kafka/pull/13280
…topic counts

Performance of KRaft metadata image changes is currently O(<# of topics in cluster>). This means the amount of time it takes to create just a single topic scales linearly with the number of topics in the entire cluster. This impacts both controllers and brokers, because both use the metadata image to represent the KRaft metadata log. The performance of these changes should instead scale with the number of topics being changed: creating a single topic should perform similarly regardless of the number of topics in the cluster.

This patch introduces a dependency on the [Paguro](https://github.com/GlenKPeterson/Paguro/) library for immutable/persistent collection support in Java and leverages persistent data structures to avoid copying the entire `TopicsImage` upon every change. We chose this library because it is relatively small and [well-integrated](https://github.com/GlenKPeterson/Paguro/blob/main/inheritanceHierarchy.pdf) with the existing Java Collections class hierarchy; the latter property especially helps to minimize the changes required to introduce the library into the existing code base.
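To make the difference concrete, here is a minimal sketch (not the patch's actual code) contrasting a copy-on-write update with a persistent-map update. The `String`-keyed map standing in for `TopicsImage` is hypothetical; the only Paguro APIs assumed are `PersistentHashMap.empty()` and `ImMap.assoc()`:

```java
import java.util.HashMap;
import java.util.Map;

import org.organicdesign.fp.collections.ImMap;
import org.organicdesign.fp.collections.PersistentHashMap;

public class TopicsImageSketch {

    // Copy-on-write, as in the current code: every change copies the whole
    // map, so creating a single topic is O(<# of topics in cluster>).
    static Map<String, String> withTopicCopying(Map<String, String> topics,
                                                String topicId, String name) {
        Map<String, String> copy = new HashMap<>(topics); // O(N) copy
        copy.put(topicId, name);
        return copy;
    }

    // Persistent update: assoc() returns a new map that shares almost all of
    // its structure with the old one; only the root node and the nodes along
    // the path to the changed entry are replaced.
    static ImMap<String, String> withTopicPersistent(ImMap<String, String> topics,
                                                     String topicId, String name) {
        return topics.assoc(topicId, name);
    }

    public static void main(String[] args) {
        ImMap<String, String> v1 = PersistentHashMap.empty();
        ImMap<String, String> v2 = withTopicPersistent(v1, "id-1", "topic-1");
        // v1 is untouched; v2 shares structure with it. Once the reference to
        // v1 is released, only the swapped-out nodes become garbage.
        System.out.println(v1.size() + " " + v2.size()); // prints: 0 1
    }
}
```

Paguro's `ImMap` also implements `java.util.Map` (with mutators disabled), which is the "well-integrated with the Java Collections class hierarchy" property that keeps the diff to existing code paths small.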
The patch also adds the following JMH benchmarks demonstrating the resulting performance changes (a sketch of the general harness shape follows the results below):

- `TopicsImageSingleRecordChangeBenchmark` tracks how long it takes to create a new topic. This is the benchmark that most clearly identifies the O(N) behavior in the existing code and most dramatically illustrates the performance improvement. As shown below, the existing code takes several orders of magnitude longer to make a single change than the new code: with 12,500 topics it took 1.4 milliseconds on my laptop, and it grows more or less linearly as the number of topics grows. The new code took a constant amount of time (~250 nanoseconds) regardless of the number of topics in the cluster. The reason for the improvement is that it is inexpensive to add, update, or delete an entry in an immutable, persistent map to create a new persistent map: the new map shares the vast majority of the old map's structure; only the root node and the nodes along the path to the changed entry are swapped out, and when the reference to the old map is released the swapped-out nodes can be garbage-collected.

  **Current Code, unpatched**

  Total Topic Count | nanoseconds/op | error
  -- | -- | --
  12,500 | 1,410,901 | 153,461
  25,000 | 3,570,451 | 221,992
  50,000 | 14,143,125 | 1,123,616
  100,000 | 31,126,930 | 4,243,625

  **Updated Code**

  Total Topic Count | nanoseconds/op | error
  -- | -- | --
  12,500 | 258 | 13
  25,000 | 265 | 8
  50,000 | 273 | 5
  100,000 | 222 | 4

- `TopicsImageZonalOutageBenchmark` simulates a zonal outage in which every broker in the zone loses its session; in this benchmark we assume the controller deals with the brokers one by one rather than demoting a third of the cluster all at once. Since the number of topics per broker does not change very much, we expect O(N) behavior with the current code but not with the updated code, so we should see a performance improvement here as well, and in fact we do. The existing code scales with the number of topics in the cluster, so the time roughly doubles as the cluster size doubles, increasing from 5 ms to 47 ms (a factor of 9) as the cluster scales by a factor of 8. The updated code should scale with the number of affected topics, which in this case, based on the (debatable) chosen values of 10,000 replicas per broker and 10 partitions per topic, means a factor of 1.6 (from 4,167 topics affected to 6,667 topics affected) as the cluster scales by a factor of 8. In fact we see the time spent increasing by a factor of 2.6 (from 4.4 ms to 11.6 ms) when the cluster scaled by that factor of 8. This is a bit higher than expected, but it is still sub-linear (and there is some +/- error in these numbers, so the sub-linear behavior is the real point rather than the specific number).

  **Current Code, unpatched**

  Total Topic Count | milliseconds/op | error | Brokers Impacted | Topics Impacted
  -- | -- | -- | -- | --
  12,500 | 5.2 | 0.4 | 1/36 | 4,167
  25,000 | 10.6 | 0.1 | 1/75 | 5,000
  50,000 | 21.7 | 0.4 | 1/150 | 6,667
  100,000 | 47.7 | 5.2 | 1/300 | 6,667

  **Updated Code**

  Total Topic Count | milliseconds/op | error | Brokers Impacted | Topics Impacted
  -- | -- | -- | -- | --
  12,500 | 4.4 | 0.2 | 1/36 | 4,167
  25,000 | 6.9 | 0.2 | 1/75 | 5,000
  50,000 | 10.2 | 2.5 | 1/150 | 6,667
  100,000 | 11.6 | 2.8 | 1/300 | 6,667

- `TopicsImageSnapshotLoadBenchmark` simulates loading a snapshot when the broker starts, i.e. loading 100,000 topics/1M partitions from scratch and committing them all at once. We would expect to see some performance degradation here in the updated code; the question really was by how much. This is the benchmark that simulates the case where every topic is affected, e.g. when we load the initial snapshot during startup. We expect the persistent data structures to perform worse here because every perturbation that creates a new tree implies replacing some path in the old tree and sharing the remainder. That's a lot of replacing and a lot of garbage collecting. It turns out to be a 20%-40% penalty. This happens far less frequently than the scenarios described above (people create topics and ISRs change more frequently than the brokers roll), so the performance degradation here seems like a reasonable penalty to accept.

  **Current Code, unpatched**

  Total Topic Count | milliseconds/op | error
  -- | -- | --
  12,500 | 9.2 | 0.2
  25,000 | 20.7 | 1.9
  50,000 | 41.6 | 2.9
  100,000 | 110.8 | 12.4

  **Updated Code**

  Total Topic Count | milliseconds/op | error
  -- | -- | --
  12,500 | 13.7 | 1.3
  25,000 | 28.9 | 1.4
  50,000 | 67.6 | 1.6
  100,000 | 126.0 | 12.9

- `KRaftMetadataRequestBenchmark` is a version of an existing set of benchmarks that we have for ZooKeeper-based brokers. It demonstrates a *potential* slowdown in read performance with the new code of between 5% and 10% when requesting metadata for all topics, but the evidence is not particularly strong, as some cases actually perform better. Overall there doesn't seem to be anything here that strongly cautions against this change.

**KRaftMetadataRequestBenchmark.testMetadataRequestForAllTopics**
**Current Code, unpatched**
```
(partitionCount)  (topicCount)  Mode  Cnt           Score         Error  Units
              10           500  avgt   15   1,372,555.544 ±   15168.485  ns/op
              10          1000  avgt   15   2,726,198.798 ±   31911.035  ns/op
              10          5000  avgt   15  41,553,723.361 ± 1394092.903  ns/op
              20           500  avgt   15   2,373,810.148 ±   28320.684  ns/op
              20          1000  avgt   15   5,077,005.645 ±  344757.315  ns/op
              20          5000  avgt   15  50,903,118.952 ±  842824.639  ns/op
              50           500  avgt   15   5,592,480.092 ±   38800.995  ns/op
              50          1000  avgt   15  11,278,004.176 ±  140577.692  ns/op
              50          5000  avgt   15  97,650,987.593 ± 2605902.517  ns/op
```

**KRaftMetadataRequestBenchmark.testMetadataRequestForAllTopics**
**Updated Code**
```
(partitionCount)  (topicCount)  Mode  Cnt           Score         Error  Units
              10           500  avgt   15   1,437,036.379 ±   10187.868  ns/op
              10          1000  avgt   15   2,957,572.571 ±   39259.772  ns/op
              10          5000  avgt   15  40,949,310.393 ±  354110.400  ns/op
              20           500  avgt   15   2,493,094.563 ±   18131.215  ns/op
              20          1000  avgt   15   5,224,699.766 ±  198612.245  ns/op
              20          5000  avgt   15  55,643,648.154 ±  935800.708  ns/op
              50           500  avgt   15   5,731,310.891 ±  289599.505  ns/op
              50          1000  avgt   15  11,708,291.589 ±   63063.128  ns/op
              50          5000  avgt   15  94,717,768.691 ± 1248511.062  ns/op
```
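For reference, these harnesses follow the standard JMH shape. The sketch below is illustrative only (the class name, the `Integer`-keyed stand-in for the image, and the `totalTopicCount` parameter are placeholders, not the patch's actual benchmark code); it shows how a parameterized single-record-change measurement can be structured:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.organicdesign.fp.collections.ImMap;
import org.organicdesign.fp.collections.PersistentHashMap;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class TopicsImageSingleRecordChangeSketch {

    // Mirrors the "Total Topic Count" column in the tables above.
    @Param({"12500", "25000", "50000", "100000"})
    public int totalTopicCount;

    // Hypothetical stand-in for the pre-built TopicsImage.
    private ImMap<Integer, String> image;

    @Setup(Level.Trial)
    public void setup() {
        ImMap<Integer, String> m = PersistentHashMap.empty();
        for (int i = 0; i < totalTopicCount; i++) {
            m = m.assoc(i, "topic-" + i);
        }
        image = m;
    }

    @Benchmark
    public Object createOneTopic() {
        // The measured operation: one single-topic change against an image of
        // totalTopicCount topics. Returning the result keeps JMH from
        // dead-code-eliminating it. With a persistent map this cost should be
        // flat in totalTopicCount; with a full copy it grows linearly.
        return image.assoc(totalTopicCount, "new-topic");
    }
}
```

Benchmarks in this module can be run with the jmh-benchmarks wrapper, e.g. `./jmh-benchmarks/jmh.sh TopicsImageSingleRecordChangeBenchmark`.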
There are two other benchmarks in the set that seem to perform comparably with and without the patch.

**KRaftMetadataRequestBenchmark.testRequestToJson**
**Current Code, unpatched**
```
(partitionCount)  (topicCount)  Mode  Cnt    Score    Error  Units
              10           500  avgt   15  674.874 ± 19.315  ns/op
              10          1000  avgt   15  643.048 ±  4.638  ns/op
              10          5000  avgt   15  696.829 ± 23.999  ns/op
              20           500  avgt   15  672.617 ±  4.812  ns/op
              20          1000  avgt   15  674.492 ±  6.206  ns/op
              20          5000  avgt   15  677.546 ±  2.301  ns/op
              50           500  avgt   15  682.702 ±  3.841  ns/op
              50          1000  avgt   15  634.786 ±  6.009  ns/op
              50          5000  avgt   15  678.107 ±  8.479  ns/op
```

**KRaftMetadataRequestBenchmark.testRequestToJson**
**Updated Code**
```
(partitionCount)  (topicCount)  Mode  Cnt    Score    Error  Units
              10           500  avgt   15  678.702 ±  3.331  ns/op
              10          1000  avgt   15  659.232 ±  3.126  ns/op
              10          5000  avgt   15  678.725 ±  5.893  ns/op
              20           500  avgt   15  666.064 ±  2.042  ns/op
              20          1000  avgt   15  670.959 ±  2.950  ns/op
              20          5000  avgt   15  670.517 ±  2.473  ns/op
              50           500  avgt   15  672.154 ±  7.125  ns/op
              50          1000  avgt   15  665.008 ±  2.272  ns/op
              50          5000  avgt   15  669.210 ± 27.191  ns/op
```

**KRaftMetadataRequestBenchmark.testTopicIdInfo**
**Current Code, unpatched**
```
(partitionCount)  (topicCount)  Mode  Cnt   Score   Error  Units
              10           500  avgt   15  10.760 ± 0.078  ns/op
              10          1000  avgt   15  11.118 ± 0.281  ns/op
              10          5000  avgt   15  10.882 ± 0.192  ns/op
              20           500  avgt   15  10.822 ± 0.121  ns/op
              20          1000  avgt   15  10.966 ± 0.569  ns/op
              20          5000  avgt   15  10.764 ± 0.120  ns/op
              50           500  avgt   15  10.823 ± 0.081  ns/op
              50          1000  avgt   15  10.755 ± 0.154  ns/op
              50          5000  avgt   15  10.694 ± 0.068  ns/op
```

**KRaftMetadataRequestBenchmark.testTopicIdInfo**
**Updated Code**
```
(partitionCount)  (topicCount)  Mode  Cnt   Score   Error  Units
              10           500  avgt   15  10.493 ± 0.056  ns/op
              10          1000  avgt   15  10.507 ± 0.059  ns/op
              10          5000  avgt   15  10.455 ± 0.055  ns/op
              20           500  avgt   15  10.424 ± 0.035  ns/op
              20          1000  avgt   15  10.476 ± 0.139  ns/op
              20          5000  avgt   15  10.454 ± 0.079  ns/op
              50           500  avgt   15  10.891 ± 0.152  ns/op
              50          1000  avgt   15  10.509 ± 0.073  ns/op
              50          5000  avgt   15  10.479 ± 0.059  ns/op
```

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)