rondagostino opened a new pull request, #13280:
URL: https://github.com/apache/kafka/pull/13280
…topic counts

Performance of KRaft metadata image changes is currently O(<# of topics in cluster>). This means the amount of time it takes to create just a single topic scales linearly with the number of topics in the entire cluster. This impacts both controllers and brokers, because both use the metadata image to represent the KRaft metadata log. The performance of these changes should instead scale with the number of topics being changed: creating a single topic should perform similarly regardless of the number of topics in the cluster.

This patch introduces a dependency on the [Paguro](https://github.com/GlenKPeterson/Paguro/) library for immutable/persistent collection support in Java and leverages persistent data structures to avoid copying the entire `TopicsImage` upon every change. We chose this library because it is relatively small and [well-integrated](https://github.com/GlenKPeterson/Paguro/blob/main/inheritanceHierarchy.pdf) with the existing Java Collections class hierarchy; the latter property especially helps to minimize the changes required to introduce the library into the existing code base.
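To make the difference concrete, here is a minimal sketch (not the patch's actual code) contrasting a copy-on-write update with a persistent-map update. The `String`-keyed map standing in for `TopicsImage` is hypothetical; the only Paguro APIs assumed are `PersistentHashMap.empty()` and `ImMap.assoc()`:

```java
import java.util.HashMap;
import java.util.Map;

import org.organicdesign.fp.collections.ImMap;
import org.organicdesign.fp.collections.PersistentHashMap;

public class TopicsImageSketch {

    // Copy-on-write, as in the current code: every change copies the whole
    // map, so creating a single topic is O(<# of topics in cluster>).
    static Map<String, String> withTopicCopying(Map<String, String> topics,
                                                String topicId, String name) {
        Map<String, String> copy = new HashMap<>(topics); // O(N) copy
        copy.put(topicId, name);
        return copy;
    }

    // Persistent update: assoc() returns a new map that shares almost all of
    // its structure with the old one; only the root node and the nodes along
    // the path to the changed entry are replaced.
    static ImMap<String, String> withTopicPersistent(ImMap<String, String> topics,
                                                     String topicId, String name) {
        return topics.assoc(topicId, name);
    }

    public static void main(String[] args) {
        ImMap<String, String> v1 = PersistentHashMap.empty();
        ImMap<String, String> v2 = withTopicPersistent(v1, "id-1", "topic-1");
        // v1 is untouched; v2 shares structure with it. Once the reference to
        // v1 is released, only the swapped-out nodes become garbage.
        System.out.println(v1.size() + " " + v2.size()); // prints: 0 1
    }
}
```

Paguro's `ImMap` also implements `java.util.Map` (with mutators disabled), which is the "well-integrated with the Java Collections class hierarchy" property that keeps the diff to existing code paths small.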
The patch also adds the following JMH benchmarks demonstrating the resulting performance changes (a sketch of the general harness shape follows the results below):

- `TopicsImageSingleRecordChangeBenchmark` tracks how long it takes to create a new topic. This is the benchmark that most clearly identifies the O(N) behavior in the existing code and most dramatically illustrates the performance improvement. As shown below, the existing code takes several orders of magnitude longer to make a single change than the new code: with 12,500 topics it took 1.4 milliseconds on my laptop, and it grows more or less linearly as the number of topics grows. The new code took a constant amount of time (~250 nanoseconds) regardless of the number of topics in the cluster. The reason for the improvement is that it is inexpensive to add, update, or delete an entry in an immutable, persistent map to create a new persistent map: the new map shares the vast majority of the old map's structure; only the root node and the nodes along the path to the changed entry are swapped out, and when the reference to the old map is released the swapped-out nodes can be garbage-collected.

  **Current Code, unpatched**

  Total Topic Count | nanoseconds/op | error
  -- | -- | --
  12,500 | 1,410,901 | 153,461
  25,000 | 3,570,451 | 221,992
  50,000 | 14,143,125 | 1,123,616
  100,000 | 31,126,930 | 4,243,625

  **Updated Code**

  Total Topic Count | nanoseconds/op | error
  -- | -- | --
  12,500 | 258 | 13
  25,000 | 265 | 8
  50,000 | 273 | 5
  100,000 | 222 | 4

- `TopicsImageZonalOutageBenchmark` simulates a zonal outage in which every broker in the zone loses its session; in this benchmark we assume the controller deals with the brokers one by one rather than demoting a third of the cluster all at once. Since the number of topics per broker does not change very much, we expect O(N) behavior with the current code but not with the updated code, so we should see a performance improvement here as well, and in fact we do. The existing code scales with the number of topics in the cluster, so the time roughly doubles as the cluster size doubles, increasing from 5 ms to 47 ms (a factor of 9) as the cluster scales by a factor of 8. The updated code should scale with the number of affected topics, which in this case, based on the (debatable) chosen values of 10,000 replicas per broker and 10 partitions per topic, means a factor of 1.6 (from 4,167 topics affected to 6,667 topics affected) as the cluster scales by a factor of 8. In fact we see the time spent increasing by a factor of 2.6 (from 4.4 ms to 11.6 ms) when the cluster scaled by that factor of 8. This is a bit higher than expected, but it is still sub-linear (and there is some +/- error in these numbers, so the sub-linear behavior is the real point rather than the specific number).

  **Current Code, unpatched**

  Total Topic Count | milliseconds/op | error | Brokers Impacted | Topics Impacted
  -- | -- | -- | -- | --
  12,500 | 5.2 | 0.4 | 1/36 | 4,167
  25,000 | 10.6 | 0.1 | 1/75 | 5,000
  50,000 | 21.7 | 0.4 | 1/150 | 6,667
  100,000 | 47.7 | 5.2 | 1/300 | 6,667

  **Updated Code**

  Total Topic Count | milliseconds/op | error | Brokers Impacted | Topics Impacted
  -- | -- | -- | -- | --
  12,500 | 4.4 | 0.2 | 1/36 | 4,167
  25,000 | 6.9 | 0.2 | 1/75 | 5,000
  50,000 | 10.2 | 2.5 | 1/150 | 6,667
  100,000 | 11.6 | 2.8 | 1/300 | 6,667

- `TopicsImageSnapshotLoadBenchmark` simulates loading a snapshot when the broker starts, i.e. loading 100,000 topics/1M partitions from scratch and committing them all at once. We would expect to see some performance degradation here in the updated code; the question really was by how much. This is the benchmark that simulates the case where every topic is affected, e.g. when we load the initial snapshot during startup. We expect the persistent data structures to perform worse here because every perturbation that creates a new tree implies replacing some path in the old tree and sharing the remainder. That's a lot of replacing and a lot of garbage collecting. It turns out to be a 20%-40% penalty. This happens far less frequently than the scenarios described above (people create topics and ISRs change more frequently than the brokers roll), so the performance degradation here seems like a reasonable penalty to accept.

  **Current Code, unpatched**

  Total Topic Count | milliseconds/op | error
  -- | -- | --
  12,500 | 9.2 | 0.2
  25,000 | 20.7 | 1.9
  50,000 | 41.6 | 2.9
  100,000 | 110.8 | 12.4

  **Updated Code**

  Total Topic Count | milliseconds/op | error
  -- | -- | --
  12,500 | 13.7 | 1.3
  25,000 | 28.9 | 1.4
  50,000 | 67.6 | 1.6
  100,000 | 126.0 | 12.9

- `KRaftMetadataRequestBenchmark` is a version of an existing set of benchmarks that we have for ZooKeeper-based brokers. It demonstrates a *potential* slowdown in read performance with the new code of between 5% and 10% when requesting metadata for all topics, but the evidence is not particularly strong, as some cases actually perform better. Overall there doesn't seem to be anything here that strongly cautions against this change.

**KRaftMetadataRequestBenchmark.testMetadataRequestForAllTopics**
**Current Code, unpatched**
```
(partitionCount)  (topicCount)  Mode  Cnt           Score         Error  Units
              10           500  avgt   15   1,372,555.544 ±   15168.485  ns/op
              10          1000  avgt   15   2,726,198.798 ±   31911.035  ns/op
              10          5000  avgt   15  41,553,723.361 ± 1394092.903  ns/op
              20           500  avgt   15   2,373,810.148 ±   28320.684  ns/op
              20          1000  avgt   15   5,077,005.645 ±  344757.315  ns/op
              20          5000  avgt   15  50,903,118.952 ±  842824.639  ns/op
              50           500  avgt   15   5,592,480.092 ±   38800.995  ns/op
              50          1000  avgt   15  11,278,004.176 ±  140577.692  ns/op
              50          5000  avgt   15  97,650,987.593 ± 2605902.517  ns/op
```

**KRaftMetadataRequestBenchmark.testMetadataRequestForAllTopics**
**Updated Code**
```
(partitionCount)  (topicCount)  Mode  Cnt           Score         Error  Units
              10           500  avgt   15   1,437,036.379 ±   10187.868  ns/op
              10          1000  avgt   15   2,957,572.571 ±   39259.772  ns/op
              10          5000  avgt   15  40,949,310.393 ±  354110.400  ns/op
              20           500  avgt   15   2,493,094.563 ±   18131.215  ns/op
              20          1000  avgt   15   5,224,699.766 ±  198612.245  ns/op
              20          5000  avgt   15  55,643,648.154 ±  935800.708  ns/op
              50           500  avgt   15   5,731,310.891 ±  289599.505  ns/op
              50          1000  avgt   15  11,708,291.589 ±   63063.128  ns/op
              50          5000  avgt   15  94,717,768.691 ± 1248511.062  ns/op
```
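For reference, these harnesses follow the standard JMH shape. The sketch below is illustrative only (the class name, the `Integer`-keyed stand-in for the image, and the `totalTopicCount` parameter are placeholders, not the patch's actual benchmark code); it shows how a parameterized single-record-change measurement can be structured:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.organicdesign.fp.collections.ImMap;
import org.organicdesign.fp.collections.PersistentHashMap;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class TopicsImageSingleRecordChangeSketch {

    // Mirrors the "Total Topic Count" column in the tables above.
    @Param({"12500", "25000", "50000", "100000"})
    public int totalTopicCount;

    // Hypothetical stand-in for the pre-built TopicsImage.
    private ImMap<Integer, String> image;

    @Setup(Level.Trial)
    public void setup() {
        ImMap<Integer, String> m = PersistentHashMap.empty();
        for (int i = 0; i < totalTopicCount; i++) {
            m = m.assoc(i, "topic-" + i);
        }
        image = m;
    }

    @Benchmark
    public Object createOneTopic() {
        // The measured operation: one single-topic change against an image of
        // totalTopicCount topics. Returning the result keeps JMH from
        // dead-code-eliminating it. With a persistent map this cost should be
        // flat in totalTopicCount; with a full copy it grows linearly.
        return image.assoc(totalTopicCount, "new-topic");
    }
}
```

Benchmarks in this module can be run with the jmh-benchmarks wrapper, e.g. `./jmh-benchmarks/jmh.sh TopicsImageSingleRecordChangeBenchmark`.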
There are two other benchmarks in the set that seem to perform comparably with and without the patch.

**KRaftMetadataRequestBenchmark.testRequestToJson**
**Current Code, unpatched**
```
(partitionCount)  (topicCount)  Mode  Cnt    Score    Error  Units
              10           500  avgt   15  674.874 ± 19.315  ns/op
              10          1000  avgt   15  643.048 ±  4.638  ns/op
              10          5000  avgt   15  696.829 ± 23.999  ns/op
              20           500  avgt   15  672.617 ±  4.812  ns/op
              20          1000  avgt   15  674.492 ±  6.206  ns/op
              20          5000  avgt   15  677.546 ±  2.301  ns/op
              50           500  avgt   15  682.702 ±  3.841  ns/op
              50          1000  avgt   15  634.786 ±  6.009  ns/op
              50          5000  avgt   15  678.107 ±  8.479  ns/op
```

**KRaftMetadataRequestBenchmark.testRequestToJson**
**Updated Code**
```
(partitionCount)  (topicCount)  Mode  Cnt    Score    Error  Units
              10           500  avgt   15  678.702 ±  3.331  ns/op
              10          1000  avgt   15  659.232 ±  3.126  ns/op
              10          5000  avgt   15  678.725 ±  5.893  ns/op
              20           500  avgt   15  666.064 ±  2.042  ns/op
              20          1000  avgt   15  670.959 ±  2.950  ns/op
              20          5000  avgt   15  670.517 ±  2.473  ns/op
              50           500  avgt   15  672.154 ±  7.125  ns/op
              50          1000  avgt   15  665.008 ±  2.272  ns/op
              50          5000  avgt   15  669.210 ± 27.191  ns/op
```

**KRaftMetadataRequestBenchmark.testTopicIdInfo**
**Current Code, unpatched**
```
(partitionCount)  (topicCount)  Mode  Cnt   Score   Error  Units
              10           500  avgt   15  10.760 ± 0.078  ns/op
              10          1000  avgt   15  11.118 ± 0.281  ns/op
              10          5000  avgt   15  10.882 ± 0.192  ns/op
              20           500  avgt   15  10.822 ± 0.121  ns/op
              20          1000  avgt   15  10.966 ± 0.569  ns/op
              20          5000  avgt   15  10.764 ± 0.120  ns/op
              50           500  avgt   15  10.823 ± 0.081  ns/op
              50          1000  avgt   15  10.755 ± 0.154  ns/op
              50          5000  avgt   15  10.694 ± 0.068  ns/op
```

**KRaftMetadataRequestBenchmark.testTopicIdInfo**
**Updated Code**
```
(partitionCount)  (topicCount)  Mode  Cnt   Score   Error  Units
              10           500  avgt   15  10.493 ± 0.056  ns/op
              10          1000  avgt   15  10.507 ± 0.059  ns/op
              10          5000  avgt   15  10.455 ± 0.055  ns/op
              20           500  avgt   15  10.424 ± 0.035  ns/op
              20          1000  avgt   15  10.476 ± 0.139  ns/op
              20          5000  avgt   15  10.454 ± 0.079  ns/op
              50           500  avgt   15  10.891 ± 0.152  ns/op
              50          1000  avgt   15  10.509 ± 0.073  ns/op
              50          5000  avgt   15  10.479 ± 0.059  ns/op
```

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)