lhotari commented on issue #25046: URL: https://github.com/apache/pulsar/issues/25046#issuecomment-3630829888
> Currently, my goal is to identify the root cause of partition imbalance in version 3.0.10.

@g0715158 In the OSS project, we don't maintain specific versions such as 3.0.10. The 3.0.x line continues to be maintained, but the latest released version is 3.0.15. For this case, please attempt to reproduce with 4.1.2 as I suggested before. If you cannot reproduce it there, that gives you a lot more information for identifying the root cause in version 3.0.10.

> Approximately 1000 consumers per partition experience performance degradation after consuming for about 9 minutes, and partition imbalance can be observed from the console. I would like to ask if such a situation has occurred in any existing issues?

Yes, I've seen that happen. The imbalance is common, but the case where consuming stops completely might be a different issue, such as #24926. However, in the stats that you shared, there are many cases where the backlog is 0 for the partitions that have an out rate of 0. Is this test case a scenario where producers are actively producing and consumers are following? Or is it a "catch-up scenario" where consumers consume an existing backlog?

In your test scenario, you didn't mention anything about the client side. How many separate client instances and/or client connections do you have? How well is the client side tuned? For example, see https://pulsar.apache.org/docs/next/client-libraries-java-setup/#java-client-performance

When you are creating a large number of Java client instances in a single JVM, it's necessary to share resources. There's an example in branch-4.0 in this test: https://github.com/apache/pulsar/blob/branch-4.0/pulsar-broker/src/test/java/org/apache/pulsar/client/api/PatternConsumerBackPressureMultipleConsumersTest.java#L238-L280
For 4.1+, there's PIP-234 `PulsarClientSharedResources`: https://github.com/apache/pulsar/blob/270120ce6e33e5a084397ca31186f1bb87835e48/pulsar-broker/src/test/java/org/apache/pulsar/client/api/PatternConsumerBackPressureMultipleConsumersTest.java#L103-L123

If you are actively producing to partitions in a test case, one common issue for test scenarios is the producing side. It's also possible that the producing side doesn't produce evenly across partitions. One way to solve this is to produce individually to specific partitions (*-partition-0, *-partition-1, ...) in the load generator and to have enough separate nodes producing the messages so that the bottleneck isn't in the producing clients.

On the producer side, using a multi-topic (partitioned) producer will also have more variance across partitions due to the default use of `RoundRobinPartitionMessageRouterImpl`: https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/RoundRobinPartitionMessageRouterImpl.java

The setting can be controlled with https://github.com/apache/pulsar/blob/cc5e479d63103f81e3af833e8b06227d1a6563e1/pulsar-client-api/src/main/java/org/apache/pulsar/client/api/ProducerBuilder.java#L462-L474

The defaults are time-based for both routing and batching. For testing purposes, if a partitioned producer is used, it could be better to use count-based routing, configure batching with a long `batchingMaxPublishDelay`, and use `batchingMaxMessages` to achieve similarly sized batches each time.
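As a rough illustration of the count-based routing idea mentioned above, here is a minimal sketch in plain Java. `CountBasedPartitionSelector` and `messagesPerPartitionSwitch` are hypothetical names and not part of the Pulsar client API. The sketch only shows the partition arithmetic: stay on one partition for a fixed number of messages, then move to the next. Presumably this logic could be wrapped in a custom `MessageRouter` passed to `ProducerBuilder.messageRouter(...)`, but that wiring is an assumption and is not shown.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not a Pulsar class): picks a partition based on how
 * many messages have been routed so far, instead of on elapsed time.
 */
final class CountBasedPartitionSelector {
    // Number of consecutive messages sent to one partition before switching.
    private final int messagesPerPartitionSwitch;
    // Total number of messages routed so far; thread-safe counter.
    private final AtomicLong counter = new AtomicLong();

    CountBasedPartitionSelector(int messagesPerPartitionSwitch) {
        if (messagesPerPartitionSwitch <= 0) {
            throw new IllegalArgumentException("messagesPerPartitionSwitch must be positive");
        }
        this.messagesPerPartitionSwitch = messagesPerPartitionSwitch;
    }

    /** Returns the partition index for the next message. */
    int nextPartition(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        long index = counter.getAndIncrement();
        // Messages 0..k-1 go to partition 0, k..2k-1 to partition 1, and so on,
        // wrapping around after the last partition.
        return (int) ((index / messagesPerPartitionSwitch) % numPartitions);
    }
}
```

If `messagesPerPartitionSwitch` were set to the same value as `batchingMaxMessages`, each batch would be filled for a single partition before the switch, which should keep batch sizes and per-partition message counts close to equal in a load test. Whether that matches the intent of a given test setup is an assumption to verify.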
