lhotari commented on issue #25046:
URL: https://github.com/apache/pulsar/issues/25046#issuecomment-3630829888

   > Currently, my goal is to identify the root cause of partition imbalance in 
version 3.0.10. 
   
   @g0715158 In the OSS project, we don't maintain specific patch versions such 
as 3.0.10. The 3.0.x line continues to be maintained, but the latest released 
version is 3.0.15.
   For this case, please attempt to reproduce with 4.1.2 as I suggested before. 
If you cannot reproduce it there, that gives you a lot more information for 
identifying the root cause in version 3.0.10.
   
   > Approximately 1000 consumers per partition experience performance 
degradation after consuming for about 9 minutes, and partition imbalance can be 
observed from the console. I would like to ask if such a situation has occurred 
in any existing issues?
   
   Yes, I've seen that happen. The imbalance is common, but the case where 
consuming stops completely might be a different issue, such as #24926.
   
   However, in the stats that you shared, there are many cases where the 
backlog is 0 for the partitions that have an out rate of 0.
   Is this test case a scenario where producers are actively producing and 
consumers are following? Or is it a "catch-up scenario" where consumers 
consume an existing backlog?
   
   In your test scenario, you didn't mention anything about the client side. 
How many separate client instances and/or client connections do you have? How 
well is the client side tuned? For example, 
https://pulsar.apache.org/docs/next/client-libraries-java-setup/#java-client-performance
 ?
   
   When you are creating a large number of Java client instances in a single 
JVM, it's necessary to share resources. There's an example in branch-4.0 in 
this test: 
https://github.com/apache/pulsar/blob/branch-4.0/pulsar-broker/src/test/java/org/apache/pulsar/client/api/PatternConsumerBackPressureMultipleConsumersTest.java#L238-L280
 . 
   For 4.1+, there's PIP-234 `PulsarClientSharedResources`: 
https://github.com/apache/pulsar/blob/270120ce6e33e5a084397ca31186f1bb87835e48/pulsar-broker/src/test/java/org/apache/pulsar/client/api/PatternConsumerBackPressureMultipleConsumersTest.java#L103-L123
   
   If you are actively producing to partitions in a test case, a common 
problem in test scenarios is the producing side. It's also possible that the 
producing side doesn't produce evenly across partitions. One way to solve this 
is to produce directly to the individual partition topics (`*-partition-0`, 
`*-partition-1`, ...) in the load generator and to use a sufficient number of 
separate nodes for producing so that the producing clients aren't the bottleneck.
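   As a rough sketch of the per-partition approach, a load generator could 
derive the partition topic names and split them across its nodes as shown 
here. `PartitionTopics`, `partitionTopicNames` and `topicsForNode` are 
hypothetical helper names for illustration and not part of the Pulsar client; 
only the `-partition-N` naming comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical load-generator helper; not part of the Pulsar client API. */
class PartitionTopics {

    /** Builds the per-partition topic names: topic-partition-0, topic-partition-1, ... */
    static List<String> partitionTopicNames(String topic, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<String> names = new ArrayList<>(numPartitions);
        for (int i = 0; i < numPartitions; i++) {
            names.add(topic + "-partition-" + i);
        }
        return names;
    }

    /** Assigns every nodeCount-th partition topic to the node with the given index. */
    static List<String> topicsForNode(List<String> partitionTopics, int nodeIndex, int nodeCount) {
        if (nodeCount <= 0 || nodeIndex < 0 || nodeIndex >= nodeCount) {
            throw new IllegalArgumentException("invalid node index or node count");
        }
        List<String> assigned = new ArrayList<>();
        for (int i = nodeIndex; i < partitionTopics.size(); i += nodeCount) {
            assigned.add(partitionTopics.get(i));
        }
        return assigned;
    }
}
```

   Each returned name could then be passed to a regular (non-partitioned) 
producer, for example with `client.newProducer().topic(name)`, so that every 
partition gets its own producer and the message rate per partition is 
controlled explicitly by the load generator.
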
   On the producer side, using a multi-topic (partitioned) producer also 
results in more variance across partitions due to the default use of 
https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/RoundRobinPartitionMessageRouterImpl.java
 .
   The setting can be controlled with 
https://github.com/apache/pulsar/blob/cc5e479d63103f81e3af833e8b06227d1a6563e1/pulsar-client-api/src/main/java/org/apache/pulsar/client/api/ProducerBuilder.java#L462-L474
 .
   The defaults are time-based for both routing and batching. For testing 
purposes, it could be better to use count-based routing if a partitioned 
producer is used, and to configure batching with a long `batchingMaxPublishDelay` 
together with `batchingMaxMessages` so that batches are similarly sized each time.
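
   To illustrate what I mean by count-based routing, here is a minimal sketch 
of the selection logic only. `CountBasedPartitionSelector` is a hypothetical 
class and not something that ships with Pulsar. The assumption is that it 
would be wrapped in a custom `MessageRouter` whose 
`choosePartition(Message<?> msg, TopicMetadata metadata)` delegates to it with 
`metadata.numPartitions()`, and that the router is registered with 
`messageRoutingMode(MessageRoutingMode.CustomPartition)` and 
`messageRouter(...)` on the `ProducerBuilder`.

```java
/** Hypothetical count-based partition selector; a sketch, not Pulsar's built-in router. */
class CountBasedPartitionSelector {
    private final int messagesPerPartition;
    private long counter = 0;

    CountBasedPartitionSelector(int messagesPerPartition) {
        if (messagesPerPartition <= 0) {
            throw new IllegalArgumentException("messagesPerPartition must be positive");
        }
        this.messagesPerPartition = messagesPerPartition;
    }

    /**
     * Returns the partition index for the next message. The same partition is
     * returned for messagesPerPartition consecutive calls before moving on to
     * the next partition, independent of wall-clock time.
     */
    synchronized int choosePartition(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        int partition = (int) ((counter / messagesPerPartition) % numPartitions);
        counter++;
        return partition;
    }
}
```

   If `messagesPerPartition` is set to the same value as `batchingMaxMessages` 
and `batchingMaxPublishDelay` is long enough not to trigger first, each run of 
consecutive messages to a partition should end up as one full batch. I haven't 
verified this exact setup, so treat it as a starting point for a test.
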


