Hi Feiyan,

It did look like a bug. Could you open a bug in JIRA here
<https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-13735?filter=allopenissues>
?
One thing I don't fully understand from the unit test is that, you have
output "error distributions: $tpsWithoutResize".
I thought the bug is about the unevenly partition distribution after
resizing fetch thread, what do you mean by "error distribution"?
Could you please also elaborate in JIRA ticket?

Also, welcome to submit a PR to it, since you've already had fix and test
ready! :)

Thank you.
Luke



On Sat, Mar 26, 2022 at 2:56 PM Feiyan Yu <dbtr...@gmail.com> wrote:

> Howdy!
>
> I found that the method of resizing fetching thread had one potential bug
> which lead to imbalanced partitions distribution for each thread after
> changing thread number on dynamic configuration.
>
> To figure it out, I added a primitive unit test method "testResize" in
> "AbstractFetcherThreadTest" to simulate the replica fetchers resizing
> process from 10 threads to 60 threads with originally 10 topics and 100
> partitions each. I designed the test to show that after the resizing
> process, all partitions should be redistributed correctly based on the new
> thread number. However, the test failed because when I tried to compare the
> "fetcherThreadMap" with the fetcherId for each topic-partition, the
> fetcherId mismatched! The unit test I added is in this commit (
> https://github.com/yufeiyan1220/kafka/commit/eb99b7499b416cdeb44c2ccd3ea55a1e38ea3d60),
> and the standard output of the unit test showed in attachment.
>
> I doubt that maybe it is because the method "addFetcherForPartitions"
> which maybe adds some new fetchers to "fetcherThreadMap" called in the
> block of iterating the "fetcherThreadMap", and the iterator ignore some
> fetchers, which leads to some of the fetchers remain their topic
> partitions. And it leads to the imbalanced partition distribution pattern
> in resizing.
>
> To solve this issue I make a mutable map to store all partitions and its
> fetch offset, and then add it back once out of the iteration. I make
> another commit (
> https://github.com/yufeiyan1220/kafka/commit/0a793dfca2ab9b8e8b45ba5693359960f3e306eb),
> and the new resize method passed the unit test.
>
> I'm not sure whether it is an issue that Kafka Community need to fix. But
> for me, it affects the fetching efficiency when I try to deploy Kafka cross
> regions with high network latency.
>
> I'd really appreciate to hear from you!
>
>

Reply via email to