[jira] [Updated] (KAFKA-17182) Consumer fetch sessions are evicted too quickly with AsyncKafkaConsumer

Kirk True (Jira) Mon, 13 Jan 2025 11:08:09 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kirk True updated KAFKA-17182:
------------------------------
    Description: 
h1. Background

{{Consumer}} implementations fetch data from the cluster and temporarily buffer 
it in memory until the user next calls {{{}Consumer.poll(){}}}. When a fetch 
request is being generated, partitions that already have buffered data are not 
included in the fetch request.

The {{ClassicKafkaConsumer}} performs much of its fetch logic and network I/O 
in the application thread. On {{{}poll(){}}}, if there is any locally-buffered 
data, the {{ClassicKafkaConsumer}} does not fetch _any_ new data and simply 
returns the buffered data to the user from {{{}poll(){}}}.

On the other hand, the {{AsyncKafkaConsumer}} consumer splits its logic and 
network I/O between two threads, which results in a potential race condition 
during fetch. The {{AsyncKafkaConsumer}} also checks for buffered data on its 
application thread. If it finds there is none, it signals the background thread 
to create a fetch request. However, it's possible for the background thread to 
receive data from a previous fetch and buffer it before the fetch request logic 
starts. When that occurs, as the background thread creates a new fetch request, 
it skips any buffered data, which has the unintended result that those 
partitions get added to the fetch request's "to remove" set. This signals to 
the broker to remove those partitions from its internal cache.

This issue is technically possible in the {{ClassicKafkaConsumer}} too, since 
the heartbeat thread performs network I/O in addition to the application 
thread. However, because of the frequency at which the 
{{{}AsyncKafkaConsumer{}}}'s background thread runs, it is ~100x more likely to 
happen.
h1. Options

The core decision is: what should the background thread do if it is asked to 
create a fetch request and it discovers there's buffered data. There were 
multiple proposals to address this issue in the {{{}AsyncKafkaConsumer{}}}. 
Among them are:
 # The background thread should omit buffered partitions from the fetch request 
as before (this is the existing behavior)
 # The background thread should skip the fetch request generation entirely if 
there are _any_ buffered partitions
 # The background thread should include buffered partitions in the fetch 
request, but use a small “max bytes” value
 # The background thread should skip fetching from the nodes that have buffered 
partitions

Option 3 won out. The change in {{AsyncKafkaConsumer}} is to include in the 
fetch request any partition with buffered data. By using a "max bytes" size of 
1, this should cause the fetch response to return as little data as possible. 
In that way, the consumer doesn't buffer too much data on the client before it 
can be returned from {{{}poll(){}}}.
h1. Testing
h2. Eviction rate testing

Here are the results of our internal stress testing:
 * {{{}ClassicKafkaConsumer{}}}—after the initial spike during test start up, 
the average rate settles down to ~0.14 evictions/second 
[!https://private-user-images.githubusercontent.com/92057/389141955-b13c46a2-226f-44c9-a8c5-d6dc0d38d40e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTUtYjEzYzQ2YTItMjI2Zi00NGM5LWE4YzUtZDZkYzBkMzhkNDBlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVlMzc3MGI0YzQ3YjMzNDAyMzk4ZTVjNDc0MzMzNWQ3OWI5MzZhN2M4ZmJhMTNkMzU2ODhiMzg4YmM2NDI5NGImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.EyVhI7-v_crz8R465PVuKqZoqzDoImal8SBlCOFitCY|width=1111,height=400!|https://private-user-images.githubusercontent.com/92057/389141955-b13c46a2-226f-44c9-a8c5-d6dc0d38d40e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTUtYjEzYzQ2YTItMjI2Zi00NGM5LWE4YzUtZDZkYzBkMzhkNDBlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVlMzc3MGI0YzQ3YjMzNDAyMzk4ZTVjNDc0MzMzNWQ3OWI5MzZhN2M4ZmJhMTNkMzU2ODhiMzg4YmM2NDI5NGImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.EyVhI7-v_crz8R465PVuKqZoqzDoImal8SBlCOFitCY]
 * {{{}AsyncKafkaConsumer{}}}, (w/o fix)—after startup, the evictions still 
settle down, but they are about 100x higher than the {{ClassicKafkaConsumer}} 
at ~1.48 evictions/second 
[!https://private-user-images.githubusercontent.com/92057/389141959-dca5ff7f-74bd-4174-b6e6-39c4e8479684.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTktZGNhNWZmN2YtNzRiZC00MTc0LWI2ZTYtMzljNGU4NDc5Njg0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQwOTVkMTk3ZTYyMTFlYjlkMmNkN2MwYzJhNjVhZWM1MGNmMWZjYzQ0YzRmZGRkNTFjZWQ3MTc4ZWY0OTk1ZTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.66SRL4hvz-2omy0NGwbb5apktUAkoJ5Oh7IrgFtG-N4|width=1106,height=400!|https://private-user-images.githubusercontent.com/92057/389141959-dca5ff7f-74bd-4174-b6e6-39c4e8479684.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTktZGNhNWZmN2YtNzRiZC00MTc0LWI2ZTYtMzljNGU4NDc5Njg0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQwOTVkMTk3ZTYyMTFlYjlkMmNkN2MwYzJhNjVhZWM1MGNmMWZjYzQ0YzRmZGRkNTFjZWQ3MTc4ZWY0OTk1ZTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.66SRL4hvz-2omy0NGwbb5apktUAkoJ5Oh7IrgFtG-N4]
 * {{AsyncKafkaConsumer}} (w/ fix)—the eviction rate is now closer to the 
{{ClassicKafkaConsumer}} at ~0.22 evictions/second 
[!https://private-user-images.githubusercontent.com/92057/389141958-19009791-d63e-411d-96ed-b49605f93325.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTgtMTkwMDk3OTEtZDYzZS00MTFkLTk2ZWQtYjQ5NjA1ZjkzMzI1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTM4ODU1ODNkYmEzYTliMjRkZGY3MDMyMGExYmZmY2VjNzM0OTJkMDNmZDMyZmY0M2QwMmRhOWRmNDRiODY2NWImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.8D90EW8XJBDJUANqhlHtxmJgKToKJWKqcfP3EiJmbPc|width=1110,height=400!|https://private-user-images.githubusercontent.com/92057/389141958-19009791-d63e-411d-96ed-b49605f93325.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTgtMTkwMDk3OTEtZDYzZS00MTFkLTk2ZWQtYjQ5NjA1ZjkzMzI1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTM4ODU1ODNkYmEzYTliMjRkZGY3MDMyMGExYmZmY2VjNzM0OTJkMDNmZDMyZmY0M2QwMmRhOWRmNDRiODY2NWImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.8D90EW8XJBDJUANqhlHtxmJgKToKJWKqcfP3EiJmbPc]

h2. {{EndToEndLatency}} testing

The bundled {{EndToEndLatency}} test runner was executed on a single machine 
using Docker. The {{apache/kafka:latest}} Docker image was used and either the 
{{cluster/combined/plaintext/docker-compose.yml}} or 
{{single-node/plaintext/docker-compose.yml}} Docker Compose configuration 
files, depending on the test. The Docker containers were recreated from scratch 
before each test.

A single topic was created with 30 partitions and with a replication factor of 
either 1 or 3, depending on a single- or multi-node setup.

For each of the test runs these argument values were used:
 * Message count: 100000
 * {{{}acks{}}}: 1
 * Message size: 128 bytes

A configuration file which contained a single configuration value of 
{{group.protocol=<$group_protocol>}} was also provided to the test, where 
{{$group_protocol}} was either {{CLASSIC}} or {{{}CONSUMER{}}}.
h3. Test results
h4. Test 1—{{{}CLASSIC{}}} group protocol, cluster size: 3 nodes, replication 
factor: 3
||Metric||{{trunk}}||PR||
|Average latency|1.4901|1.4871|
|50th percentile|1|1|
|99th percentile|3|3|
|99.9th percentile|6|6|
h4. Test 2—{{{}CONSUMER{}}} group protocol, cluster size: 3 nodes, replication 
factor: 3
||Metric||{{trunk}}||PR||
|Average latency|1.4704|1.4807|
|50th percentile|1|1|
|99th percentile|3|3|
|99.9th percentile|6|7|
h4. Test 3—{{{}CLASSIC{}}} group protocol, cluster size: 1 node, replication 
factor: 1
||Metric||{{trunk}}||PR||
|Average latency|1.0777|1.0193|
|50th percentile|1|1|
|99th percentile|2|2|
|99.9th percentile|5|4|
h4. Test 4—{{{}CONSUMER{}}} group protocol, cluster size: 1 node, replication 
factor: 1
||Metric||{{trunk}}||PR||
|Average latency|1.0937|1.0503|
|50th percentile|1|1|
|99th percentile|2|2|
|99.9th percentile|4|4|
h3. Conclusion

These tests did not reveal any significant differences between the current 
fetcher logic on {{trunk}} and the one proposed in this PR. Addition test runs 
using larger message counts and/or larger message sizes did not affect the 
result.

  was:
In stress testing the new consumer, the new consumer is evicting fetch sessions 
on the broker much more frequently than expected. There is an ongoing 
investigation into this behavior, but it appears to stem from a race condition 
due to the design of the new consumer.

In the background thread, fetch requests are sent in a near continuous fashion 
for partitions that are "fetchable." A timing bug appears to cause partitions 
to be "unfetchable," which then causes them to end up in the "removed" set of 
partitions. The broker then removes them from the fetch session, which causes 
the number of remaining partitions for that session to drop below a threshold 
that allows it to be evicted by another competing session. Within a few 
milliseconds, though, the partitions become "fetchable" again, and are added to 
the "added" set of partitions on the next fetch request. This causes thrashing 
on both the client and broker sides as both are handling a steady stream of 
evictions, which negatively affects consumption throughput.


> Consumer fetch sessions are evicted too quickly with AsyncKafkaConsumer
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-17182
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17182
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 3.8.0
>            Reporter: Kirk True
>            Assignee: Kirk True
>            Priority: Blocker
>              Labels: consumer-threading-refactor
>             Fix For: 4.0.0
>
>
> h1. Background
> {{Consumer}} implementations fetch data from the cluster and temporarily 
> buffer it in memory until the user next calls {{{}Consumer.poll(){}}}. When a 
> fetch request is being generated, partitions that already have buffered data 
> are not included in the fetch request.
> The {{ClassicKafkaConsumer}} performs much of its fetch logic and network I/O 
> in the application thread. On {{{}poll(){}}}, if there is any 
> locally-buffered data, the {{ClassicKafkaConsumer}} does not fetch _any_ new 
> data and simply returns the buffered data to the user from {{{}poll(){}}}.
> On the other hand, the {{AsyncKafkaConsumer}} consumer splits its logic and 
> network I/O between two threads, which results in a potential race condition 
> during fetch. The {{AsyncKafkaConsumer}} also checks for buffered data on its 
> application thread. If it finds there is none, it signals the background 
> thread to create a fetch request. However, it's possible for the background 
> thread to receive data from a previous fetch and buffer it before the fetch 
> request logic starts. When that occurs, as the background thread creates a 
> new fetch request, it skips any buffered data, which has the unintended 
> result that those partitions get added to the fetch request's "to remove" 
> set. This signals to the broker to remove those partitions from its internal 
> cache.
> This issue is technically possible in the {{ClassicKafkaConsumer}} too, since 
> the heartbeat thread performs network I/O in addition to the application 
> thread. However, because of the frequency at which the 
> {{{}AsyncKafkaConsumer{}}}'s background thread runs, it is ~100x more likely 
> to happen.
> h1. Options
> The core decision is: what should the background thread do if it is asked to 
> create a fetch request and it discovers there's buffered data. There were 
> multiple proposals to address this issue in the {{{}AsyncKafkaConsumer{}}}. 
> Among them are:
>  # The background thread should omit buffered partitions from the fetch 
> request as before (this is the existing behavior)
>  # The background thread should skip the fetch request generation entirely if 
> there are _any_ buffered partitions
>  # The background thread should include buffered partitions in the fetch 
> request, but use a small “max bytes” value
>  # The background thread should skip fetching from the nodes that have 
> buffered partitions
> Option 3 won out. The change in {{AsyncKafkaConsumer}} is to include in the 
> fetch request any partition with buffered data. By using a "max bytes" size 
> of 1, this should cause the fetch response to return as little data as 
> possible. In that way, the consumer doesn't buffer too much data on the 
> client before it can be returned from {{{}poll(){}}}.
> h1. Testing
> h2. Eviction rate testing
> Here are the results of our internal stress testing:
>  * {{{}ClassicKafkaConsumer{}}}—after the initial spike during test start up, 
> the average rate settles down to ~0.14 evictions/second 
> [!https://private-user-images.githubusercontent.com/92057/389141955-b13c46a2-226f-44c9-a8c5-d6dc0d38d40e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTUtYjEzYzQ2YTItMjI2Zi00NGM5LWE4YzUtZDZkYzBkMzhkNDBlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVlMzc3MGI0YzQ3YjMzNDAyMzk4ZTVjNDc0MzMzNWQ3OWI5MzZhN2M4ZmJhMTNkMzU2ODhiMzg4YmM2NDI5NGImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.EyVhI7-v_crz8R465PVuKqZoqzDoImal8SBlCOFitCY|width=1111,height=400!|https://private-user-images.githubusercontent.com/92057/389141955-b13c46a2-226f-44c9-a8c5-d6dc0d38d40e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTUtYjEzYzQ2YTItMjI2Zi00NGM5LWE4YzUtZDZkYzBkMzhkNDBlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWVlMzc3MGI0YzQ3YjMzNDAyMzk4ZTVjNDc0MzMzNWQ3OWI5MzZhN2M4ZmJhMTNkMzU2ODhiMzg4YmM2NDI5NGImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.EyVhI7-v_crz8R465PVuKqZoqzDoImal8SBlCOFitCY]
>  * {{{}AsyncKafkaConsumer{}}}, (w/o fix)—after startup, the evictions still 
> settle down, but they are about 100x higher than the {{ClassicKafkaConsumer}} 
> at ~1.48 evictions/second 
> [!https://private-user-images.githubusercontent.com/92057/389141959-dca5ff7f-74bd-4174-b6e6-39c4e8479684.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTktZGNhNWZmN2YtNzRiZC00MTc0LWI2ZTYtMzljNGU4NDc5Njg0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQwOTVkMTk3ZTYyMTFlYjlkMmNkN2MwYzJhNjVhZWM1MGNmMWZjYzQ0YzRmZGRkNTFjZWQ3MTc4ZWY0OTk1ZTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.66SRL4hvz-2omy0NGwbb5apktUAkoJ5Oh7IrgFtG-N4|width=1106,height=400!|https://private-user-images.githubusercontent.com/92057/389141959-dca5ff7f-74bd-4174-b6e6-39c4e8479684.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTktZGNhNWZmN2YtNzRiZC00MTc0LWI2ZTYtMzljNGU4NDc5Njg0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWQwOTVkMTk3ZTYyMTFlYjlkMmNkN2MwYzJhNjVhZWM1MGNmMWZjYzQ0YzRmZGRkNTFjZWQ3MTc4ZWY0OTk1ZTMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.66SRL4hvz-2omy0NGwbb5apktUAkoJ5Oh7IrgFtG-N4]
>  * {{AsyncKafkaConsumer}} (w/ fix)—the eviction rate is now closer to the 
> {{ClassicKafkaConsumer}} at ~0.22 evictions/second 
> [!https://private-user-images.githubusercontent.com/92057/389141958-19009791-d63e-411d-96ed-b49605f93325.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTgtMTkwMDk3OTEtZDYzZS00MTFkLTk2ZWQtYjQ5NjA1ZjkzMzI1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTM4ODU1ODNkYmEzYTliMjRkZGY3MDMyMGExYmZmY2VjNzM0OTJkMDNmZDMyZmY0M2QwMmRhOWRmNDRiODY2NWImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.8D90EW8XJBDJUANqhlHtxmJgKToKJWKqcfP3EiJmbPc|width=1110,height=400!|https://private-user-images.githubusercontent.com/92057/389141958-19009791-d63e-411d-96ed-b49605f93325.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzY3OTU0OTksIm5iZiI6MTczNjc5NTE5OSwicGF0aCI6Ii85MjA1Ny8zODkxNDE5NTgtMTkwMDk3OTEtZDYzZS00MTFkLTk2ZWQtYjQ5NjA1ZjkzMzI1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAxMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMTEzVDE5MDYzOVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTM4ODU1ODNkYmEzYTliMjRkZGY3MDMyMGExYmZmY2VjNzM0OTJkMDNmZDMyZmY0M2QwMmRhOWRmNDRiODY2NWImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.8D90EW8XJBDJUANqhlHtxmJgKToKJWKqcfP3EiJmbPc]
> h2. {{EndToEndLatency}} testing
> The bundled {{EndToEndLatency}} test runner was executed on a single machine 
> using Docker. The {{apache/kafka:latest}} Docker image was used and either 
> the {{cluster/combined/plaintext/docker-compose.yml}} or 
> {{single-node/plaintext/docker-compose.yml}} Docker Compose configuration 
> files, depending on the test. The Docker containers were recreated from 
> scratch before each test.
> A single topic was created with 30 partitions and with a replication factor 
> of either 1 or 3, depending on a single- or multi-node setup.
> For each of the test runs these argument values were used:
>  * Message count: 100000
>  * {{{}acks{}}}: 1
>  * Message size: 128 bytes
> A configuration file which contained a single configuration value of 
> {{group.protocol=<$group_protocol>}} was also provided to the test, where 
> {{$group_protocol}} was either {{CLASSIC}} or {{{}CONSUMER{}}}.
> h3. Test results
> h4. Test 1—{{{}CLASSIC{}}} group protocol, cluster size: 3 nodes, replication 
> factor: 3
> ||Metric||{{trunk}}||PR||
> |Average latency|1.4901|1.4871|
> |50th percentile|1|1|
> |99th percentile|3|3|
> |99.9th percentile|6|6|
> h4. Test 2—{{{}CONSUMER{}}} group protocol, cluster size: 3 nodes, 
> replication factor: 3
> ||Metric||{{trunk}}||PR||
> |Average latency|1.4704|1.4807|
> |50th percentile|1|1|
> |99th percentile|3|3|
> |99.9th percentile|6|7|
> h4. Test 3—{{{}CLASSIC{}}} group protocol, cluster size: 1 node, replication 
> factor: 1
> ||Metric||{{trunk}}||PR||
> |Average latency|1.0777|1.0193|
> |50th percentile|1|1|
> |99th percentile|2|2|
> |99.9th percentile|5|4|
> h4. Test 4—{{{}CONSUMER{}}} group protocol, cluster size: 1 node, replication 
> factor: 1
> ||Metric||{{trunk}}||PR||
> |Average latency|1.0937|1.0503|
> |50th percentile|1|1|
> |99th percentile|2|2|
> |99.9th percentile|4|4|
> h3. Conclusion
> These tests did not reveal any significant differences between the current 
> fetcher logic on {{trunk}} and the one proposed in this PR. Addition test 
> runs using larger message counts and/or larger message sizes did not affect 
> the result.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (KAFKA-17182) Consumer fetch sessions are evicted too quickly with AsyncKafkaConsumer

Reply via email to