[ 
https://issues.apache.org/jira/browse/KAFKA-17862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-17862:
------------------------------
    Fix Version/s: 4.2.0

> [buffer pool] corruption during buffer reuse from the pool
> ----------------------------------------------------------
>
>                 Key: KAFKA-17862
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17862
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, core, producer 
>    Affects Versions: 3.7.1
>            Reporter: Bharath Vissapragada
>            Assignee: xuanzhang gong
>            Priority: Blocker
>             Fix For: 4.2.0
>
>         Attachments: client-config.txt
>
>
> We noticed malformed batches sent by the Kafka Java client to Redpanda under 
> certain conditions, which caused excessive client retries. We narrowed it 
> down to a client bug in which buffers reused from the buffer pool are 
> corrupted. We were able to reproduce it with Kafka brokers too, so we are 
> fairly certain the bug is in the client.
> (The full client config is attached, FWIW.)
> The cause is a race condition between produce requests and failed 
> batch expiration. If the network flush of a produce request races with the 
> expiration, the produce batch that the request uses is corrupted, so a 
> malformed batch is sent to the broker.
> The expiration is triggered by a timeout 
> [https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L392C13-L392C22]
> that eventually deallocates the batch
> [https://github.com/apache/kafka/blob/2c6fb6c54472e90ae17439e62540ef3cb0426fe3/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L773],
> adding its buffer back to the buffer pool
> [https://github.com/apache/kafka/blob/661bed242e8d7269f134ea2f6a24272ce9b720e9/clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java#L1054].
> At this point the buffer is probably either all zeroed out, or a competing 
> producer requests a new append that reuses the freed buffer and starts 
> writing to it, corrupting its contents.
> If a network flush of a produce batch backed by this buffer races with that, 
> a corrupt batch is sent to the broker, resulting in a CRC mismatch.
> This issue can be easily reproduced in a simulated environment that triggers 
> frequent timeouts (e.g. by lowering the timeouts) and uses a producer with 
> high-ish throughput, which causes longer queues (hence a higher chance of 
> expiration) and frequent buffer reuse from the pool (a deadly combination :)).
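
The interleaving described above can be replayed deterministically in a small, single-threaded sketch. This is not the Kafka client code: ToyPool, crcOf and simulate are hypothetical names made up for illustration, java.util.zip.CRC32 stands in for whatever checksum the batch format uses, and the "in-flight request" is modeled as nothing more than a second reference to the same buffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.zip.CRC32;

public class BufferReuseSketch {
    // Hypothetical stand-in for a buffer pool: freed buffers are handed out again.
    static final class ToyPool {
        private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
        private final int size;

        ToyPool(int size) { this.size = size; }

        ByteBuffer allocate() {
            ByteBuffer b = free.pollFirst();
            return b != null ? b : ByteBuffer.allocate(size);
        }

        void deallocate(ByteBuffer b) {
            b.clear();          // resets position/limit only; the bytes stay in place
            free.addFirst(b);
        }
    }

    // Checksum over the first `length` bytes of the buffer's backing array.
    static long crcOf(ByteBuffer buf, int length) {
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, length);
        return crc.getValue();
    }

    /** Returns true if the bytes read at flush time no longer match the checksum taken at close time. */
    static boolean simulate() {
        ToyPool pool = new ToyPool(32);

        // 1. Batch A is written and closed; its checksum is fixed here.
        ByteBuffer batchA = pool.allocate();
        byte[] payloadA = "records-of-batch-A".getBytes(StandardCharsets.US_ASCII);
        batchA.put(payloadA);
        long crcAtClose = crcOf(batchA, payloadA.length);

        // 2. A request for batch A is in flight and still references the same memory (no copy).
        ByteBuffer inFlight = batchA;

        // 3. The batch expires and its buffer goes back to the pool.
        pool.deallocate(batchA);

        // 4. A new append gets the same buffer from the pool and overwrites the first bytes.
        ByteBuffer batchB = pool.allocate();
        batchB.put("BBBB".getBytes(StandardCharsets.US_ASCII));

        // 5. The delayed network flush now reads batch A's bytes from the shared buffer.
        long crcAtFlush = crcOf(inFlight, payloadA.length);
        return crcAtFlush != crcAtClose;
    }

    public static void main(String[] args) {
        System.out.println("checksum mismatch at flush: " + simulate());
    }
}
```

In the real client, steps 3 to 5 would run on different threads, so the outcome depends on timing; the sketch only fixes one unlucky ordering to show why a buffer must not return to the pool while a request still references it.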

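For the reproduction, a producer configuration along the following lines might be a starting point. The property names are standard producer configs, but every value below is an assumption chosen only to make timeouts frequent; none of them is taken from the attached client-config.txt, and the bootstrap address is a placeholder.

```java
import java.util.Properties;

public class ReproConfigSketch {
    // Hypothetical settings meant to make batch expiration frequent; values are guesses.
    static Properties reproProps() {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        p.setProperty("linger.ms", "0");
        p.setProperty("request.timeout.ms", "200");           // far below the default
        // delivery.timeout.ms should be >= linger.ms + request.timeout.ms
        p.setProperty("delivery.timeout.ms", "300");
        return p;
    }

    public static void main(String[] args) {
        reproProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These properties would be passed to a producer that sends at a high rate, so that batches queue up long enough to expire while requests are still in flight.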


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
