Shichao An created KAFKA-19633:
----------------------------------
Summary: Kafka Connect connectors sent out zombie records during
rebalance
Key: KAFKA-19633
URL: https://issues.apache.org/jira/browse/KAFKA-19633
Project: Kafka
Issue Type: Bug
Components: connect
Affects Versions: 3.2.0
Reporter: Shichao An
Hi, we run Debezium connectors on Kafka Connect. We identified several "zombie"
records that were delivered by the connectors during or after a rebalance.
Since the downstream consumers require ordering, this issue breaks several
things that were built on top of that ordering guarantee.
Here is an overview of the setup:
* Connector type: Debezium Mongo Connector
* Kafka Connect version: 3.2
* Number of workers: 3-4
* Kafka producer configs: at-least-once settings, with acks=all and
max.in.flight.requests.per.connection=1 (see the sketch after this list)
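For reference, here is a minimal sketch of a producer built with the settings above, using the plain Java client. The class name, method name, and serializers are illustrative, not copied from our deployment; in Connect itself we set these through producer.* overrides in the worker config.
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class AtLeastOnceProducerSketch {
    static KafkaProducer<byte[], byte[]> build(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        // Require acknowledgement from all in-sync replicas.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // At most one unacknowledged request per connection, so retries cannot reorder records.
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        return new KafkaProducer<>(props);
    }
}
{code}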
The following conclusions are based on our investigation:
{quote}When a Kafka Connect worker (part of a connector cluster) is overloaded
or degraded, the connector on it may become temporarily unhealthy. The Kafka
Connect cluster will rebalance the connector by "moving" it to another worker.
When the connector is started on the new worker, events resume normally
without any data loss; depending on the previously committed offsets, there
may be a small number of duplicate events due to replay, but the total
ordering is still guaranteed.
However, the producer on the old worker may not have been gracefully shut down.
When the old worker recovered, some old events that were already sitting in the
producer's internal queue were sent out to Kafka before the producer was
forcefully closed. This caused the "out-of-band" duplicate events, which we
refer to as "ghost duplicates" or "zombie records".
{quote}
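If this conclusion holds, one prevention we considered on the client side is producer fencing via transactions: if each task's producer used a stable transactional.id, a restarted instance would bump the producer epoch on initTransactions(), and the broker would reject writes from the stale instance. Below is a minimal sketch with the plain Java client; the transactional.id scheme is hypothetical, and we know this is not how Connect currently manages its source-task producers.
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class FencedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        // Hypothetical id scheme: one stable transactional.id per task. A newer instance
        // calling initTransactions() bumps the epoch and fences the older one.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-connector-task-0");
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("events", new byte[0], new byte[0]));
            producer.commitTransaction();
        } catch (ProducerFencedException e) {
            // This instance is the zombie: a newer producer with the same
            // transactional.id has initialized, so the broker rejects our writes.
            // The correct reaction is to close the producer and stop retrying.
        }
    }
}
{code}
We believe KIP-618 (exactly-once support for source connectors) applies this kind of zombie fencing in newer releases, but we have not verified that against our setup.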
Can you verify our conclusion, and do you have any recommendations for a
potential fix or prevention?