[
https://issues.apache.org/jira/browse/KAFKA-12525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sagar Rao resolved KAFKA-12525.
-------------------------------
Resolution: Fixed
> Inaccurate task status due to status record interleaving in fast rebalances
> in Connect
> --------------------------------------------------------------------------------------
>
> Key: KAFKA-12525
> URL: https://issues.apache.org/jira/browse/KAFKA-12525
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 2.3.1, 2.4.1, 2.5.1, 2.7.0, 2.6.1
> Reporter: Konstantine Karantasis
> Assignee: Sagar Rao
> Priority: Major
>
> When a task is stopped in Connect it produces an {{UNASSIGNED}} status
> record.
> Equivalently, when a task is started or restarted in Connect it produces an
> {{RUNNING}} status record in the Connect status topic.
> At the same time, rebalances are decoupled from task start and stop. These
> operations happen in a separate executor, outside of the main worker thread
> that performs the rebalance.
> Normally, any delayed and stale {{UNASSIGNED}} status records are fenced by
> the worker that is sending them. This worker is using the
> {{StatusBackingStore#putSafe}} method that will reject any stale status
> messages (called only for {{UNASSIGNED}} or {{FAILED}}) as long as the worker
> is aware of the newer status record that declares a task as {{RUNNING}}.
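
To make the fencing described above concrete, here is a minimal sketch of a "safe write" check. It is a simplified model under stated assumptions, not the actual Connect code: the class SafeWriteSketch, its Status type, and the onRecordRead method are hypothetical names, and the acceptance rule (same worker, or a generation at least as new as the last record read) is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a worker-side "safe write" check.
class SafeWriteSketch {
    enum State { RUNNING, UNASSIGNED, FAILED }

    // Illustrative status record: task id, state, sending worker, rebalance generation.
    static final class Status {
        final String taskId;
        final State state;
        final String workerId;
        final int generation;

        Status(String taskId, State state, String workerId, int generation) {
            this.taskId = taskId;
            this.state = state;
            this.workerId = workerId;
            this.generation = generation;
        }
    }

    // What this worker has read back from the status topic so far, keyed by task id.
    private final Map<String, Status> lastRead = new HashMap<>();

    // Called when the worker consumes a record from the status topic.
    void onRecordRead(Status status) {
        lastRead.put(status.taskId, status);
    }

    // Returns true if the record would be sent, false if it is fenced as stale.
    // Assumed rule: the write is allowed when nothing is known about the task,
    // when the last record read came from the same worker, or when the last
    // record read is not from a newer generation.
    boolean putSafe(Status status) {
        Status known = lastRead.get(status.taskId);
        return known == null
                || known.workerId.equals(status.workerId)
                || known.generation <= status.generation;
    }

    public static void main(String[] args) {
        SafeWriteSketch workerA = new SafeWriteSketch();
        // Worker A last read its own RUNNING record from generation 1.
        workerA.onRecordRead(new Status("task-0", State.RUNNING, "worker-A", 1));
        Status delayedUnassigned = new Status("task-0", State.UNASSIGNED, "worker-A", 1);

        // Worker A has not yet read the newer RUNNING record, so the write passes.
        System.out.println("before reading newer RUNNING: " + workerA.putSafe(delayedUnassigned));

        // Once worker A reads RUNNING from worker B in generation 2, the write is fenced.
        workerA.onRecordRead(new Status("task-0", State.RUNNING, "worker-B", 2));
        System.out.println("after reading newer RUNNING: " + workerA.putSafe(delayedUnassigned));
    }
}
```

The check can only be as good as the worker's local view of the topic: it fences a stale record only after the newer RUNNING record has been read.
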
> In cases of fast consecutive rebalances where a task is revoked from one
> worker and assigned to another one, a small time window, and thus a race
> condition, has been observed: a {{RUNNING}} status record in the new
> generation is produced and is immediately followed by a delayed
> {{UNASSIGNED}} status record belonging to the same or a previous generation,
> before the worker that sends this message has read the {{RUNNING}} status
> record that corresponds to the latest generation.
> A couple of options are available to remediate this race condition.
> For example, a worker that has started a task can re-write the {{RUNNING}}
> status message in the topic if it reads a stale {{UNASSIGNED}} message from a
> previous generation (that should have been fenced).
> Another option is to ignore stale {{UNASSIGNED}} messages (messages from an
> earlier generation than the one in which the task had {{RUNNING}} status).
> It is worth noting that when this race condition takes place, besides the
> inaccurate status representation, the actual execution of the tasks remains
> unaffected (e.g. the tasks are running correctly even though they appear as
> {{UNASSIGNED}}).
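
The second remediation option in the description (ignoring stale UNASSIGNED messages) could look roughly like the sketch below. It is an illustration only, not necessarily the fix that was merged for this issue: StaleStatusFilterSketch, its Status type, and the apply method are hypothetical names, and the reader is assumed to see the records in topic order.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader-side filter that drops stale UNASSIGNED records.
class StaleStatusFilterSketch {
    enum State { RUNNING, UNASSIGNED, FAILED }

    // Illustrative status record: task id, state, sending worker, rebalance generation.
    static final class Status {
        final String taskId;
        final State state;
        final String workerId;
        final int generation;

        Status(String taskId, State state, String workerId, int generation) {
            this.taskId = taskId;
            this.state = state;
            this.workerId = workerId;
            this.generation = generation;
        }
    }

    // Latest accepted status per task, built by replaying the status topic in order.
    private final Map<String, Status> current = new HashMap<>();

    // Applies a record read from the topic; returns false when it is ignored as stale.
    boolean apply(Status incoming) {
        Status existing = current.get(incoming.taskId);
        // Stale: an UNASSIGNED record from an earlier generation than the one
        // in which the task is already known to be RUNNING.
        boolean stale = existing != null
                && existing.state == State.RUNNING
                && incoming.state == State.UNASSIGNED
                && incoming.generation < existing.generation;
        if (stale) {
            return false;
        }
        current.put(incoming.taskId, incoming);
        return true;
    }

    // Returns the last accepted state of the task, or null if none was seen.
    State stateOf(String taskId) {
        Status status = current.get(taskId);
        return status == null ? null : status.state;
    }

    public static void main(String[] args) {
        StaleStatusFilterSketch reader = new StaleStatusFilterSketch();
        reader.apply(new Status("task-0", State.RUNNING, "worker-A", 1));
        reader.apply(new Status("task-0", State.RUNNING, "worker-B", 2));
        // Delayed record from the previous owner arrives after the new RUNNING record.
        boolean applied = reader.apply(new Status("task-0", State.UNASSIGNED, "worker-A", 1));
        System.out.println("stale UNASSIGNED applied: " + applied);
        System.out.println("reported state: " + reader.stateOf("task-0"));
    }
}
```

This sketch only covers delayed records from an earlier generation. A delayed UNASSIGNED record that carries the same generation as the RUNNING record, which the description also mentions, would still be applied and would need a different rule.
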