Guozhang Wang created KAFKA-10485:
-------------------------------------
Summary: Use a separate error code for replication related errors
Key: KAFKA-10485
URL: https://issues.apache.org/jira/browse/KAFKA-10485
Project: Kafka
Issue Type: Improvement
Reporter: Guozhang Wang
Today, when a coordinator request involves an append to the internal topic, e.g.
a commit / sync-group request sent to the group coordinator, we capture
the following errors and translate them to COORDINATOR_NOT_AVAILABLE before
returning to the client:
* UNKNOWN_TOPIC_OR_PARTITION
* NOT_ENOUGH_REPLICAS
* NOT_ENOUGH_REPLICAS_AFTER_APPEND
* REQUEST_TIMED_OUT (for txn coordinator)
Among those, the second and third cases are worth reconsidering, because
COORDINATOR_NOT_AVAILABLE causes clients to re-discover the coordinator
unnecessarily with a short backoff time. The fourth case is probably
also worth revisiting: although the motivation for using
COORDINATOR_NOT_AVAILABLE is to let the client retry, it still incurs
unnecessary coordinator re-discovery.
It would be better if, for 2)/3), clients did not re-discover the
coordinator but just retried with a longer backoff time, and at the same
time exposed this through either a metric or warning logs indicating that
some other broker, not the coordinator, is unavailable and is causing this
operation to be blocked. For 4), clients can just retry without re-discovery.
Only for 1) does it make sense to let clients re-discover the coordinator.
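
To make the proposed per-error handling concrete, here is a minimal sketch of the
mapping described above. It is illustrative only: the AppendError and ClientAction
enums and the proposedAction method are hypothetical names made up for this
example, not Kafka's actual Errors enum or client code, and the sketch leaves
open what the new error code(s) would be called.

```java
public class CoordinatorErrorSketch {

    // Hypothetical stand-in for the four append errors listed in this ticket.
    // This is NOT Kafka's real org.apache.kafka.common.protocol.Errors enum.
    enum AppendError {
        UNKNOWN_TOPIC_OR_PARTITION,       // case 1)
        NOT_ENOUGH_REPLICAS,              // case 2)
        NOT_ENOUGH_REPLICAS_AFTER_APPEND, // case 3)
        REQUEST_TIMED_OUT                 // case 4), txn coordinator
    }

    // Hypothetical client reactions, named only for this illustration.
    enum ClientAction {
        REDISCOVER_COORDINATOR,    // today's behavior for all four errors
        RETRY_WITH_LONGER_BACKOFF, // keep the coordinator, back off longer, warn / record a metric
        RETRY                      // keep the coordinator, plain retry
    }

    // Maps each append error to the client reaction proposed in the ticket.
    static ClientAction proposedAction(AppendError error) {
        switch (error) {
            case UNKNOWN_TOPIC_OR_PARTITION:
                // Only here does re-discovering the coordinator make sense.
                return ClientAction.REDISCOVER_COORDINATOR;
            case NOT_ENOUGH_REPLICAS:
            case NOT_ENOUGH_REPLICAS_AFTER_APPEND:
                // Some other broker, not the coordinator, is unavailable.
                return ClientAction.RETRY_WITH_LONGER_BACKOFF;
            case REQUEST_TIMED_OUT:
                return ClientAction.RETRY;
            default:
                throw new IllegalArgumentException("Unexpected error: " + error);
        }
    }

    public static void main(String[] args) {
        // Print the proposed reaction for every error in the list.
        for (AppendError error : AppendError.values()) {
            System.out.println(error + " -> " + proposedAction(error));
        }
    }
}
```

Whether the distinction is carried by one new error code or several is not
decided here; the sketch only shows that the client would need to tell these
three groups apart.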
--
This message was sent by Atlassian Jira
(v8.3.4#803005)