[
https://issues.apache.org/jira/browse/KAFKA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198859#comment-17198859
]
Guozhang Wang commented on KAFKA-10485:
---
[~chia7712] If the txn request's timeout value is used as the timeout for the
corresponding append operation, then when the append fails with a request
timeout it is okay to return the request timeout to the client as well.
That being said, what I'm actually thinking of is introducing a new error code
rather than sending back the original one, e.g. something like
COORDINATOR_CANNOT_COMPLETE_OPERATION, to let the client retry. Of course this
requires a protocol version bump, so that if the broker knows the client is on
an older version it still returns the old COORDINATOR_NOT_AVAILABLE.
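
A minimal sketch of the version-gated translation described above, in plain
Java with no Kafka dependency. The class, the enum, and
MIN_VERSION_WITH_NEW_CODE are hypothetical names for illustration only, and
COORDINATOR_CANNOT_COMPLETE_OPERATION is just the name floated in this comment,
not an existing error code. Which append errors would map to the new code is
also an assumption here.

```java
public class AppendErrorTranslator {

    // Hypothetical subset of error codes, declared locally for illustration.
    enum ErrorCode {
        UNKNOWN_TOPIC_OR_PARTITION,
        NOT_ENOUGH_REPLICAS,
        NOT_ENOUGH_REPLICAS_AFTER_APPEND,
        REQUEST_TIMED_OUT,
        COORDINATOR_NOT_AVAILABLE,
        COORDINATOR_CANNOT_COMPLETE_OPERATION // proposed name only
    }

    // Assumption: first request version whose clients understand the new code.
    static final short MIN_VERSION_WITH_NEW_CODE = 1;

    // Translate an append error into the error code returned to the client.
    static ErrorCode translate(ErrorCode appendError, short requestVersion) {
        switch (appendError) {
            case UNKNOWN_TOPIC_OR_PARTITION:
                // Re-discovery is the right reaction, so keep the old code.
                return ErrorCode.COORDINATOR_NOT_AVAILABLE;
            case NOT_ENOUGH_REPLICAS:
            case NOT_ENOUGH_REPLICAS_AFTER_APPEND:
            case REQUEST_TIMED_OUT:
                // Old clients do not know the new code, so fall back for them.
                return requestVersion >= MIN_VERSION_WITH_NEW_CODE
                        ? ErrorCode.COORDINATOR_CANNOT_COMPLETE_OPERATION
                        : ErrorCode.COORDINATOR_NOT_AVAILABLE;
            default:
                // Anything else is passed through unchanged.
                return appendError;
        }
    }
}
```

With this shape, an old client (version below the assumed threshold) sees the
same COORDINATOR_NOT_AVAILABLE as today, while a newer client gets the new
code and can retry without re-discovering the coordinator.
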
> Use a separate error code for replication related errors
>
>
> Key: KAFKA-10485
> URL: https://issues.apache.org/jira/browse/KAFKA-10485
> Project: Kafka
> Issue Type: Improvement
> Reporter: Guozhang Wang
> Priority: Major
>
> Today, when a coordinator request involves an append to the internal topic,
> e.g. a commit / sync-group request sent to the group coordinator, we
> capture the following errors and translate them into COORDINATOR_NOT_AVAILABLE
> before returning to the client:
> * UNKNOWN_TOPIC_OR_PARTITION
> * NOT_ENOUGH_REPLICAS
> * NOT_ENOUGH_REPLICAS_AFTER_APPEND
> * REQUEST_TIMED_OUT (for txn coordinator)
> Among those, the second and third cases are worth reconsidering, because
> COORDINATOR_NOT_AVAILABLE causes clients to re-discover the coordinator
> unnecessarily with a short backoff time. The fourth case is
> probably also worth revisiting: although the motivation for using
> COORDINATOR_NOT_AVAILABLE is to let the client retry, it still incurs
> unnecessary coordinator re-discovery.
> What would be better is that for 2)/3) clients would not re-discover the
> coordinator, but would just retry with a longer backoff time, and at the same
> time expose this through either a metric or warning logs indicating
> that some other brokers, not the coordinator, are unavailable and causing this
> operation to be blocked. For 4) clients can just retry without re-discovery.
> Only for 1) does it make sense to let the clients re-discover the coordinator.
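
The desired client reactions in the description above can be summarized as a
small lookup, sketched here in plain Java. The class, the enums, and the
action names are hypothetical and exist only for illustration; real clients
see the translated error code rather than the underlying append error, so this
captures the intent rather than an actual client code path.

```java
public class DesiredClientReaction {

    // The four append errors listed in the issue description.
    enum AppendError {
        UNKNOWN_TOPIC_OR_PARTITION,       // case 1
        NOT_ENOUGH_REPLICAS,              // case 2
        NOT_ENOUGH_REPLICAS_AFTER_APPEND, // case 3
        REQUEST_TIMED_OUT                 // case 4 (txn coordinator)
    }

    // Hypothetical names for the reactions the description asks for.
    enum Action {
        REDISCOVER_COORDINATOR,
        RETRY_WITH_LONGER_BACKOFF,
        RETRY_WITHOUT_REDISCOVERY
    }

    // Map each underlying append error to the reaction proposed for it.
    static Action desiredAction(AppendError error) {
        switch (error) {
            case UNKNOWN_TOPIC_OR_PARTITION:
                return Action.REDISCOVER_COORDINATOR;
            case NOT_ENOUGH_REPLICAS:
            case NOT_ENOUGH_REPLICAS_AFTER_APPEND:
                // Other brokers are unavailable; re-discovery would not help.
                return Action.RETRY_WITH_LONGER_BACKOFF;
            case REQUEST_TIMED_OUT:
                return Action.RETRY_WITHOUT_REDISCOVERY;
            default:
                throw new IllegalArgumentException("Unhandled error: " + error);
        }
    }
}
```
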