[ https://issues.apache.org/jira/browse/KAFKA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198859#comment-17198859 ]
Guozhang Wang commented on KAFKA-10485:
---------------------------------------

[~chia7712] If the txn's request timeout value is used as the timeout for the corresponding append operation, then if the append fails with a request timeout, it is okay to return the request timeout to the client as well.

That being said, what I have in mind is to introduce a new error code rather than send back the original one, e.g. something like COORDINATOR_CANNOT_COMPLETE_OPERATION, to let the client retry. Of course this requires a protocol bump, so that if the broker knows the client is on an older version, it would still return the old COORDINATOR_NOT_AVAILABLE.

> Use a separate error code for replication related errors
> --------------------------------------------------------
>
>                 Key: KAFKA-10485
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10485
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Guozhang Wang
>            Priority: Major
>
> Today when a coordinator request involves an append to the internal topic,
> e.g. a commit / sync-group request sent to the group coordinator, we
> capture the following errors and translate them to COORDINATOR_NOT_AVAILABLE
> before returning to the client:
> * UNKNOWN_TOPIC_OR_PARTITION
> * NOT_ENOUGH_REPLICAS
> * NOT_ENOUGH_REPLICAS_AFTER_APPEND
> * REQUEST_TIMED_OUT (for txn coordinator)
> Among those, the second and third cases are worth reconsidering, because a
> COORDINATOR_NOT_AVAILABLE causes the clients to re-discover the
> coordinator unnecessarily, with a short backoff time. The fourth case is
> probably also worth revisiting: although the motivation for using
> COORDINATOR_NOT_AVAILABLE is to let the client retry, it still incurs
> unnecessary coordinator re-discovery.
> What would be better is that for 2)/3) clients would not re-discover the
> coordinator, but would just retry with a longer backoff time, and at the same
> time expose this, either through a metric or through warning logs, indicating
> that some other broker, not the coordinator, is unavailable and is causing this
> operation to be blocked. For 4) clients can just retry without re-discovery.
> Only for 1) does it make sense to let the clients re-discover the coordinator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
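
For illustration only, the error translation discussed in this thread could be sketched roughly as below. This is a sketch under assumptions, not Kafka's actual code: the Code and ClientAction enums, the class name, and the clientSupportsNewCode flag are all hypothetical stand-ins, and COORDINATOR_CANNOT_COMPLETE_OPERATION is only the name floated in the comment, not an existing error code. It also assumes that cases 2), 3) and 4) all map to the one new code, which the thread does not settle.

```java
// Hypothetical sketch; none of these types are Kafka's real classes.
final class CoordinatorErrorMapping {

    // Local stand-in for the error codes named in this thread.
    enum Code {
        NONE,
        UNKNOWN_TOPIC_OR_PARTITION,
        NOT_ENOUGH_REPLICAS,
        NOT_ENOUGH_REPLICAS_AFTER_APPEND,
        REQUEST_TIMED_OUT,
        COORDINATOR_NOT_AVAILABLE,
        // Name proposed in the comment; assumed here, not an existing code.
        COORDINATOR_CANNOT_COMPLETE_OPERATION
    }

    // Hypothetical client reactions to a coordinator response.
    enum ClientAction { PROCEED, REDISCOVER_COORDINATOR, RETRY_SAME_COORDINATOR, FAIL }

    private CoordinatorErrorMapping() { }

    // Broker side: translate an append error into the code sent to the client.
    // clientSupportsNewCode models the protocol bump: old clients still get
    // COORDINATOR_NOT_AVAILABLE.
    static Code translateAppendError(Code appendError, boolean clientSupportsNewCode) {
        switch (appendError) {
            case UNKNOWN_TOPIC_OR_PARTITION:
                // Case 1): re-discovering the coordinator makes sense.
                return Code.COORDINATOR_NOT_AVAILABLE;
            case NOT_ENOUGH_REPLICAS:
            case NOT_ENOUGH_REPLICAS_AFTER_APPEND:
            case REQUEST_TIMED_OUT:
                // Cases 2)-4): the coordinator is fine, so avoid re-discovery
                // when the client understands the new code.
                return clientSupportsNewCode
                        ? Code.COORDINATOR_CANNOT_COMPLETE_OPERATION
                        : Code.COORDINATOR_NOT_AVAILABLE;
            default:
                // Anything else is passed through unchanged.
                return appendError;
        }
    }

    // Client side: decide how to react to the code in the response.
    static ClientAction clientActionFor(Code responseCode) {
        switch (responseCode) {
            case NONE:
                return ClientAction.PROCEED;
            case COORDINATOR_NOT_AVAILABLE:
                return ClientAction.REDISCOVER_COORDINATOR;
            case COORDINATOR_CANNOT_COMPLETE_OPERATION:
                return ClientAction.RETRY_SAME_COORDINATOR;
            default:
                return ClientAction.FAIL;
        }
    }
}
```

With a single new code the client cannot tell 2)/3) apart from 4), so the longer backoff suggested for 2)/3) would need either a second code or a backoff hint; the sketch leaves that open.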