[jira] [Commented] (KAFKA-10485) Use a separate error code for replication related errors

Guozhang Wang (Jira) Sat, 19 Sep 2020 19:16:00 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198859#comment-17198859
 ]


Guozhang Wang commented on KAFKA-10485:
---------------------------------------

[~chia7712] If the txn's request timeout value is used as the timeout for the 
corresponding append operation, then when the append fails with a request 
timeout it is okay to return a request timeout to the client as well.

That being said, what I'm actually thinking of is introducing a new error code 
rather than sending back the original one, e.g. something like 
COORDINATOR_CANNOT_COMPLETE_OPERATION, to let the client retry. Of course this 
requires a protocol bump, so that if the broker knows the client is on an older 
version, it still returns the old COORDINATOR_NOT_AVAILABLE.
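
To make the version gating concrete, here is a rough sketch of the broker-side 
translation. It is not the actual coordinator code: the function name, the 
MIN_VERSION_WITH_NEW_CODE threshold, and the choice of which append errors map 
to the new code are all assumptions for illustration, and 
COORDINATOR_CANNOT_COMPLETE_OPERATION is only the name proposed above.

```python
from enum import Enum


class Error(Enum):
    UNKNOWN_TOPIC_OR_PARTITION = "UNKNOWN_TOPIC_OR_PARTITION"
    NOT_ENOUGH_REPLICAS = "NOT_ENOUGH_REPLICAS"
    NOT_ENOUGH_REPLICAS_AFTER_APPEND = "NOT_ENOUGH_REPLICAS_AFTER_APPEND"
    REQUEST_TIMED_OUT = "REQUEST_TIMED_OUT"
    COORDINATOR_NOT_AVAILABLE = "COORDINATOR_NOT_AVAILABLE"
    # Proposed name only; not an existing error code at the time of writing.
    COORDINATOR_CANNOT_COMPLETE_OPERATION = "COORDINATOR_CANNOT_COMPLETE_OPERATION"


# Hypothetical: first request version whose clients understand the new code.
MIN_VERSION_WITH_NEW_CODE = 1

# Assumption: these append failures are the ones that would get the new code.
REPLICATION_ERRORS = frozenset({
    Error.NOT_ENOUGH_REPLICAS,
    Error.NOT_ENOUGH_REPLICAS_AFTER_APPEND,
    Error.REQUEST_TIMED_OUT,
})


def translate_append_error(append_error: Error, request_version: int) -> Error:
    """Map an internal-topic append error to the code returned to the client."""
    if append_error is Error.UNKNOWN_TOPIC_OR_PARTITION:
        # Re-discovery makes sense here, so keep the existing translation.
        return Error.COORDINATOR_NOT_AVAILABLE
    if append_error in REPLICATION_ERRORS:
        if request_version >= MIN_VERSION_WITH_NEW_CODE:
            return Error.COORDINATOR_CANNOT_COMPLETE_OPERATION
        # Old clients do not know the new code; fall back to the old one.
        return Error.COORDINATOR_NOT_AVAILABLE
    # Anything else is passed through unchanged in this sketch.
    return append_error
```

With this shape, an old client keeps seeing COORDINATOR_NOT_AVAILABLE for every 
case, and only clients on the bumped version see the new code.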

> Use a separate error code for replication related errors
> --------------------------------------------------------
>
>                 Key: KAFKA-10485
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10485
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Guozhang Wang
>            Priority: Major
>
> Today when a coordinator request involves an append to the internal topic, 
> e.g. a commit / sync-group request sent to the group coordinator, we 
> capture the following errors and translate them to COORDINATOR_NOT_AVAILABLE 
> before returning to the client:
> * UNKNOWN_TOPIC_OR_PARTITION
> * NOT_ENOUGH_REPLICAS
> * NOT_ENOUGH_REPLICAS_AFTER_APPEND
> * REQUEST_TIMED_OUT (for txn coordinator)
> Among those, the second and third cases are worth reconsidering, because a 
> COORDINATOR_NOT_AVAILABLE causes clients to re-discover the coordinator 
> unnecessarily, with a short backoff time. The fourth case is probably also 
> worth revisiting: although the motivation for using 
> COORDINATOR_NOT_AVAILABLE is to let the client retry, it still incurs an 
> unnecessary coordinator re-discovery.
> It would be better if, for 2)/3), clients did not re-discover the 
> coordinator but just retried with a longer backoff time, and at the same 
> time exposed this through either a metric or warning logs indicating 
> that some other brokers, not the coordinator, are unavailable and are 
> causing this operation to be blocked. For 4), clients can just retry without 
> re-discovery. Only for 1) does it make sense to let the clients re-discover 
> the coordinator.
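
The per-case client behavior described above could be sketched roughly as 
follows. This is an illustration only, not the Kafka client code: plan_retry, 
RetryPlan, and both backoff values are hypothetical, the enum values are the 
case numbers from the list above rather than wire error codes, and using the 
short backoff for case 4 is an assumption, since the description only says to 
retry without re-discovery.

```python
from enum import Enum
from typing import NamedTuple


class AppendError(Enum):
    # Values are the case numbers 1) to 4) used in the description.
    UNKNOWN_TOPIC_OR_PARTITION = 1
    NOT_ENOUGH_REPLICAS = 2
    NOT_ENOUGH_REPLICAS_AFTER_APPEND = 3
    REQUEST_TIMED_OUT = 4


class RetryPlan(NamedTuple):
    rediscover_coordinator: bool
    backoff_ms: int
    warn_other_brokers_unavailable: bool


SHORT_BACKOFF_MS = 100   # hypothetical value
LONG_BACKOFF_MS = 1000   # hypothetical value


def plan_retry(error: AppendError) -> RetryPlan:
    """Decide how a client should retry for each underlying append error."""
    if error is AppendError.UNKNOWN_TOPIC_OR_PARTITION:
        # Case 1: the coordinator may have moved, so re-discover it.
        return RetryPlan(True, SHORT_BACKOFF_MS, False)
    if error in (AppendError.NOT_ENOUGH_REPLICAS,
                 AppendError.NOT_ENOUGH_REPLICAS_AFTER_APPEND):
        # Cases 2/3: the coordinator is fine but other brokers are not;
        # keep the coordinator, back off longer, and surface a warning.
        return RetryPlan(False, LONG_BACKOFF_MS, True)
    if error is AppendError.REQUEST_TIMED_OUT:
        # Case 4: plain retry without re-discovery (backoff is an assumption).
        return RetryPlan(False, SHORT_BACKOFF_MS, False)
    raise ValueError(f"unhandled error: {error}")
```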



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
