[jira] [Commented] (KAFKA-10485) Use a separate error code for replication related errors

2020-09-19 Thread Guozhang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17198859#comment-17198859
 ] 

Guozhang Wang commented on KAFKA-10485:
---

[~chia7712] If the txn's request timeout value is also used as the timeout for 
the corresponding append operation, then when the append fails with 
REQUEST_TIMED_OUT it is okay to return REQUEST_TIMED_OUT to the client as well.

That being said, what I'm actually thinking of is introducing a new error code 
rather than sending back the original error code, e.g. something like 
COORDINATOR_CANNOT_COMPLETE_OPERATION, to let the client retry. Of course this 
requires a protocol version bump, so that if the broker knows the client is on 
an older version, it would still return the old COORDINATOR_NOT_AVAILABLE.
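
A minimal sketch of the version gating described above, written in Python for brevity rather than as broker code. COORDINATOR_CANNOT_COMPLETE_OPERATION is only the name floated in this comment, and MIN_VERSION_WITH_NEW_ERROR and append_failure_error are hypothetical names chosen for illustration; the actual version cutoff would be whatever the protocol bump defines.

```python
from enum import Enum


class Errors(Enum):
    COORDINATOR_NOT_AVAILABLE = "COORDINATOR_NOT_AVAILABLE"
    # Proposed in this thread only; not an existing Kafka error code.
    COORDINATOR_CANNOT_COMPLETE_OPERATION = "COORDINATOR_CANNOT_COMPLETE_OPERATION"


# Hypothetical: first request version whose clients understand the new code.
MIN_VERSION_WITH_NEW_ERROR = 5


def append_failure_error(request_version: int) -> Errors:
    """Pick the error to return when the coordinator's append fails.

    Clients on a version before the bump would not recognise the new code,
    so they keep receiving the old COORDINATOR_NOT_AVAILABLE.
    """
    if request_version >= MIN_VERSION_WITH_NEW_ERROR:
        return Errors.COORDINATOR_CANNOT_COMPLETE_OPERATION
    return Errors.COORDINATOR_NOT_AVAILABLE
```

With this shape, only the response-building path needs to know the request version; the append logic itself stays unchanged.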

> Use a separate error code for replication related errors
> 
>
> Key: KAFKA-10485
> URL: https://issues.apache.org/jira/browse/KAFKA-10485
> Project: Kafka
> Issue Type: Improvement
> Reporter: Guozhang Wang
> Priority: Major
>
> Today, when a coordinator request involves an append to the internal topic, 
> e.g. a commit / sync-group request sent to the group coordinator, we capture 
> the following errors and translate them into COORDINATOR_NOT_AVAILABLE 
> before returning to the client:
> * UNKNOWN_TOPIC_OR_PARTITION
> * NOT_ENOUGH_REPLICAS
> * NOT_ENOUGH_REPLICAS_AFTER_APPEND
> * REQUEST_TIMED_OUT (for txn coordinator)
> Among those, the second and third cases are worth reconsidering, because 
> COORDINATOR_NOT_AVAILABLE causes clients to re-discover the coordinator 
> unnecessarily, with a short backoff time. The fourth case is probably also 
> worth revisiting: although the motivation for using 
> COORDINATOR_NOT_AVAILABLE is to let the client retry, it still incurs 
> unnecessary coordinator re-discovery.
> What would be better is that for 2)/3) clients do not re-discover the 
> coordinator but just retry with a longer backoff time, and at the same time 
> expose this, either through a metric or through warning logs, to indicate 
> that some other broker, not the coordinator, is unavailable and is causing 
> this operation to be blocked. For 4) clients can just retry without 
> re-discovery. Only for 1) does it make sense to let the clients re-discover 
> the coordinator.
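
The current translation and the proposed per-case client behaviour in the description above can be summarised as a small lookup table. This is a sketch in Python for illustration only: RetryPlan, PROPOSED_CLIENT_REACTION and error_returned_today are hypothetical names, and the table merely restates cases 1) to 4) from the description rather than any existing client logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPlan:
    rediscover_coordinator: bool  # look up the coordinator again before retrying
    longer_backoff: bool          # wait longer than the usual short backoff
    warn_replication: bool        # surface "another broker is unavailable" via metric/log


# Append errors that the coordinator translates today.
TRANSLATED_TODAY = {
    "UNKNOWN_TOPIC_OR_PARTITION",
    "NOT_ENOUGH_REPLICAS",
    "NOT_ENOUGH_REPLICAS_AFTER_APPEND",
    "REQUEST_TIMED_OUT",  # txn coordinator only
}


def error_returned_today(append_error: str) -> str:
    """Current behaviour: all four cases collapse into one client-facing error."""
    if append_error in TRANSLATED_TODAY:
        return "COORDINATOR_NOT_AVAILABLE"
    return append_error


# Proposed client reaction per underlying cause, cases 1) to 4) in the description.
PROPOSED_CLIENT_REACTION = {
    "UNKNOWN_TOPIC_OR_PARTITION": RetryPlan(True, False, False),
    "NOT_ENOUGH_REPLICAS": RetryPlan(False, True, True),
    "NOT_ENOUGH_REPLICAS_AFTER_APPEND": RetryPlan(False, True, True),
    "REQUEST_TIMED_OUT": RetryPlan(False, False, False),
}
```

The table makes the problem visible: today the client cannot tell the four causes apart, so it cannot choose between these retry plans.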



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-10485) Use a separate error code for replication related errors

2020-09-16 Thread Chia-Ping Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197368#comment-17197368
 ] 

Chia-Ping Tsai commented on KAFKA-10485:


REQUEST_TIMED_OUT is treated as a fatal error by TransactionManager (except in 
TxnOffsetCommitHandler). Would returning REQUEST_TIMED_OUT to the client cause 
compatibility problems?



