[ 
https://issues.apache.org/jira/browse/KAFKA-16386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Quah updated KAFKA-16386:
------------------------------
    Description: 
KAFKA-14402 
([KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense])
 adds verification with the transaction coordinator on Produce and 
TxnOffsetCommit paths as a defense against hanging transactions. For 
compatibility with older clients, retriable errors from the verification step 
are translated to ones already expected and handled by existing clients. When 
verification was added, we forgot to translate {{NETWORK_EXCEPTION}} s.
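
As a rough illustration of the translation described above, here is a minimal sketch. It is not the broker code: the {{Error}} enum, the class and method names, and the target errors ({{COORDINATOR_LOAD_IN_PROGRESS}} for TxnOffsetCommit, {{NOT_ENOUGH_REPLICAS}} for Produce) are assumptions made for the example, since this issue does not name the errors that should be returned.

```java
import java.util.EnumSet;
import java.util.Set;

public class VerificationErrorTranslator {
    // Hypothetical stand-in for Kafka's error codes; only the names needed here.
    public enum Error {
        NONE,
        NETWORK_EXCEPTION,
        CONCURRENT_TRANSACTIONS,
        COORDINATOR_LOAD_IN_PROGRESS,
        NOT_ENOUGH_REPLICAS,
        INVALID_TXN_STATE
    }

    // Assumed set of retriable errors that the verification step can surface.
    private static final Set<Error> RETRIABLE_VERIFICATION_ERRORS =
        EnumSet.of(Error.NETWORK_EXCEPTION, Error.CONCURRENT_TRANSACTIONS);

    // TxnOffsetCommit path: map retriable verification errors to an error that
    // older clients are assumed to retry on; pass everything else through.
    public static Error translateForTxnOffsetCommit(Error error) {
        return RETRIABLE_VERIFICATION_ERRORS.contains(error)
            ? Error.COORDINATOR_LOAD_IN_PROGRESS
            : error;
    }

    // Produce path: same idea, with a different (assumed) retriable target.
    public static Error translateForProduce(Error error) {
        return RETRIABLE_VERIFICATION_ERRORS.contains(error)
            ? Error.NOT_ENOUGH_REPLICAS
            : error;
    }

    public static void main(String[] args) {
        System.out.println(translateForTxnOffsetCommit(Error.NETWORK_EXCEPTION));
        System.out.println(translateForProduce(Error.NETWORK_EXCEPTION));
        System.out.println(translateForProduce(Error.INVALID_TXN_STATE));
    }
}
```

The point of the sketch is that a missing entry in the retriable set lets {{NETWORK_EXCEPTION}} pass through untranslated, which is the gap this issue describes.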

[~dajac] noticed this manifesting as a test failure when 
tests/kafkatest/tests/core/transactions_test.py was run with an older client 
(prior to the fix for KAFKA-16122):
{quote}
{{NETWORK_EXCEPTION}} is indeed returned as a partition error. The 
{{TransactionManager.TxnOffsetCommitHandler}} considers it as a fatal error so 
it transitions to the fatal state.
It seems that there are two cases where the server could return it: (1) when 
the verification request times out or its connection is cut; or (2) in 
{{AddPartitionsToTxnManager.addTxnData}} where we say that we use it because we 
want a retriable error.
{quote}

The first case was triggered as part of the test. The second case happens when 
there is already a verification request ({{AddPartitionsToTxn}}) in flight with 
the same epoch and we want clients to try again when we're not busy.
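
The second case can be pictured with the simplified sketch below. It only models the "same epoch already in flight" check described above; the class name, the map keyed by transactional ID, and the {{Result}} enum are hypothetical, and the real {{AddPartitionsToTxnManager.addTxnData}} handles more situations than this.

```java
import java.util.HashMap;
import java.util.Map;

public class InFlightVerificationSketch {
    // Hypothetical outcome type for the example.
    public enum Result { ENQUEUED, NETWORK_EXCEPTION }

    // Producer epoch of the verification request currently in flight,
    // keyed by transactional ID (an assumption made for this sketch).
    private final Map<String, Short> inFlightEpochs = new HashMap<>();

    // Reject a new verification with a retriable error when one with the
    // same epoch is already in flight, so that the client tries again later.
    public Result addTxnData(String transactionalId, short producerEpoch) {
        Short inFlight = inFlightEpochs.get(transactionalId);
        if (inFlight != null && inFlight == producerEpoch) {
            return Result.NETWORK_EXCEPTION;
        }
        inFlightEpochs.put(transactionalId, producerEpoch);
        return Result.ENQUEUED;
    }

    // Called when the in-flight verification request finishes.
    public void complete(String transactionalId) {
        inFlightEpochs.remove(transactionalId);
    }

    public static void main(String[] args) {
        InFlightVerificationSketch manager = new InFlightVerificationSketch();
        System.out.println(manager.addTxnData("txn-1", (short) 3));
        System.out.println(manager.addTxnData("txn-1", (short) 3));
        manager.complete("txn-1");
        System.out.println(manager.addTxnData("txn-1", (short) 3));
    }
}
```

Because the error is chosen only for being retriable, it still has to be translated before it reaches an older client that does not expect it on this path.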

  was:
KAFKA-14402 
([KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense])
 adds verification with the transaction coordinator on Produce and 
TxnOffsetCommit paths as a defense against hanging transactions. For 
compatibility with older clients, retriable errors from the verification step 
are translated to ones already expected and handled by existing clients. When 
verification was added, we forgot to translate {{NETWORK_EXCEPTION}} s.

[~dajac] noticed this manifesting as a test failure when 
tests/kafkatest/tests/core/transactions_test.py was run with an older client 
(pre KAFKA-16122):
{quote}
{{NETWORK_EXCEPTION}} is indeed returned as a partition error. The 
{{TransactionManager.TxnOffsetCommitHandler}} considers it as a fatal error so 
it transitions to the fatal state.
It seems that there are two cases where the server could return it: (1) when 
the verification request times out or its connection is cut; or (2) in 
{{AddPartitionsToTxnManager.addTxnData}} where we say that we use it because we 
want a retriable error.
{quote}

The first case was triggered as part of the test. The second case happens when 
there is already a verification request ({{AddPartitionsToTxn}}) in flight with 
the same epoch and we want clients to try again when we're not busy.


> NETWORK_EXCEPTIONs from transaction verification are not translated
> -------------------------------------------------------------------
>
>                 Key: KAFKA-16386
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16386
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.6.0
>            Reporter: Sean Quah
>            Priority: Minor
>
> KAFKA-14402 
> ([KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense])
>  adds verification with the transaction coordinator on Produce and 
> TxnOffsetCommit paths as a defense against hanging transactions. For 
> compatibility with older clients, retriable errors from the verification step 
> are translated to ones already expected and handled by existing clients. When 
> verification was added, we forgot to translate {{NETWORK_EXCEPTION}} s.
> [~dajac] noticed this manifesting as a test failure when 
> tests/kafkatest/tests/core/transactions_test.py was run with an older client 
> (prior to the fix for KAFKA-16122):
> {quote}
> {{NETWORK_EXCEPTION}} is indeed returned as a partition error. The 
> {{TransactionManager.TxnOffsetCommitHandler}} considers it as a fatal error 
> so it transitions to the fatal state.
> It seems that there are two cases where the server could return it: (1) when 
> the verification request times out or its connection is cut; or (2) in 
> {{AddPartitionsToTxnManager.addTxnData}} where we say that we use it because 
> we want a retriable error.
> {quote}
> The first case was triggered as part of the test. The second case happens 
> when there is already a verification request ({{AddPartitionsToTxn}}) in 
> flight with the same epoch and we want clients to try again when we're not 
> busy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
