[
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846763#comment-17846763
]
Justine Olshan commented on KAFKA-16692:
----------------------------------------
Hey [~akaltsikis] yeah, those are the release notes for 3.6. We can probably
edit the kafka-site repo to get the change to show up in real time, but
updating via the kafka repo requires waiting for the next release for the site
to update.
> InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not
> enabled when upgrading from kafka 3.5 to 3.6
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-16692
> URL: https://issues.apache.org/jira/browse/KAFKA-16692
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 3.7.0, 3.6.1, 3.8
> Reporter: Johnson Okorie
> Assignee: Justine Olshan
> Priority: Major
>
> We have a kafka cluster running on version 3.5.2 that we are upgrading to
> 3.6.1. This cluster has a lot of clients with exactly-once semantics enabled
> that therefore create transactions. As we replaced brokers with the new
> binaries, we observed lots of clients in the cluster experiencing the
> following error:
> {code:java}
> 2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=<client>,
> transactionalId=<transactionalId>] Got error produce response with
> correlation id 6402937 on topic-partition <topic-partition>, retrying
> (2147483512 attempts left). Error: NETWORK_EXCEPTION. Error Message: The
> server disconnected before a response was received.{code}
> On inspecting the broker, we saw the following errors on brokers still
> running Kafka version 3.5.2:
>
> {code:java}
> message:
> Closing socket for <ChannelId> because of error
> exception_exception_class:
> org.apache.kafka.common.errors.InvalidRequestException
> exception_exception_message:
> Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not
> enabled
> exception_stacktrace:
> org.apache.kafka.common.errors.InvalidRequestException: Received request api
> key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
> {code}
> On the new brokers running 3.6.1 we saw the following errors:
>
> {code:java}
> [AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for
> node 1043 with a network exception.{code}
>
> I can also see this:
> {code:java}
> [AddPartitionsToTxnManager broker=1055]Cancelled in-flight
> ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043
> being disconnected (elapsed time since creation: 11ms, elapsed time since
> send: 4ms, request timeout: 30000ms){code}
> We started investigating this issue and, digging through the changes in 3.6,
> came across some changes introduced as part of KAFKA-14402 that we thought
> might lead to this behaviour.
> First we could see that _transaction.partition.verification.enable_ is
> enabled by default and enables a new code path that culminates in us sending
> version 4 ADD_PARTITIONS_TO_TXN requests to other brokers; these are generated
> [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].
> From a
> [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x]
> on the mailing list, [~jolshan] pointed out that this scenario shouldn't be
> possible as the following code paths should prevent version 4
> ADD_PARTITIONS_TO_TXN requests being sent to other brokers:
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
>
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]
> However, these requests are still sent to other brokers in our environment.
> On further inspection of the code, I am wondering if the following code path
> could lead to this issue:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]
> In this scenario, we don't have any _NodeApiVersions_ available for the
> specified nodeId and potentially skip the _latestUsableVersion_ check. I
> am wondering if it is possible that because _discoverBrokerVersions_ is set
> to _false_ for the network client of the {_}AddPartitionsToTxnManager{_}, it
> skips fetching {_}NodeApiVersions{_}? I can see that we create the network
> client here:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]
> The _NetworkUtils.buildNetworkClient_ method seems to create a network client
> that has _discoverBrokerVersions_ set to {_}false{_}.
> I was hoping I could get some assistance debugging this issue. Happy to
> provide any additional information needed.
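
For illustration only, the hypothesis in the quoted description can be modelled
with a small self-contained sketch. It does not use or reproduce Kafka's
classes: VersionSelectionSketch, recordDiscoveredVersion and chooseVersion are
hypothetical names, and the assumption that the 3.5.2 broker's highest enabled
ADD_PARTITIONS_TO_TXN version is 3 is inferred from the "version 4 which is not
enabled" error rather than stated in the report. Node id 1043 is taken from the
logs above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of request version selection.
// This is NOT Kafka code; it only mirrors the behaviour hypothesised above.
final class VersionSelectionSketch {
    // Highest request version each node is known to support.
    // Stays empty if broker version discovery never runs.
    private final Map<Integer, Short> discoveredMaxVersion = new HashMap<>();

    // Stands in for the result of an API versions exchange with a node.
    void recordDiscoveredVersion(int nodeId, short maxVersion) {
        discoveredMaxVersion.put(nodeId, maxVersion);
    }

    // Picks the version to send to a node.
    short chooseVersion(int nodeId, short latestAllowedByBuilder) {
        Short known = discoveredMaxVersion.get(nodeId);
        if (known == null) {
            // Nothing is known about the node: no negotiation is possible,
            // so the sender's own latest version is used as-is.
            return latestAllowedByBuilder;
        }
        short brokerMax = known;
        // Versions are known: cap the request at what the node supports.
        return (short) Math.min(brokerMax, latestAllowedByBuilder);
    }

    public static void main(String[] args) {
        short senderLatest = 4;  // version the upgraded broker would like to send
        short oldBrokerMax = 3;  // ASSUMPTION: highest version enabled on 3.5.2
        int oldBrokerId = 1043;  // node id from the logs in the description

        VersionSelectionSketch withoutDiscovery = new VersionSelectionSketch();
        System.out.println("without discovery: v"
                + withoutDiscovery.chooseVersion(oldBrokerId, senderLatest));

        VersionSelectionSketch withDiscovery = new VersionSelectionSketch();
        withDiscovery.recordDiscoveredVersion(oldBrokerId, oldBrokerMax);
        System.out.println("with discovery: v"
                + withDiscovery.chooseVersion(oldBrokerId, senderLatest));
    }
}
```

Under these assumptions, the sketch picks version 4 when nothing has been
discovered about the node and version 3 once the node's supported version is
known, which matches the behaviour the reporter suspects.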