[jira] [Created] (KAFKA-15654) Address Transactions Errors
Justine Olshan created KAFKA-15654: -- Summary: Address Transactions Errors Key: KAFKA-15654 URL: https://issues.apache.org/jira/browse/KAFKA-15654 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan In addition to the work in KIP-691, I propose we clean up and improve transactional error handling. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15655) Consider making transactional apis more compatible with topic IDs
Justine Olshan created KAFKA-15655: -- Summary: Consider making transactional apis more compatible with topic IDs Key: KAFKA-15655 URL: https://issues.apache.org/jira/browse/KAFKA-15655 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan Some ideas include adding topic ID to AddPartitions and other topic-partition-specific APIs, and adding topic ID as a tagged field in the transactional state logs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15784) Ensure atomicity of in memory update and write when transactionally committing offsets
Justine Olshan created KAFKA-15784: -- Summary: Ensure atomicity of in memory update and write when transactionally committing offsets Key: KAFKA-15784 URL: https://issues.apache.org/jira/browse/KAFKA-15784 Project: Kafka Issue Type: Sub-task Affects Versions: 3.7.0 Reporter: Justine Olshan Assignee: Justine Olshan [https://github.com/apache/kafka/pull/14370] (KAFKA-15449) removed the locking around validating, updating state, and writing transactional offset commits to the log. (The verification causes us to release the lock.) This was discovered in the discussion of [https://github.com/apache/kafka/pull/14629] (KAFKA-15653). Since KAFKA-15653 is needed for 3.5.1, it makes sense to address the locking issue separately with this ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010)
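The race described above is easier to see in miniature. Below is a minimal sketch (illustrative names, not the actual group coordinator code) of the pattern the ticket wants restored: validation, the in-memory state update, and the log write all happen under one lock, so no other thread can interleave between the validation and the write.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: validate, update in-memory state, and append to the
// log under a single lock. Releasing the lock between validation and the
// write (as the verification path did) lets another thread interleave.
class OffsetCommitState {
    private final ReentrantLock lock = new ReentrantLock();
    private long committedOffset = -1L;

    // Returns true if the commit was applied; false if validation failed.
    boolean commitAtomically(long expectedCurrent, long newOffset) {
        lock.lock();
        try {
            if (committedOffset != expectedCurrent) {
                return false; // validation failed; another commit won the race
            }
            committedOffset = newOffset;   // in-memory update
            appendToLog(newOffset);        // log write, still under the lock
            return true;
        } finally {
            lock.unlock();
        }
    }

    private void appendToLog(long offset) {
        // stand-in for the real log append
    }

    long committedOffset() {
        return committedOffset;
    }
}
```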
[jira] [Created] (KAFKA-15797) Flaky test EosV2UpgradeIntegrationTest.shouldUpgradeFromEosAlphaToEosV2[true]
Justine Olshan created KAFKA-15797: -- Summary: Flaky test EosV2UpgradeIntegrationTest.shouldUpgradeFromEosAlphaToEosV2[true] Key: KAFKA-15797 URL: https://issues.apache.org/jira/browse/KAFKA-15797 Project: Kafka Issue Type: Bug Reporter: Justine Olshan I found two recent failures: [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14629/22/testReport/junit/org.apache.kafka.streams.integration/EosV2UpgradeIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldUpgradeFromEosAlphaToEosV2_true_/] [https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/EosV2UpgradeIntegrationTest/Build___JDK_21_and_Scala_2_13___shouldUpgradeFromEosAlphaToEosV2_true__2/] Output generally looks like: {code:java} java.lang.AssertionError: Did not receive all 138 records from topic multiPartitionOutputTopic within 6 ms, currently accumulated data is [KeyValue(0, 0), KeyValue(0, 1), KeyValue(0, 3), KeyValue(0, 6), KeyValue(0, 10), KeyValue(0, 15), KeyValue(0, 21), KeyValue(0, 28), KeyValue(0, 36), KeyValue(0, 45), KeyValue(0, 55), KeyValue(0, 66), KeyValue(0, 78), KeyValue(0, 91), KeyValue(0, 55), KeyValue(0, 66), KeyValue(0, 78), KeyValue(0, 91), KeyValue(0, 105), KeyValue(0, 120), KeyValue(0, 136), KeyValue(0, 153), KeyValue(0, 171), KeyValue(0, 190), KeyValue(3, 0), KeyValue(3, 1), KeyValue(3, 3), KeyValue(3, 6), KeyValue(3, 10), KeyValue(3, 15), KeyValue(3, 21), KeyValue(3, 28), KeyValue(3, 36), KeyValue(3, 45), KeyValue(3, 55), KeyValue(3, 66), KeyValue(3, 78), KeyValue(3, 91), KeyValue(3, 105), KeyValue(3, 120), KeyValue(3, 136), KeyValue(3, 153), KeyValue(3, 171), KeyValue(3, 190), KeyValue(3, 190), KeyValue(3, 210), KeyValue(3, 231), KeyValue(3, 253), KeyValue(3, 276), KeyValue(3, 300), KeyValue(3, 325), KeyValue(3, 351), KeyValue(3, 378), KeyValue(3, 406), KeyValue(3, 435), KeyValue(1, 0), KeyValue(1, 1), KeyValue(1, 3), KeyValue(1, 6), KeyValue(1, 10), KeyValue(1, 15), KeyValue(1, 21), KeyValue(1, 28), 
KeyValue(1, 36), KeyValue(1, 45), KeyValue(1, 55), KeyValue(1, 66), KeyValue(1, 78), KeyValue(1, 91), KeyValue(1, 105), KeyValue(1, 120), KeyValue(1, 136), KeyValue(1, 153), KeyValue(1, 171), KeyValue(1, 190), KeyValue(1, 120), KeyValue(1, 136), KeyValue(1, 153), KeyValue(1, 171), KeyValue(1, 190), KeyValue(1, 210), KeyValue(1, 231), KeyValue(1, 253), KeyValue(1, 276), KeyValue(1, 300), KeyValue(1, 325), KeyValue(1, 351), KeyValue(1, 378), KeyValue(1, 406), KeyValue(1, 435), KeyValue(2, 0), KeyValue(2, 1), KeyValue(2, 3), KeyValue(2, 6), KeyValue(2, 10), KeyValue(2, 15), KeyValue(2, 21), KeyValue(2, 28), KeyValue(2, 36), KeyValue(2, 45), KeyValue(2, 55), KeyValue(2, 66), KeyValue(2, 78), KeyValue(2, 91), KeyValue(2, 105), KeyValue(2, 55), KeyValue(2, 66), KeyValue(2, 78), KeyValue(2, 91), KeyValue(2, 105), KeyValue(2, 120), KeyValue(2, 136), KeyValue(2, 153), KeyValue(2, 171), KeyValue(2, 190), KeyValue(2, 210), KeyValue(2, 231), KeyValue(2, 253), KeyValue(2, 276), KeyValue(2, 300), KeyValue(2, 325), KeyValue(2, 351), KeyValue(2, 378), KeyValue(2, 406), KeyValue(0, 210), KeyValue(0, 231), KeyValue(0, 253), KeyValue(0, 276), KeyValue(0, 300), KeyValue(0, 325), KeyValue(0, 351), KeyValue(0, 378), KeyValue(0, 406), KeyValue(0, 435)] Expected: is a value equal to or greater than <138> but: <134> was less than <138>{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15798) Flaky Test NamedTopologyIntegrationTest.shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology()
Justine Olshan created KAFKA-15798: -- Summary: Flaky Test NamedTopologyIntegrationTest.shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology() Key: KAFKA-15798 URL: https://issues.apache.org/jira/browse/KAFKA-15798 Project: Kafka Issue Type: Bug Reporter: Justine Olshan I saw a few examples recently. 2 have the same error, but the third is different [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14629/22/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology___2/] [https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_21_and_Scala_2_13___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology__/] The failure is like {code:java} java.lang.AssertionError: Did not receive all 5 records from topic output-stream-1 within 6 ms, currently accumulated data is [] Expected: is a value equal to or greater than <5> but: <0> was less than <5>{code} The other failure was [https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology__/] {code:java} java.lang.AssertionError: Expected: <[0, 1]> but: was <[0]>{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15758) Always schedule wrapped callbacks
Justine Olshan created KAFKA-15758: -- Summary: Always schedule wrapped callbacks Key: KAFKA-15758 URL: https://issues.apache.org/jira/browse/KAFKA-15758 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan As part of [https://github.com/apache/kafka/commit/08aa33127a4254497456aa7a0c1646c7c38adf81] the finding of the coordinator was moved to the AddPartitionsToTxnManager. In the case of an error, we return the error on the wrapped callback. This caused issues in the tests, and we found that executing the callback directly, rather than rescheduling it on the request channel, resolved them. One theory was that scheduling the callback before the request returned caused issues. Ideally we wouldn't have this special handling; this ticket is to remove it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15757) Do not advertise v4 AddPartitionsToTxn to clients
Justine Olshan created KAFKA-15757: -- Summary: Do not advertise v4 AddPartitionsToTxn to clients Key: KAFKA-15757 URL: https://issues.apache.org/jira/browse/KAFKA-15757 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan v4+ is intended to be a broker-side API. Thus, we should not return it as a valid version to clients. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15449) Verify transactional offset commits (KIP-890 part 1)
[ https://issues.apache.org/jira/browse/KAFKA-15449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15449. Resolution: Fixed > Verify transactional offset commits (KIP-890 part 1) > > > Key: KAFKA-15449 > URL: https://issues.apache.org/jira/browse/KAFKA-15449 > Project: Kafka > Issue Type: Sub-task > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Critical > > We verify on produce requests but not offset commits. We should fix this to > avoid hanging transactions on consumer offset partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15545) Update Request metrics in ops.html to reflect all the APIs
Justine Olshan created KAFKA-15545: -- Summary: Update Request metrics in ops.html to reflect all the APIs Key: KAFKA-15545 URL: https://issues.apache.org/jira/browse/KAFKA-15545 Project: Kafka Issue Type: Task Reporter: Justine Olshan When updating for KAFKA-15530, I noticed that the request metrics only mention Produce|FetchConsumer|FetchFollower. These request metrics apply to all APIs, so we should update the documentation to make this clearer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15546) Transactions tool duration field confusing for completed transactions
Justine Olshan created KAFKA-15546: -- Summary: Transactions tool duration field confusing for completed transactions Key: KAFKA-15546 URL: https://issues.apache.org/jira/browse/KAFKA-15546 Project: Kafka Issue Type: Task Reporter: Justine Olshan Assignee: Justine Olshan When using the transactions tool to describe transactions, if the transaction is completed, its reported duration still increases based on when it started. This value is not correct. Instead, we can leave the duration field blank (since we don't have the data for the completed transaction in the describe response). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15589) Flaky kafka.server.FetchRequestTest
[ https://issues.apache.org/jira/browse/KAFKA-15589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15589. Resolution: Duplicate Duplicate of https://issues.apache.org/jira/browse/KAFKA-15566 > Flaky kafka.server.FetchRequestTest > > > Key: KAFKA-15589 > URL: https://issues.apache.org/jira/browse/KAFKA-15589 > Project: Kafka > Issue Type: Task > Reporter: Justine Olshan >Priority: Major > Attachments: image-2023-10-11-13-19-37-012.png > > > I've been seeing a lot of test failures recently for > kafka.server.FetchRequestTest > Specifically: !image-2023-10-11-13-19-37-012.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15626) Replace verification guard object with a specific type
[ https://issues.apache.org/jira/browse/KAFKA-15626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15626. Resolution: Fixed > Replace verification guard object with a specific type > --- > > Key: KAFKA-15626 > URL: https://issues.apache.org/jira/browse/KAFKA-15626 > Project: Kafka > Issue Type: Sub-task > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Major > > https://github.com/apache/kafka/pull/13787#discussion_r1361468169 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15674) Consider making RequestLocal thread safe
Justine Olshan created KAFKA-15674: -- Summary: Consider making RequestLocal thread safe Key: KAFKA-15674 URL: https://issues.apache.org/jira/browse/KAFKA-15674 Project: Kafka Issue Type: Improvement Reporter: Justine Olshan KAFKA-15653 found an issue with using the same RequestLocal on multiple threads. The RequestLocal object was originally designed in a non-thread-safe manner for performance. It is passed around to methods that write to the log, and KAFKA-15653 showed that it is not too hard to accidentally share it between different threads. Given all this, and new changes and dependencies in the project compared to when it was first introduced, we may want to reconsider the thread safety of RequestLocal. -- This message was sent by Atlassian Jira (v8.20.10#820010)
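For illustration, one common way to make per-request scratch state thread-safe is to back it with a ThreadLocal, so each thread sees its own instance instead of accidentally sharing one. A minimal sketch, with hypothetical names (this is not Kafka's actual RequestLocal API):

```java
// Hypothetical sketch of a thread-safe per-request scratch object.
// Each thread lazily gets its own instance via ThreadLocal, so code that
// passes it around can never share one instance across threads.
final class ThreadSafeRequestLocal {
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(StringBuilder::new);

    // Always returns the calling thread's own buffer.
    static StringBuilder get() {
        return BUFFER.get();
    }

    private ThreadSafeRequestLocal() { }
}
```

The trade-off noted in the ticket still applies: ThreadLocal lookups cost more than passing a plain object, which is why RequestLocal was non-thread-safe in the first place.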
[jira] [Created] (KAFKA-15380) Try complete actions after callback
Justine Olshan created KAFKA-15380: -- Summary: Try complete actions after callback Key: KAFKA-15380 URL: https://issues.apache.org/jira/browse/KAFKA-15380 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan KIP-890 part 1 introduced the callback request type. It is used to execute a callback after KafkaApis.handle has returned. We did not account for tryCompleteActions at the end of handle when making this change. In tests, we saw produce p99 latency increase dramatically (likely because we have to wait for another request before we can complete DelayedProduce). As a result, we should add the tryCompleteActions call after the callback as well. In testing, this improved produce performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14984) DynamicBrokerReconfigurationTest.testThreadPoolResize() test is flaky
[ https://issues.apache.org/jira/browse/KAFKA-14984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-14984. Resolution: Duplicate > DynamicBrokerReconfigurationTest.testThreadPoolResize() test is flaky > -- > > Key: KAFKA-14984 > URL: https://issues.apache.org/jira/browse/KAFKA-14984 > Project: Kafka > Issue Type: Test >Reporter: Manyanda Chitimbo >Priority: Major > Labels: flaky-test > > The test sometimes fails with the below log > {code:java} > kafka.server.DynamicBrokerReconfigurationTest.testThreadPoolResize() failed, > log available in > .../core/build/reports/testOutput/kafka.server.DynamicBrokerReconfigurationTest.testThreadPoolResize().test.stdoutGradle > Test Run :core:test > Gradle Test Executor 6 > > DynamicBrokerReconfigurationTest > testThreadPoolResize() FAILED > org.opentest4j.AssertionFailedError: Invalid threads: expected 6, got 8: > List(data-plane-kafka-socket-acceptor-ListenerName(PLAINTEXT)-PLAINTEXT-0, > data-plane-kafka-socket-acceptor-ListenerName(PLAINTEXT)-PLAINTEXT-0, > data-plane-kafka-socket-acceptor-ListenerName(INTERNAL)-SSL-0, > data-plane-kafka-socket-acceptor-ListenerName(EXTERNAL)-SASL_SSL-0, > data-plane-kafka-socket-acceptor-ListenerName(INTERNAL)-SSL-0, > data-plane-kafka-socket-acceptor-ListenerName(INTERNAL)-SSL-0, > data-plane-kafka-socket-acceptor-ListenerName(EXTERNAL)-SASL_SSL-0, > data-plane-kafka-socket-acceptor-ListenerName(EXTERNAL)-SASL_SSL-0) ==> > expected: but was: > at > app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > at > app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) > at > app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) > at > app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) > at > app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:211) > at > 
app//kafka.server.DynamicBrokerReconfigurationTest.verifyThreads(DynamicBrokerReconfigurationTest.scala:1634) > at > app//kafka.server.DynamicBrokerReconfigurationTest.testThreadPoolResize(DynamicBrokerReconfigurationTest.scala:872) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15404) Failing Test DynamicBrokerReconfigurationTest#testThreadPoolResize
Justine Olshan created KAFKA-15404: -- Summary: Failing Test DynamicBrokerReconfigurationTest#testThreadPoolResize Key: KAFKA-15404 URL: https://issues.apache.org/jira/browse/KAFKA-15404 Project: Kafka Issue Type: Bug Reporter: Justine Olshan I've seen this failing on all builds pretty consistently. {{org.opentest4j.AssertionFailedError: Invalid threads: expected 6, got 8: List(data-plane-kafka-socket-acceptor-ListenerName(EXTERNAL)-SASL_SSL-0, data-plane-kafka-socket-acceptor-ListenerName(EXTERNAL)-SASL_SSL-0, data-plane-kafka-socket-acceptor-ListenerName(PLAINTEXT)-PLAINTEXT-0, data-plane-kafka-socket-acceptor-ListenerName(EXTERNAL)-SASL_SSL-0, data-plane-kafka-socket-acceptor-ListenerName(INTERNAL)-SSL-0, data-plane-kafka-socket-acceptor-ListenerName(INTERNAL)-SSL-0, data-plane-kafka-socket-acceptor-ListenerName(PLAINTEXT)-PLAINTEXT-0, data-plane-kafka-socket-acceptor-ListenerName(INTERNAL)-SSL-0) ==> expected: but was: }} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14097) Separate configuration for producer ID expiry
Justine Olshan created KAFKA-14097: -- Summary: Separate configuration for producer ID expiry Key: KAFKA-14097 URL: https://issues.apache.org/jira/browse/KAFKA-14097 Project: Kafka Issue Type: Improvement Reporter: Justine Olshan Ticket to track KIP-854. Currently time-based producer ID expiration is controlled by `transactional.id.expiration.ms` but we want to create a separate config. This can give us finer control over memory usage – especially since producer IDs will be more common with idempotency becoming the default. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-854+Separate+configuration+for+producer+ID+expiry -- This message was sent by Atlassian Jira (v8.20.10#820010)
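Assuming the config name and defaults proposed in KIP-854, the two settings might look like this in a broker's server.properties (values shown are the documented defaults, not recommendations):

```properties
# Existing: time-based expiration of transactional IDs (default 7 days)
transactional.id.expiration.ms=604800000
# New in KIP-854: independent expiration for producer IDs (default 1 day),
# allowing idempotent-producer state to be reclaimed sooner
producer.id.expiration.ms=86400000
```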
[jira] [Created] (KAFKA-14140) Ensure a fenced or in-controlled-shutdown replica is not eligible to become leader in ZK mode
Justine Olshan created KAFKA-14140: -- Summary: Ensure a fenced or in-controlled-shutdown replica is not eligible to become leader in ZK mode Key: KAFKA-14140 URL: https://issues.apache.org/jira/browse/KAFKA-14140 Project: Kafka Issue Type: Task Reporter: Justine Olshan Fix For: 3.3.0 KIP-841 introduced fencing on ISR in KRaft. We should also provide some of these protections in ZK, since most of the groundwork is already there. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-10550) Update AdminClient and kafka-topics.sh to support topic IDs
[ https://issues.apache.org/jira/browse/KAFKA-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-10550. Resolution: Fixed I think the scope of the KIP (describe and delete) has been completed, so I will mark this as resolved for now. > Update AdminClient and kafka-topics.sh to support topic IDs > --- > > Key: KAFKA-10550 > URL: https://issues.apache.org/jira/browse/KAFKA-10550 > Project: Kafka > Issue Type: Sub-task > Reporter: Justine Olshan >Assignee: Deng Ziming >Priority: Major > > Change describe topics AdminClient method to expose and support topic IDs > > Make changes to kafka-topics.sh --describe so a user can specify a topic > name to describe with the --topic parameter, or alternatively the user can > supply a topic ID with the --topic_id parameter -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14561) Improve transactions experience for older clients by ensuring ongoing transaction
Justine Olshan created KAFKA-14561: -- Summary: Improve transactions experience for older clients by ensuring ongoing transaction Key: KAFKA-14561 URL: https://issues.apache.org/jira/browse/KAFKA-14561 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan This is part 3 of KIP-890: 3. *To cover older clients, we will ensure a transaction is ongoing before we write to a transaction. We can do this by querying the transaction coordinator and caching the result.* See KIP-890 for more details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14562) Implement epoch bump after every transaction
Justine Olshan created KAFKA-14562: -- Summary: Implement epoch bump after every transaction Key: KAFKA-14562 URL: https://issues.apache.org/jira/browse/KAFKA-14562 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan This is part 1 of KIP-890: # *Uniquely identify transactions by bumping the producer epoch after every commit/abort marker. That way, each transaction can be identified by (producer id, epoch).* See KIP-890 for more information: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14563) Remove AddPartitionsToTxn call for newer clients as optimization
Justine Olshan created KAFKA-14563: -- Summary: Remove AddPartitionsToTxn call for newer clients as optimization Key: KAFKA-14563 URL: https://issues.apache.org/jira/browse/KAFKA-14563 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan This is part 2 of KIP-890: {*}2. Remove the addPartitionsToTxn call and implicitly just add partitions to the transaction on the first produce request during a transaction{*}. See KIP-890 for more information: https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14439) Specify returned errors for various APIs and versions
Justine Olshan created KAFKA-14439: -- Summary: Specify returned errors for various APIs and versions Key: KAFKA-14439 URL: https://issues.apache.org/jira/browse/KAFKA-14439 Project: Kafka Issue Type: Task Reporter: Justine Olshan Kafka is known for supporting various clients and being compatible across different versions. But one thing that is a bit unclear is what errors each API can return in its response. Knowing what errors can come from each version gives those who implement clients a more defined spec for what errors they need to handle. When new errors are added, it is clearer to the clients that changes need to be made. It also helps contributors get a better understanding of how clients are expected to react, and potentially find and prevent gaps like the one found in https://issues.apache.org/jira/browse/KAFKA-14417 I briefly synced offline with [~hachikuji] about this, and he suggested maybe adding values for the error codes in the schema definitions of APIs that specify the error codes and what versions they are returned on. One idea was creating some enum type to accomplish this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
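The enum idea could be sketched roughly as follows. The error codes shown are real Kafka protocol error codes, but the per-version mapping here is purely illustrative, not the actual spec being proposed:

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch: annotate each error an API can return with the
// lowest request version that may return it, so client implementers can
// derive the set of errors to handle per version.
enum ProduceError {
    NOT_LEADER_OR_FOLLOWER((short) 6, 0),    // versions are examples only
    INVALID_PRODUCER_EPOCH((short) 47, 3),
    INVALID_TXN_STATE((short) 48, 3);

    final short code;
    final int minVersion; // lowest request version that may return this error

    ProduceError(short code, int minVersion) {
        this.code = code;
        this.minVersion = minVersion;
    }

    // All errors a client speaking the given request version must handle.
    static Set<ProduceError> possibleErrors(int requestVersion) {
        Set<ProduceError> result = EnumSet.noneOf(ProduceError.class);
        for (ProduceError e : values()) {
            if (requestVersion >= e.minVersion) {
                result.add(e);
            }
        }
        return result;
    }
}
```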
[jira] [Created] (KAFKA-14402) Transactions Server Side Defense
Justine Olshan created KAFKA-14402: -- Summary: Transactions Server Side Defense Key: KAFKA-14402 URL: https://issues.apache.org/jira/browse/KAFKA-14402 Project: Kafka Issue Type: Task Reporter: Justine Olshan Assignee: Justine Olshan We have seen hanging transactions in Kafka where the last stable offset (LSO) does not update, we can’t clean the log (if the topic is compacted), and read_committed consumers get stuck. This can happen when a message gets stuck or delayed due to networking issues or a network partition, the transaction aborts, and then the delayed message finally comes in. The delayed message case can also violate EOS if the delayed message comes in after the next addPartitionsToTxn request comes in. Effectively we may see a message from a previous (aborted) transaction become part of the next transaction. Another way hanging transactions can occur is that a client is buggy and may somehow try to write to a partition before it adds the partition to the transaction. In both of these cases, we want the server to have some control to prevent these incorrect records from being written and either causing hanging transactions or violating Exactly once semantics (EOS) by including records in the wrong transaction. The best way to avoid this issue is to: # *Uniquely identify transactions by bumping the producer epoch after every commit/abort marker. That way, each transaction can be identified by (producer id, epoch).* # {*}Remove the addPartitionsToTxn call and implicitly just add partitions to the transaction on the first produce request during a transaction{*}. We avoid the late arrival case because the transaction is uniquely identified and fenced AND we avoid the buggy client case because we remove the need for the client to explicitly add partitions to begin the transaction. Of course, 1 and 2 require client-side changes, so for older clients, those approaches won’t apply. 3. 
*To cover older clients, we will ensure a transaction is ongoing before we write to a transaction. We can do this by querying the transaction coordinator and caching the result.* See KIP-890 for more information: ** https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense -- This message was sent by Atlassian Jira (v8.20.10#820010)
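Part 3 (the older-client path) can be sketched in miniature: before allowing a write, ask the transaction coordinator whether a transaction is ongoing, and cache the verified state so later writes in the same transaction skip the round trip. All names here are hypothetical, not the actual broker implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "verify a transaction is ongoing before writing,
// and cache the result" for older clients that still call AddPartitionsToTxn.
final class TransactionVerifier {
    // producerId -> epoch that was last verified as having an ongoing txn
    private final Map<Long, Short> verifiedEpochs = new HashMap<>();

    boolean verifyBeforeWrite(long producerId, short epoch) {
        Short cached = verifiedEpochs.get(producerId);
        if (cached != null && cached == epoch) {
            return true; // already verified this transaction; skip round trip
        }
        if (coordinatorSaysOngoing(producerId, epoch)) {
            verifiedEpochs.put(producerId, epoch); // cache the result
            return true;
        }
        return false; // reject the write: no ongoing transaction
    }

    // Stand-in for the coordinator round trip (e.g. a verify-only check).
    private boolean coordinatorSaysOngoing(long producerId, short epoch) {
        return epoch >= 0; // toy rule for the sketch
    }
}
```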
[jira] [Created] (KAFKA-14417) Producer doesn't handle REQUEST_TIMED_OUT for InitProducerIdRequest
Justine Olshan created KAFKA-14417: -- Summary: Producer doesn't handle REQUEST_TIMED_OUT for InitProducerIdRequest Key: KAFKA-14417 URL: https://issues.apache.org/jira/browse/KAFKA-14417 Project: Kafka Issue Type: Task Affects Versions: 3.3.0, 3.2.0, 3.0.0, 3.1.0 Reporter: Justine Olshan In TransactionManager we have a handler for InitProducerIdRequests [https://github.com/apache/kafka/blob/19286449ee20f85cc81860e13df14467d4ce287c/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#LL1276C14-L1276C14] However, we have the potential to return a REQUEST_TIMED_OUT error in ProducerIdManager when the BrokerToControllerChannel manager times out: [https://github.com/apache/kafka/blob/19286449ee20f85cc81860e13df14467d4ce287c/core/src/main/scala/kafka/coordinator/transaction/ProducerIdManager.scala#L236] or when the poll returns null: [https://github.com/apache/kafka/blob/19286449ee20f85cc81860e13df14467d4ce287c/core/src/main/scala/kafka/coordinator/transaction/ProducerIdManager.scala#L170] Since REQUEST_TIMED_OUT is not handled by the producer, we treat it as a fatal error. With idempotent producers now the default, this can cause more issues. Seems like the commit that introduced the changes was this one: [https://github.com/apache/kafka/commit/72d108274c98dca44514007254552481c731c958] so we are vulnerable when the server code is IBP 3.0 and beyond. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14640) Update AddPartitionsToTxn protocol to batch and handle verifyOnly requests
Justine Olshan created KAFKA-14640: -- Summary: Update AddPartitionsToTxn protocol to batch and handle verifyOnly requests Key: KAFKA-14640 URL: https://issues.apache.org/jira/browse/KAFKA-14640 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan As part of KIP-890 we are making some changes to this protocol. 1. We can send a request to verify a partition is added to a transaction 2. We can batch multiple transactional IDs -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14359) Idempotent Producer continues to retry on OutOfOrderSequence error when first batch fails
Justine Olshan created KAFKA-14359: -- Summary: Idempotent Producer continues to retry on OutOfOrderSequence error when first batch fails Key: KAFKA-14359 URL: https://issues.apache.org/jira/browse/KAFKA-14359 Project: Kafka Issue Type: Task Reporter: Justine Olshan When the idempotent producer does not have any state, it can fall into a state where it keeps retrying an out-of-order sequence. Consider the following scenario, where an idempotent producer has retries and delivery timeout set to int max (a configuration used in Streams): 1. The producer sends out several batches (up to 5), with the first one starting at sequence 0. 2. The first batch with sequence 0 fails due to a transient error (i.e., NOT_LEADER_OR_FOLLOWER or a timeout error). 3. The second batch, say with sequence 200, comes in. Since there is no previous state to invalidate it, it gets written to the log. 4. The original batch is retried and will get an out-of-order sequence number. 5. The current Java client will continue to retry this batch, but it will never resolve. -- This message was sent by Atlassian Jira (v8.20.10#820010)
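The five steps above can be simulated in miniature (illustrative, not actual broker code): with no prior producer state, whichever sequence arrives first is accepted, after which the retried earlier batch is rejected forever.

```java
// Toy simulation of per-partition sequence validation on the broker.
// With no prior state for the producer, the first sequence to arrive is
// not validated, so a later batch that "wins the race" poisons the state.
final class PartitionSequenceState {
    private int lastSequence = -1;
    private boolean hasState = false;

    // Returns true if the batch is accepted, false for OUT_OF_ORDER_SEQUENCE.
    boolean tryAppend(int firstSequence, int recordCount) {
        // No state yet: accept any first sequence (pre-KIP-890-part-2 behavior).
        if (hasState && firstSequence != lastSequence + 1) {
            return false;
        }
        lastSequence = firstSequence + recordCount - 1;
        hasState = true;
        return true;
    }
}
```

In the scenario above, the batch at sequence 200 is appended first, so the retried batch at sequence 0 fails every attempt: the client retries indefinitely and the situation never resolves.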
[jira] [Created] (KAFKA-14790) Add more AddPartitionsToTxn tests in KafkaApis and Authorizer tests
Justine Olshan created KAFKA-14790: -- Summary: Add more AddPartitionsToTxn tests in KafkaApis and Authorizer tests Key: KAFKA-14790 URL: https://issues.apache.org/jira/browse/KAFKA-14790 Project: Kafka Issue Type: Test Reporter: Justine Olshan Assignee: Justine Olshan Followup from [https://github.com/apache/kafka/pull/13231] We should add authorizer tests for the new version. We should add some more tests to KafkaApis to cover auth and validation failures. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14640) Update AddPartitionsToTxn protocol to batch and handle verifyOnly requests
[ https://issues.apache.org/jira/browse/KAFKA-14640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-14640. Resolution: Fixed > Update AddPartitionsToTxn protocol to batch and handle verifyOnly requests > -- > > Key: KAFKA-14640 > URL: https://issues.apache.org/jira/browse/KAFKA-14640 > Project: Kafka > Issue Type: Sub-task > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Major > > As part of KIP-890 we are making some changes to this protocol. > 1. We can send a request to verify a partition is added to a transaction > 2. We can batch multiple transactional IDs -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14790) Add more AddPartitionsToTxn tests in KafkaApis and Authorizer tests
[ https://issues.apache.org/jira/browse/KAFKA-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-14790. Resolution: Fixed > Add more AddPartitionsToTxn tests in KafkaApis and Authorizer tests > --- > > Key: KAFKA-14790 > URL: https://issues.apache.org/jira/browse/KAFKA-14790 > Project: Kafka > Issue Type: Test > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Minor > > Followup from [https://github.com/apache/kafka/pull/13231] > We should add authorizer tests for the new version. > We should add some more tests to KafkaApis to cover auth and validation > failures. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14916) Fix code that assumes transactional ID implies all records are transactional
Justine Olshan created KAFKA-14916: -- Summary: Fix code that assumes transactional ID implies all records are transactional Key: KAFKA-14916 URL: https://issues.apache.org/jira/browse/KAFKA-14916 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan KAFKA-14561 wrote code that assumed that if a transactional ID was included, all record batches were transactional and had the same producer ID. This work will improve validation and fix the code that assumes all batches are transactional. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14917) Producer write while transaction is pending.
Justine Olshan created KAFKA-14917: -- Summary: Producer write while transaction is pending. Key: KAFKA-14917 URL: https://issues.apache.org/jira/browse/KAFKA-14917 Project: Kafka Issue Type: Bug Reporter: Justine Olshan Assignee: Justine Olshan As discovered in KAFKA-14904, we seem to get into a state where we try to write to a partition while the ongoing state is still pending. This is likely a bigger issue than the test and worth looking into. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14899) Revisit Action Queue
Justine Olshan created KAFKA-14899: -- Summary: Revisit Action Queue Key: KAFKA-14899 URL: https://issues.apache.org/jira/browse/KAFKA-14899 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan With KAFKA-14561 we introduced a notion of callback requests. It would be nice to standardize and combine action queue usage here. However, the current implementation of the callback request assumes local time is computed upon response send. This same paradigm may not hold for the action queue. We should follow up and see what changes need to be made to combine the two. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14917) Producer write while transaction is pending.
[ https://issues.apache.org/jira/browse/KAFKA-14917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-14917. Resolution: Won't Fix > Producer write while transaction is pending. > > > Key: KAFKA-14917 > URL: https://issues.apache.org/jira/browse/KAFKA-14917 > Project: Kafka > Issue Type: Bug > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Major > > As discovered in KAFKA-14904, we seem to get into a state where we try to > write to a partition while the ongoing state is still pending. > This is likely a bigger issue than the test and worth looking into. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14920) Address timeouts and out of order sequences
Justine Olshan created KAFKA-14920: -- Summary: Address timeouts and out of order sequences Key: KAFKA-14920 URL: https://issues.apache.org/jira/browse/KAFKA-14920 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan KAFKA-14844 showed the destructive nature of a timeout on the first produce request for a topic partition (i.e. one that has no state in the producer state manager). Since we currently don't validate the first sequence (we will in part 2 of KIP-890), any transient error on the first produce can lead to out of order sequences that never recover. Originally, KAFKA-14561 relied on the producer's retry mechanism for these transient issues, but until that is fixed, we may need to retry from within the AddPartitionsToTxnManager instead. We addressed the concurrent transactions case, but there are other errors like coordinator loading that we could run into and see increased out of order issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14884) Include check transaction is still ongoing right before append
[ https://issues.apache.org/jira/browse/KAFKA-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-14884. Resolution: Fixed > Include check transaction is still ongoing right before append > --- > > Key: KAFKA-14884 > URL: https://issues.apache.org/jira/browse/KAFKA-14884 > Project: Kafka > Issue Type: Sub-task >Affects Versions: 3.5.0 > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Blocker > > Even after checking via AddPartitionsToTxn, the transaction could be aborted > after the response. We can add one more check before appending. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (KAFKA-14884) Include check transaction is still ongoing right before append
[ https://issues.apache.org/jira/browse/KAFKA-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan reopened KAFKA-14884: I'm confused by all my blockers 🤦‍♀️ > Include check transaction is still ongoing right before append > --- > > Key: KAFKA-14884 > URL: https://issues.apache.org/jira/browse/KAFKA-14884 > Project: Kafka > Issue Type: Sub-task >Affects Versions: 3.5.0 > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Blocker > > Even after checking via AddPartitionsToTxn, the transaction could be aborted > after the response. We can add one more check before appending. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14904) Flaky Test kafka.api.TransactionsBounceTest.testWithGroupId()
[ https://issues.apache.org/jira/browse/KAFKA-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-14904. Resolution: Fixed > Flaky Test kafka.api.TransactionsBounceTest.testWithGroupId() > -- > > Key: KAFKA-14904 > URL: https://issues.apache.org/jira/browse/KAFKA-14904 > Project: Kafka > Issue Type: Test >Affects Versions: 3.5.0 > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Blocker > > After merging KAFKA-14561 I noticed this test still occasionally failed via > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6ms while awaiting EndTxn(true) > I will investigate the cause. > Note: This error occurs when we are waiting for the transaction to be > committed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14931) Revert KAFKA-14561 in 3.5
Justine Olshan created KAFKA-14931: -- Summary: Revert KAFKA-14561 in 3.5 Key: KAFKA-14931 URL: https://issues.apache.org/jira/browse/KAFKA-14931 Project: Kafka Issue Type: Task Reporter: Justine Olshan Assignee: Justine Olshan We have too many blockers for this commit to work well, so in the interest of code quality, we should revert in 3.5 and fix the issues for 3.6 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14931) Revert KAFKA-14561 in 3.5
[ https://issues.apache.org/jira/browse/KAFKA-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-14931. Resolution: Fixed > Revert KAFKA-14561 in 3.5 > - > > Key: KAFKA-14931 > URL: https://issues.apache.org/jira/browse/KAFKA-14931 > Project: Kafka > Issue Type: Task >Affects Versions: 3.5.0 > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Blocker > > We have too many blockers for this commit to work well, so in the interest of > code quality, we should revert > https://issues.apache.org/jira/browse/KAFKA-14561 in 3.5 and fix the issues > for 3.6 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14958) Investigate enforcing all batches have the same producer ID
Justine Olshan created KAFKA-14958: -- Summary: Investigate enforcing all batches have the same producer ID Key: KAFKA-14958 URL: https://issues.apache.org/jira/browse/KAFKA-14958 Project: Kafka Issue Type: Task Reporter: Justine Olshan KAFKA-14916 was created after I incorrectly assumed transaction ID in the produce request indicated all batches were transactional. Originally this ticket had an action item to ensure all the producer IDs are the same in the batches since we send a single txn ID, but we decided this can be done in a followup, as we still need to assess if we can enforce this without breaking workloads. This ticket is that followup. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14854) Refactor inter broker send thread to handle all interbroker requests on one thread
Justine Olshan created KAFKA-14854: -- Summary: Refactor inter broker send thread to handle all interbroker requests on one thread Key: KAFKA-14854 URL: https://issues.apache.org/jira/browse/KAFKA-14854 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan Currently we create a new thread for each inter-broker request type that implements InterBrokerSendThread. It would be better to implement a single thread that multiple request types can use with their custom logic. I propose creating a single thread that takes a collection of "managers", one per request type, and sends the requests they generate. -- This message was sent by Atlassian Jira (v8.20.10#820010)
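The "one thread, many managers" idea described in this ticket can be sketched roughly as below. The interface and class names here are illustrative assumptions, not the actual InterBrokerSendThread or AddPartitionsToTxnManager API; the string "network" is a placeholder for the shared network client a real implementation would use.

```java
import java.util.List;

// Each request type supplies its custom logic behind a common interface;
// names are assumptions for illustration.
interface RequestManager {
    List<String> generateRequests();      // requests this manager wants sent
    void handleResponse(String response); // invoked when a response arrives
}

final class SingleSender {
    private final List<RequestManager> managers;

    SingleSender(List<RequestManager> managers) {
        this.managers = managers;
    }

    // One pass over all managers; a real send thread would loop this,
    // funneling every request type through one shared network client
    // instead of spawning a thread per request type.
    void runOnePass() {
        for (RequestManager m : managers) {
            for (String req : m.generateRequests()) {
                // Stand-in for the actual network round trip.
                m.handleResponse("ok:" + req);
            }
        }
    }
}
```

The design benefit is that adding a new inter-broker request type means registering one more manager, not starting one more thread.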
[jira] [Created] (KAFKA-14884) Include check transaction is still ongoing right before append
Justine Olshan created KAFKA-14884: -- Summary: Include check transaction is still ongoing right before append Key: KAFKA-14884 URL: https://issues.apache.org/jira/browse/KAFKA-14884 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan Even after checking via AddPartitionsToTxn, the transaction could be aborted after the response. We can add one more check before appending. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14895) Move AddPartitionsToTxnManager files to java
Justine Olshan created KAFKA-14895: -- Summary: Move AddPartitionsToTxnManager files to java Key: KAFKA-14895 URL: https://issues.apache.org/jira/browse/KAFKA-14895 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan Assignee: Justine Olshan Followup task to move the files from scala to java. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14896) TransactionsBounceTest causes a thread leak
Justine Olshan created KAFKA-14896: -- Summary: TransactionsBounceTest causes a thread leak Key: KAFKA-14896 URL: https://issues.apache.org/jira/browse/KAFKA-14896 Project: Kafka Issue Type: Bug Reporter: Justine Olshan Assignee: Justine Olshan On several PR builds I see a test fail with ["Producer closed forcefully" |https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13391/21/testReport/junit/kafka.api/TransactionsBounceTest/Build___JDK_8_and_Scala_2_12___testWithGroupId__/] and then many other tests fail with initialization errors due to [controller-event-thread,daemon-broker-bouncer-EventThread|https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13391/21/testReport/junit/kafka.api/TransactionsBounceTest/Build___JDK_8_and_Scala_2_12___executionError/] In TransactionsBounceTest.testBrokerFailure, we create this thread to bounce the brokers. There is a finally block to shut it down but it seems to not be working. We should shut it down correctly. Examples of failures: [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13391/21/#showFailuresLink] [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13391/17/#showFailuresLink] -- This message was sent by Atlassian Jira (v8.20.10#820010)
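Shutting such a bouncer thread down correctly usually takes more than setting a flag: the thread must be woken if it is sleeping and the test must actually wait for it to exit. A minimal sketch of that pattern follows; the class and method names are assumptions for illustration, not the actual TransactionsBounceTest code.

```java
// Minimal sketch of a bouncer thread that can be shut down reliably from a
// finally block so it cannot leak past the test.
final class Bouncer implements Runnable {
    private volatile boolean running = true;

    @Override public void run() {
        while (running && !Thread.currentThread().isInterrupted()) {
            // ... bounce brokers here ...
            try {
                Thread.sleep(10); // pause between bounces
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore interrupt status
            }
        }
    }

    // Signal the loop, wake the thread if it is sleeping, and wait for it
    // to actually terminate -- merely flipping the flag is not enough.
    void shutdown(Thread t) throws InterruptedException {
        running = false;
        t.interrupt();
        t.join(5_000);
    }
}
```

In a test this would be used as `try { /* run test */ } finally { bouncer.shutdown(thread); }`, so the thread is gone even when the test body throws.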
[jira] [Created] (KAFKA-14904) Flaky Test kafka.api.TransactionsBounceTest.testWithGroupId()
Justine Olshan created KAFKA-14904: -- Summary: Flaky Test kafka.api.TransactionsBounceTest.testWithGroupId() Key: KAFKA-14904 URL: https://issues.apache.org/jira/browse/KAFKA-14904 Project: Kafka Issue Type: Test Reporter: Justine Olshan Assignee: Justine Olshan After merging KAFKA-14561 I noticed this test still occasionally failed via org.apache.kafka.common.errors.TimeoutException: Timeout expired after 6ms while awaiting EndTxn(true) I will investigate the cause. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14884) Include check transaction is still ongoing right before append
[ https://issues.apache.org/jira/browse/KAFKA-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-14884. Resolution: Fixed > Include check transaction is still ongoing right before append > --- > > Key: KAFKA-14884 > URL: https://issues.apache.org/jira/browse/KAFKA-14884 > Project: Kafka > Issue Type: Sub-task >Affects Versions: 3.6.0 > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Blocker > > Even after checking via AddPartitionsToTxn, the transaction could be aborted > after the response. We can add one more check before appending. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15044) Snappy v.1.1.9.1 NoClassDefFound on ARM machines
[ https://issues.apache.org/jira/browse/KAFKA-15044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15044. Resolution: Fixed > Snappy v.1.1.9.1 NoClassDefFound on ARM machines > > > Key: KAFKA-15044 > URL: https://issues.apache.org/jira/browse/KAFKA-15044 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.5.0 >Reporter: David Mao >Assignee: David Mao >Priority: Major > > We upgraded our snappy dependency but v1.1.9.1 has compatibility issues with > arm. We should upgrade to v1.1.10.0 which resolves this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15028) AddPartitionsToTxnManager metrics
Justine Olshan created KAFKA-15028: -- Summary: AddPartitionsToTxnManager metrics Key: KAFKA-15028 URL: https://issues.apache.org/jira/browse/KAFKA-15028 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan KIP-890 added metrics for the AddPartitionsToTxnManager VerificationTimeMs – number of milliseconds from adding partition info to the manager to the time the response is sent. This will include the round trip to the transaction coordinator if it is called. This will also account for verifications that fail before the coordinator is called. VerificationFailureRate – rate of verifications that returned in failure either from the AddPartitionsToTxn response or through errors in the manager. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14920) Address timeouts and out of order sequences
[ https://issues.apache.org/jira/browse/KAFKA-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-14920. Resolution: Fixed > Address timeouts and out of order sequences > --- > > Key: KAFKA-14920 > URL: https://issues.apache.org/jira/browse/KAFKA-14920 > Project: Kafka > Issue Type: Sub-task >Affects Versions: 3.6.0 > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Blocker > > KAFKA-14844 showed the destructive nature of a timeout on the first produce > request for a topic partition (i.e. one that has no state in the producer state manager). > Since we currently don't validate the first sequence (we will in part 2 of > KIP-890), any transient error on the first produce can lead to out of order > sequences that never recover. > Originally, KAFKA-14561 relied on the producer's retry mechanism for these > transient issues, but until that is fixed, we may need to retry from within the > AddPartitionsToTxnManager instead. We addressed the concurrent transactions case, but > there are other errors like coordinator loading that we could run into and > see increased out of order issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15028) AddPartitionsToTxnManager metrics
[ https://issues.apache.org/jira/browse/KAFKA-15028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15028. Resolution: Fixed > AddPartitionsToTxnManager metrics > - > > Key: KAFKA-15028 > URL: https://issues.apache.org/jira/browse/KAFKA-15028 > Project: Kafka > Issue Type: Sub-task > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Major > Attachments: latency-cpu.html > > > KIP-890 added metrics for the AddPartitionsToTxnManager > VerificationTimeMs – number of milliseconds from adding partition info to the > manager to the time the response is sent. This will include the round trip to > the transaction coordinator if it is called. This will also account for > verifications that fail before the coordinator is called. > VerificationFailureRate – rate of verifications that returned in failure > either from the AddPartitionsToTxn response or through errors in the manager. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15099) Flaky Test kafka.api.TransactionsTest.testBumpTransactionalEpoch(String).quorum=kraft
Justine Olshan created KAFKA-15099: -- Summary: Flaky Test kafka.api.TransactionsTest.testBumpTransactionalEpoch(String).quorum=kraft Key: KAFKA-15099 URL: https://issues.apache.org/jira/browse/KAFKA-15099 Project: Kafka Issue Type: Bug Reporter: Justine Olshan This one often fails with: org.apache.kafka.common.errors.TimeoutException: Timeout expired after 6ms while awaiting InitProducerId. Seems like a KRaft-only issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14916) Fix code that assumes transactional ID implies all records are transactional
[ https://issues.apache.org/jira/browse/KAFKA-14916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-14916. Resolution: Fixed > Fix code that assumes transactional ID implies all records are transactional > > > Key: KAFKA-14916 > URL: https://issues.apache.org/jira/browse/KAFKA-14916 > Project: Kafka > Issue Type: Sub-task >Affects Versions: 3.6.0 > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Blocker > > KAFKA-14561 wrote code that assumed that if a transactional ID was included, > all record batches were transactional and had the same producer ID. > This work will improve validation and fix the code that assumes all batches > are transactional. > Further, KAFKA-14561 will not assume all records are transactional. > Originally this ticket had an action item to ensure all the producer IDs are > the same in the batches since we send a single txn ID, but that can be done > in a followup KAFKA-14958, as we still need to assess if we can enforce this > without breaking workloads. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16192) Introduce usage of flexible records to coordinators
Justine Olshan created KAFKA-16192: -- Summary: Introduce usage of flexible records to coordinators Key: KAFKA-16192 URL: https://issues.apache.org/jira/browse/KAFKA-16192 Project: Kafka Issue Type: Task Reporter: Justine Olshan Assignee: Justine Olshan [KIP-915| https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation] introduced flexible versions to the records used for the group and transaction coordinators. However, the KIP did not update the record version used. For [KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense] we intend to use flexible fields in the transaction state records. This requires a safe way to upgrade from non-flexible version records to flexible version records. Typically this is done as a message format bump. There may be an option to make this change using MV (metadata version) instead, since the readers of the records are internal and not external consumers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16229) Slow expiration of Producer IDs leading to high CPU usage
[ https://issues.apache.org/jira/browse/KAFKA-16229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16229. Resolution: Fixed > Slow expiration of Producer IDs leading to high CPU usage > - > > Key: KAFKA-16229 > URL: https://issues.apache.org/jira/browse/KAFKA-16229 > Project: Kafka > Issue Type: Bug >Reporter: Jorge Esteban Quilcate Otoya >Assignee: Jorge Esteban Quilcate Otoya >Priority: Major > > Expiration of ProducerIds is implemented with a slow removal of map keys: > ``` > producers.keySet().removeAll(keys); > ``` > This unnecessarily goes through all producer IDs to collect the expired keys > to be removed. > This leads to quadratic time in the worst case when most/all keys need to be > removed: > ``` > Benchmark (numProducerIds) Mode Cnt > Score Error Units > ProducerStateManagerBench.testDeleteExpiringIds 100 avgt 3 > 9164.043 ± 10647.877 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 1000 avgt 3 > 341561.093 ± 20283.211 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 1 avgt 3 > 44957983.550 ± 9389011.290 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 10 avgt 3 > 5683374164.167 ± 1446242131.466 ns/op > ``` > A simple fix is to use map#remove(key) instead, leading to a more linear > growth: > ``` > Benchmark (numProducerIds) Mode Cnt > Score Error Units > ProducerStateManagerBench.testDeleteExpiringIds 100 avgt 3 > 5779.056 ± 651.389 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 1000 avgt 3 > 61430.530 ± 21875.644 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 1 avgt 3 > 643887.031 ± 600475.302 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 10 avgt 3 > 7741689.539 ± 3218317.079 ns/op > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
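The two removal strategies from the ticket can be contrasted side by side. Note why the slow path hurts: when the key set is no larger than the expired-keys collection, `AbstractSet.removeAll` iterates the key set and calls `contains` on the passed collection for every key, which is a linear scan per key when that collection is a list. The map and method names below are illustrative, not the ProducerStateManager API.

```java
import java.util.List;
import java.util.Map;

// Sketch contrasting the slow and fast producer-ID expiration paths
// described in KAFKA-16229; names are illustrative.
final class ProducerIdExpiration {
    // The pattern the ticket flags: when most/all keys are expired,
    // this degrades to list.contains() once per map key (quadratic).
    static void expireSlow(Map<Long, Long> producers, List<Long> expired) {
        producers.keySet().removeAll(expired);
    }

    // The fix: one O(1) hash lookup per expired ID, linear overall.
    static void expireFast(Map<Long, Long> producers, List<Long> expired) {
        for (Long pid : expired) {
            producers.remove(pid);
        }
    }
}
```

Both methods leave the map in the same state; only the asymptotic cost differs, which is what the benchmark numbers above show.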
[jira] [Created] (KAFKA-16245) DescribeConsumerGroupTest failing
Justine Olshan created KAFKA-16245: -- Summary: DescribeConsumerGroupTest failing Key: KAFKA-16245 URL: https://issues.apache.org/jira/browse/KAFKA-16245 Project: Kafka Issue Type: Task Reporter: Justine Olshan The first failing instances on trunk are in this PR: [https://github.com/apache/kafka/pull/15275] That PR's builds have the test failing consistently, when it wasn't failing this consistently before. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15665) Enforce ISR to have all target replicas when complete partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15665. Resolution: Fixed > Enforce ISR to have all target replicas when complete partition reassignment > > > Key: KAFKA-15665 > URL: https://issues.apache.org/jira/browse/KAFKA-15665 > Project: Kafka > Issue Type: Sub-task >Reporter: Calvin Liu >Assignee: Calvin Liu >Priority: Major > > Current partition reassignment can be completed when the new ISR is under min > ISR. We should fix this behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16012) Incomplete range assignment in consumer
[ https://issues.apache.org/jira/browse/KAFKA-16012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16012. Resolution: Fixed > Incomplete range assignment in consumer > --- > > Key: KAFKA-16012 > URL: https://issues.apache.org/jira/browse/KAFKA-16012 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Philip Nee >Priority: Blocker > Fix For: 3.7.0 > > > We were looking into test failures here: > https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1702475525--jolshan--kafka-15784--7cad567675/2023-12-13--001./2023-12-13–001./report.html. > > Here is the first failure in the report: > {code:java} > > test_id: > kafkatest.tests.core.group_mode_transactions_test.GroupModeTransactionsTest.test_transactions.failure_mode=clean_bounce.bounce_target=brokers > status: FAIL > run time: 3 minutes 4.950 seconds > TimeoutError('Consumer consumed only 88223 out of 10 messages in > 90s') {code} > > We traced the failure to an apparent bug during the last rebalance before the > group became empty. The last remaining instance seems to receive an > incomplete assignment which prevents it from completing expected consumption > on some partitions. Here is the rebalance from the coordinator's perspective: > {code:java} > server.log.2023-12-13-04:[2023-12-13 04:58:56,987] INFO [GroupCoordinator 3]: > Stabilized group grouped-transactions-test-consumer-group generation 5 > (__consumer_offsets-2) with 1 members > (kafka.coordinator.group.GroupCoordinator) > server.log.2023-12-13-04:[2023-12-13 04:58:56,990] INFO [GroupCoordinator 3]: > Assignment received from leader > consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd > for group grouped-transactions-test-consumer-group for generation 5. The > group has 1 members, 0 of which are static. 
> (kafka.coordinator.group.GroupCoordinator) {code} > The group is down to one member in generation 5. In the previous generation, > the consumer in question reported this assignment: > {code:java} > // Gen 4: we've got partitions 0-4 > [2023-12-13 04:58:52,631] DEBUG [Consumer > clientId=consumer-grouped-transactions-test-consumer-group-1, > groupId=grouped-transactions-test-consumer-group] Executing onJoinComplete > with generation 4 and memberId > consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) > [2023-12-13 04:58:52,631] INFO [Consumer > clientId=consumer-grouped-transactions-test-consumer-group-1, > groupId=grouped-transactions-test-consumer-group] Notifying assignor about > the new Assignment(partitions=[input-topic-0, input-topic-1, input-topic-2, > input-topic-3, input-topic-4]) > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} > However, in generation 5, we seem to be assigned only one partition: > {code:java} > // Gen 5: Now we have only partition 1? But aren't we the last member in the > group? > [2023-12-13 04:58:56,954] DEBUG [Consumer > clientId=consumer-grouped-transactions-test-consumer-group-1, > groupId=grouped-transactions-test-consumer-group] Executing onJoinComplete > with generation 5 and memberId > consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) > [2023-12-13 04:58:56,955] INFO [Consumer > clientId=consumer-grouped-transactions-test-consumer-group-1, > groupId=grouped-transactions-test-consumer-group] Notifying assignor about > the new Assignment(partitions=[input-topic-1]) > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} > The assignment type is range from the JoinGroup for generation 5. 
The decoded > metadata sent by the consumer is this: > {code:java} > Subscription(topics=[input-topic], ownedPartitions=[], groupInstanceId=null, > generationId=4, rackId=null) {code} > Here is the decoded assignment from the SyncGroup: > {code:java} > Assignment(partitions=[input-topic-1]) {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15784) Ensure atomicity of in memory update and write when transactionally committing offsets
[ https://issues.apache.org/jira/browse/KAFKA-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15784. Resolution: Fixed > Ensure atomicity of in memory update and write when transactionally > committing offsets > -- > > Key: KAFKA-15784 > URL: https://issues.apache.org/jira/browse/KAFKA-15784 > Project: Kafka > Issue Type: Sub-task >Affects Versions: 3.7.0 > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Blocker > > [https://github.com/apache/kafka/pull/14370] (KAFKA-15449) removed the > locking around validating, updating state, and writing to the log > transactional offset commits. (The verification causes us to release the lock) > This was discovered in the discussion of > [https://github.com/apache/kafka/pull/14629] (KAFKA-15653). > Since KAFKA-15653 is needed for 3.5.1, it makes sense to address the locking > issue separately with this ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16045) ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky
Justine Olshan created KAFKA-16045: -- Summary: ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky Key: KAFKA-16045 URL: https://issues.apache.org/jira/browse/KAFKA-16045 Project: Kafka Issue Type: Test Reporter: Justine Olshan I'm seeing ZkMigrationIntegrationTest.testMigrateTopicDeletion fail for many builds. I believe it is also causing a thread leak, because on most runs where it fails, ReplicaManager tests also fail with extra threads. The test always fails with `org.opentest4j.AssertionFailedError: Timed out waiting for topics to be deleted` gradle enterprise link: [https://ge.apache.org/scans/tests?search.names=Git%20branch[…]lues=trunk=kafka.zk.ZkMigrationIntegrationTest|https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=kafka.zk.ZkMigrationIntegrationTest] recent pr: [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15023/18/tests/] trunk builds: [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2502/tests], [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2501/tests] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16122) TransactionsBounceTest -- server disconnected before response was received
Justine Olshan created KAFKA-16122: -- Summary: TransactionsBounceTest -- server disconnected before response was received Key: KAFKA-16122 URL: https://issues.apache.org/jira/browse/KAFKA-16122 Project: Kafka Issue Type: Test Reporter: Justine Olshan I noticed a ton of tests failing with: {code:java} Error org.apache.kafka.common.KafkaException: Unexpected error in TxnOffsetCommitResponse: The server disconnected before a response was received. {code} {code:java} Stacktrace org.apache.kafka.common.KafkaException: Unexpected error in TxnOffsetCommitResponse: The server disconnected before a response was received. at app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(TransactionManager.java:1702) at app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1236) at app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154) at app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:608) at app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:600) at app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:457) at app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:334) at app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:249) at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583) {code} The error indicates a network error which is retriable but the TxnOffsetCommit handler doesn't expect this. https://issues.apache.org/jira/browse/KAFKA-14417 addressed many of the other requests but not this one. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15975) Update kafka quickstart guide to no longer list ZK start first
Justine Olshan created KAFKA-15975: -- Summary: Update kafka quickstart guide to no longer list ZK start first Key: KAFKA-15975 URL: https://issues.apache.org/jira/browse/KAFKA-15975 Project: Kafka Issue Type: Task Components: docs Affects Versions: 4.0.0 Reporter: Justine Olshan Given we are deprecating ZooKeeper, I think we should update our quickstart guide to not list the ZooKeeper instructions first. With 4.0, we may want to remove it entirely. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15957) ConsistencyVectorIntegrationTest.shouldHaveSamePositionBoundActiveAndStandBy broken
Justine Olshan created KAFKA-15957: -- Summary: ConsistencyVectorIntegrationTest.shouldHaveSamePositionBoundActiveAndStandBy broken Key: KAFKA-15957 URL: https://issues.apache.org/jira/browse/KAFKA-15957 Project: Kafka Issue Type: Bug Reporter: Justine Olshan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15984) Client disconnections can cause hanging transactions on __consumer_offsets
Justine Olshan created KAFKA-15984: -- Summary: Client disconnections can cause hanging transactions on __consumer_offsets Key: KAFKA-15984 URL: https://issues.apache.org/jira/browse/KAFKA-15984 Project: Kafka Issue Type: Task Reporter: Justine Olshan When investigating frequent hanging transactions on __consumer_offsets partitions, we realized that many of them were caused by the same offset being committed in duplicate, one of the duplicates with `"isDisconnectedClient":true`. TxnOffsetCommits do not have sequence numbers and thus are not protected against duplicates in the same way idempotent produce requests are. Thus, when a client is disconnected (and flushes its requests), we may see the duplicate get appended to the log. KIP-890 part 1 should protect against this as the duplicate will not succeed verification. KIP-890 part 2 strengthens this further as duplicates (from previous transactions) cannot be added to new transactions if the partition is re-added, since the epoch will be bumped. Another possible solution is to do duplicate checking on the group coordinator side when the request comes in. This solution could be used instead of KIP-890 part 1 to prevent hanging transactions but given that part 1 only has one open PR remaining, we may not need to do this. However, this can also prevent duplicates from being added to a new transaction – something only part 2 will protect against. -- This message was sent by Atlassian Jira (v8.20.10#820010)
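One possible shape of the coordinator-side duplicate check mentioned in this ticket is sketched below: remember the last transactional offset commit per (producerId, partition) and reject an identical retry. This is purely an illustration of the idea under assumed names; the real group coordinator performs no such check, and KIP-890 takes a different route.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical coordinator-side dedup for TxnOffsetCommit, which has no
// sequence numbers of its own; names are assumptions for illustration.
final class TxnOffsetCommitDedup {
    private final Map<String, Long> lastCommitted = new HashMap<>();

    // Returns true if the commit should be appended, false if it repeats
    // exactly what this producer last committed for the partition.
    boolean shouldAppend(long producerId, String topicPartition, long offset) {
        String key = producerId + ":" + topicPartition;
        Long previous = lastCommitted.put(key, offset);
        return previous == null || previous != offset;
    }
}
```

A disconnected client's flushed retry of the same commit would then be dropped instead of appended as a second, potentially hanging, transactional record.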
[jira] [Created] (KAFKA-15987) Refactor ReplicaManager code
Justine Olshan created KAFKA-15987: -- Summary: Refactor ReplicaManager code Key: KAFKA-15987 URL: https://issues.apache.org/jira/browse/KAFKA-15987 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan I started to do this in KAFKA-15784, but the diff was deemed too large and confusing. I just wanted to file a followup ticket to reference this in code for the areas that will be refactored. I hope to tackle it immediately after. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16122) TransactionsBounceTest -- server disconnected before response was received
[ https://issues.apache.org/jira/browse/KAFKA-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16122. Resolution: Fixed > TransactionsBounceTest -- server disconnected before response was received > -- > > Key: KAFKA-16122 > URL: https://issues.apache.org/jira/browse/KAFKA-16122 > Project: Kafka > Issue Type: Test > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Major > > I noticed a ton of tests failing with > h4. > {code:java} > Error org.apache.kafka.common.KafkaException: Unexpected error in > TxnOffsetCommitResponse: The server disconnected before a response was > received. {code} > {code:java} > Stacktrace org.apache.kafka.common.KafkaException: Unexpected error in > TxnOffsetCommitResponse: The server disconnected before a response was > received. at > app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(TransactionManager.java:1702) > at > app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1236) > at > app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154) > at > app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:608) > at app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:600) > at > app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:457) > at > app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:334) > at > app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:249) > at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583) {code} > The error indicates a network error which is retriable but the > TxnOffsetCommit handler doesn't expect this. > https://issues.apache.org/jira/browse/KAFKA-14417 addressed many of the other > requests but not this one. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
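A fix along the lines the ticket above suggests (classifying the disconnect as retriable rather than unexpected) could look like the sketch below. The error names mirror Kafka's `Errors` enum, but the handler itself and the set of retriable errors are assumptions, not the actual `TransactionManager` code.

```python
# Illustrative sketch: classify TxnOffsetCommitResponse errors so a
# NETWORK_EXCEPTION (server disconnected) is re-enqueued for retry
# instead of surfacing as an unexpected KafkaException.

RETRIABLE = {"NETWORK_EXCEPTION", "COORDINATOR_NOT_AVAILABLE",
             "COORDINATOR_LOAD_IN_PROGRESS", "NOT_COORDINATOR"}

def handle_txn_offset_commit_error(error):
    """Return the action the sender should take for a given error name."""
    if error is None or error == "NONE":
        return "complete"
    if error in RETRIABLE:
        return "reenqueue"  # retry after coordinator lookup / backoff
    return "fatal"          # genuinely unexpected: fail the transaction
```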
[jira] [Resolved] (KAFKA-15987) Refactor ReplicaManager code for transaction verification
[ https://issues.apache.org/jira/browse/KAFKA-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15987. Resolution: Fixed > Refactor ReplicaManager code for transaction verification > - > > Key: KAFKA-15987 > URL: https://issues.apache.org/jira/browse/KAFKA-15987 > Project: Kafka > Issue Type: Sub-task > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Major > > I started to do this in KAFKA-15784, but the diff was deemed too large and > confusing. I just wanted to file a followup ticket to reference this in code > for the areas that will be refactored. > > I hope to tackle it immediately after. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15653) NPE in ChunkedByteStream
[ https://issues.apache.org/jira/browse/KAFKA-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15653. Fix Version/s: 3.7.0 3.6.1 Resolution: Fixed > NPE in ChunkedByteStream > > > Key: KAFKA-15653 > URL: https://issues.apache.org/jira/browse/KAFKA-15653 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 3.6.0 > Environment: Docker container on a Linux laptop, using the latest > release. >Reporter: Travis Bischel >Assignee: Justine Olshan >Priority: Major > Fix For: 3.7.0, 3.6.1 > > Attachments: repro.sh > > > When looping franz-go integration tests, I received an UNKNOWN_SERVER_ERROR > from producing. The broker logs for the failing request: > > {noformat} > [2023-10-19 22:29:58,160] ERROR [ReplicaManager broker=2] Error processing > append operation on partition > 2fa8995d8002fbfe68a96d783f26aa2c5efc15368bf44ed8f2ab7e24b41b9879-24 > (kafka.server.ReplicaManager) > java.lang.NullPointerException > at > org.apache.kafka.common.utils.ChunkedBytesStream.(ChunkedBytesStream.java:89) > at > org.apache.kafka.common.record.CompressionType$3.wrapForInput(CompressionType.java:105) > at > org.apache.kafka.common.record.DefaultRecordBatch.recordInputStream(DefaultRecordBatch.java:273) > at > org.apache.kafka.common.record.DefaultRecordBatch.compressedIterator(DefaultRecordBatch.java:277) > at > org.apache.kafka.common.record.DefaultRecordBatch.skipKeyValueIterator(DefaultRecordBatch.java:352) > at > org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsetsCompressed(LogValidator.java:358) > at > org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsets(LogValidator.java:165) > at kafka.log.UnifiedLog.append(UnifiedLog.scala:805) > at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:719) > at > kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1313) > at 
kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1301) > at > kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:1210) > at > scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28) > at > scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27) > at scala.collection.mutable.HashMap.map(HashMap.scala:35) > at > kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:1198) > at kafka.server.ReplicaManager.appendEntries$1(ReplicaManager.scala:754) > at > kafka.server.ReplicaManager.$anonfun$appendRecords$18(ReplicaManager.scala:874) > at > kafka.server.ReplicaManager.$anonfun$appendRecords$18$adapted(ReplicaManager.scala:874) > at > kafka.server.KafkaRequestHandler$.$anonfun$wrap$3(KafkaRequestHandler.scala:73) > at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:130) > at java.base/java.lang.Thread.run(Unknown Source) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful
Justine Olshan created KAFKA-16570: -- Summary: FenceProducers API returns "unexpected error" when successful Key: KAFKA-16570 URL: https://issues.apache.org/jira/browse/KAFKA-16570 Project: Kafka Issue Type: Bug Reporter: Justine Olshan Assignee: Justine Olshan When we want to fence a producer using the admin client, we send an InitProducerId request. There is logic in that API to fence (and abort) any ongoing transactions and that is what the API relies on to fence the producer. However, this handling also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we want to actually get a new producer ID and want to retry until the ID is supplied or we time out. [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170] In the case of fencing a producer, we don't retry; instead we have no handling for concurrent transactions and log a message about an unexpected error. [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112] This error is not actually unexpected, though, and the operation was successful. We should just swallow this error and treat this as a successful run of the command. -- This message was sent by Atlassian Jira (v8.20.10#820010)
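The proposed fix in the ticket above (swallowing CONCURRENT_TRANSACTIONS in the fence path) can be sketched like this. The error names follow Kafka's `Errors` enum, but the handler is an illustrative model, not the AdminClient's actual `FenceProducersHandler`.

```python
# Hypothetical sketch: when fencing a producer via InitProducerId,
# CONCURRENT_TRANSACTIONS only means an ongoing transaction was found
# and aborted, so the fence actually succeeded and the error should be
# treated as success rather than logged as unexpected.

def handle_fence_response(error):
    """Map an InitProducerId error name to the FenceProducers outcome."""
    if error in (None, "NONE", "CONCURRENT_TRANSACTIONS"):
        return "fenced"  # swallow CONCURRENT_TRANSACTIONS as success
    if error == "TRANSACTIONAL_ID_AUTHORIZATION_FAILED":
        return "auth_failed"
    return "unexpected_error"
```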
[jira] [Resolved] (KAFKA-16513) Allow WriteTxnMarkers API with Alter Cluster Permission
[ https://issues.apache.org/jira/browse/KAFKA-16513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16513. Resolution: Fixed > Allow WriteTxnMarkers API with Alter Cluster Permission > --- > > Key: KAFKA-16513 > URL: https://issues.apache.org/jira/browse/KAFKA-16513 > Project: Kafka > Issue Type: Improvement > Components: admin >Reporter: Nikhil Ramakrishnan >Assignee: Siddharth Yagnik >Priority: Minor > Labels: KIP-1037 > Fix For: 3.8.0 > > > We should allow WriteTxnMarkers API with Alter Cluster Permission because it > can be invoked externally by a Kafka AdminClient. Such usage is more aligned > with the Alter permission on the Cluster resource, which includes other > administrative actions invoked from the Kafka AdminClient. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16451) testDeltaFollower tests failing in ReplicaManager
[ https://issues.apache.org/jira/browse/KAFKA-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16451. Resolution: Duplicate > testDeltaFollower tests failing in ReplicaManager > - > > Key: KAFKA-16451 > URL: https://issues.apache.org/jira/browse/KAFKA-16451 > Project: Kafka > Issue Type: Bug > Reporter: Justine Olshan >Priority: Major > > many ReplicaManagerTests with the prefix testDeltaFollower seem to be > failing. A few other ReplicaManager tests as well. See existing failures in > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2765/tests] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16451) testDeltaFollower tests failing in ReplicaManager
Justine Olshan created KAFKA-16451: -- Summary: testDeltaFollower tests failing in ReplicaManager Key: KAFKA-16451 URL: https://issues.apache.org/jira/browse/KAFKA-16451 Project: Kafka Issue Type: Bug Reporter: Justine Olshan Many ReplicaManagerTests with the prefix testDeltaFollower seem to be failing, along with a few other ReplicaManager tests. See existing failures in [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2765/tests] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16302) Builds failing due to streams test execution failures
Justine Olshan created KAFKA-16302: -- Summary: Builds failing due to streams test execution failures Key: KAFKA-16302 URL: https://issues.apache.org/jira/browse/KAFKA-16302 Project: Kafka Issue Type: Task Reporter: Justine Olshan I'm seeing this on master and many PR builds for all versions: ``` [2024-02-22T14:37:07.076Z] * What went wrong: [2024-02-22T14:37:07.076Z] Execution failed for task ':streams:test'. [2024-02-22T14:37:07.076Z] > The following test methods could not be retried, which is unexpected. Please file a bug report at https://github.com/gradle/test-retry-gradle-plugin/issues [2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69] [2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4] [2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26] [2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b] ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16302) Builds failing due to streams test execution failures
[ https://issues.apache.org/jira/browse/KAFKA-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16302. Resolution: Fixed > Builds failing due to streams test execution failures > - > > Key: KAFKA-16302 > URL: https://issues.apache.org/jira/browse/KAFKA-16302 > Project: Kafka > Issue Type: Task > Components: streams, unit tests > Reporter: Justine Olshan > Assignee: Justine Olshan >Priority: Major > > I'm seeing this on master and many PR builds for all versions: > > {code:java} > [2024-02-22T14:37:07.076Z] * What went wrong: > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426[2024-02-22T14:37:07.076Z] > Execution failed for task ':streams:test'. > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427[2024-02-22T14:37:07.076Z] > > The following test methods could not be retried, which is unexpected. > Please file a bug report at > https://github.com/gradle/test-retry-gradle-plugin/issues > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428[2024-02-22T14:37:07.076Z] > > org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429[2024-02-22T14:37:07.076Z] > > org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430[2024-02-22T14:37:07.076Z] > > 
org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431[2024-02-22T14:37:07.076Z] > > org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432[2024-02-22T14:37:07.076Z] > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16308) Formatting and Updating Kafka Features
Justine Olshan created KAFKA-16308: -- Summary: Formatting and Updating Kafka Features Key: KAFKA-16308 URL: https://issues.apache.org/jira/browse/KAFKA-16308 Project: Kafka Issue Type: Task Reporter: Justine Olshan Assignee: Justine Olshan As part of KIP-1022, we need to extend the storage and upgrade tools to support features other than metadata version. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Formatting+and+Updating+Features -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16841) ZKMigrationIntegrationTests broken
[ https://issues.apache.org/jira/browse/KAFKA-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16841. Resolution: Fixed fixed by https://github.com/apache/kafka/commit/bac8df56ffdf8a64ecfb78ec0779bcbc8e9f7c10 > ZKMigrationIntegrationTests broken > -- > > Key: KAFKA-16841 > URL: https://issues.apache.org/jira/browse/KAFKA-16841 > Project: Kafka > Issue Type: Task > Reporter: Justine Olshan >Priority: Blocker > > A recent merge to trunk seems to have broken tests so that I see 78 failures > in the CI. > I see lots of timeout errors and `Alter Topic Configs had an error` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16692. Fix Version/s: 3.6.3 Resolution: Fixed > InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not > enabled when upgrading from kafka 3.5 to 3.6 > > > Key: KAFKA-16692 > URL: https://issues.apache.org/jira/browse/KAFKA-16692 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 3.7.0, 3.6.1, 3.8 >Reporter: Johnson Okorie >Assignee: Justine Olshan >Priority: Major > Fix For: 3.7.1, 3.6.3, 3.8 > > > We have a kafka cluster running on version 3.5.2 that we are upgrading to > 3.6.1. This cluster has a lot of clients with exactly-once semantics enabled > and hence creating transactions. As we replaced brokers with the new > binaries, we observed lots of clients in the cluster experiencing the > following error: > {code:java} > 2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, > transactionalId=] Got error produce response with > correlation id 6402937 on topic-partition , retrying > (2147483512 attempts left). Error: NETWORK_EXCEPTION. 
Error Message: The > server disconnected before a response was received.{code} > On inspecting the broker, we saw the following errors on brokers still > running Kafka version 3.5.2: > > {code:java} > message: > Closing socket for because of error > exception_exception_class: > org.apache.kafka.common.errors.InvalidRequestException > exception_exception_message: > Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not > enabled > exception_stacktrace: > org.apache.kafka.common.errors.InvalidRequestException: Received request api > key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled > {code} > On the new brokers running 3.6.1 we saw the following errors: > > {code:java} > [AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for > node 1043 with a network exception.{code} > > I can also see this: > {code:java} > [AddPartitionsToTxnManager broker=1055]Cancelled in-flight > ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 > being disconnected (elapsed time since creation: 11ms, elapsed time since > send: 4ms, request timeout: 3ms){code} > We started investigating this issue and digging through the changes in 3.6, > we came across some changes introduced as part of KAFKA-14402 that we thought > might lead to this behaviour. > First we could see that _transaction.partition.verification.enable_ is > enabled by default and enables a new code path that culminates in sending > version 4 ADD_PARTITIONS_TO_TXN requests to other brokers that are generated > [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269]. 
> From a > [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] > on the mailing list, [~jolshan] pointed out that this scenario shouldn't be > possible as the following code paths should prevent version 4 > ADD_PARTITIONS_TO_TXN requests being sent to other brokers: > [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130] > > [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195] > However, these requests are still sent to other brokers in our environment. > On further inspection of the code, I am wondering if the following code path > could lead to this issue: > [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500] > In this scenario, we don't have any _NodeApiVersions_ available for the > specified nodeId and potentially skipping the _latestUsableVersion_ check. I > am wondering if it is possible that because _discoverBrokerVersions_ is set > to _false_ for the network client of the {_}AddPartitionsToTxnManager{_}, it > skips fetching {_}NodeApiVersions{_}? I can see that we create the network > client here: > [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641] > The _NetworkUtils.buildNetworkClient_ method
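The version-negotiation gap the reporter describes can be modeled with a short sketch. This is a simplified, hypothetical model of the clamp that `NodeApiVersions.latestUsableVersion` performs, not the actual `NetworkClient` code; the point is only that skipping the clamp when no ApiVersions data is known falls back to the client's own latest version.

```python
# Simplified model of the gap described above: when no ApiVersions data
# is known for a node (e.g. discoverBrokerVersions is disabled), the
# latest-usable-version clamp is skipped and the client uses its own
# latest version, which an older broker may reject as "not enabled".

def choose_request_version(node_api_versions, local_oldest, local_latest):
    if node_api_versions is None:
        # The gap: nothing clamps us down to what the broker supports.
        return local_latest
    broker_min, broker_max = node_api_versions
    usable = min(local_latest, broker_max)
    if usable < max(local_oldest, broker_min):
        raise ValueError("no usable version for this node")
    return usable
```

With the clamp applied, a 3.5 broker advertising at most version 3 would never be sent a version 4 request; without it, version 4 goes out and is rejected.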
[jira] [Created] (KAFKA-16866) RemoteLogManagerTest.testCopyQuotaManagerConfig failing
Justine Olshan created KAFKA-16866: -- Summary: RemoteLogManagerTest.testCopyQuotaManagerConfig failing Key: KAFKA-16866 URL: https://issues.apache.org/jira/browse/KAFKA-16866 Project: Kafka Issue Type: Test Affects Versions: 3.8.0 Reporter: Justine Olshan Seems like this test introduced in [https://github.com/apache/kafka/pull/15625] is failing consistently. org.opentest4j.AssertionFailedError: Expected :61 Actual :11 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16841) ZKIntegrationTests broken
Justine Olshan created KAFKA-16841: -- Summary: ZKIntegrationTests broken Key: KAFKA-16841 URL: https://issues.apache.org/jira/browse/KAFKA-16841 Project: Kafka Issue Type: Task Reporter: Justine Olshan A recent merge to trunk seems to have broken tests so that I see 78 failures in the CI. I see lots of timeout errors and `Alter Topic Configs had an error` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16990) Unrecognised flag passed to kafka-storage.sh in system test
[ https://issues.apache.org/jira/browse/KAFKA-16990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16990. Fix Version/s: 3.9.0 Resolution: Fixed > Unrecognised flag passed to kafka-storage.sh in system test > --- > > Key: KAFKA-16990 > URL: https://issues.apache.org/jira/browse/KAFKA-16990 > Project: Kafka > Issue Type: Test >Affects Versions: 3.8.0 >Reporter: Gaurav Narula > Assignee: Justine Olshan >Priority: Major > Fix For: 3.8.0, 3.9.0 > > > Running > {{TC_PATHS="tests/kafkatest/tests/core/kraft_upgrade_test.py::TestKRaftUpgrade" > bash tests/docker/run_tests.sh}} on trunk (c4a3d2475f) fails with the > following: > {code:java} > [INFO:2024-06-18 09:16:03,139]: Triggering test 2 of 32... > [INFO:2024-06-18 09:16:03,147]: RunnerClient: Loading test {'directory': > '/opt/kafka-dev/tests/kafkatest/tests/core', 'file_name': > 'kraft_upgrade_test.py', 'cls_name': 'TestKRaftUpgrade', 'method_name': > 'test_isolated_mode_upgrade', 'injected_args': {'from_kafka_version': > '3.1.2', 'use_new_coordinator': True, 'metadata_quorum': 'ISOLATED_KRAFT'}} > [INFO:2024-06-18 09:16:03,151]: RunnerClient: > kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: > on run 1/1 > [INFO:2024-06-18 09:16:03,153]: RunnerClient: > kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: > Setting up... > [INFO:2024-06-18 09:16:03,153]: RunnerClient: > kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: > Running... 
> [INFO:2024-06-18 09:16:05,999]: RunnerClient: > kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: > Tearing down... > [INFO:2024-06-18 09:16:12,366]: RunnerClient: > kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: > FAIL: RemoteCommandError({'ssh_config': {'host': 'ducker10', 'hostname': > 'ducker10', 'user': 'ducker', 'port': 22, 'password': '', 'identityfile': > '/home/ducker/.ssh/id_rsa', 'connecttimeout': None}, 'hostname': 'ducker10', > 'ssh_hostname': 'ducker10', 'user': 'ducker', 'externally_routable_ip': > 'ducker10', '_logger': kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT-2 > (DEBUG)>, 'os': 'linux', '_ssh_client': 0x85bccc70>, '_sftp_client': 0x85bccdf0>, '_custom_ssh_exception_checks': None}, > '/opt/kafka-3.1.2/bin/kafka-storage.sh format --ignore-formatted --config > /mnt/kafka/kafka.properties --cluster-id I2eXt9rvSnyhct8BYmW6-w -f > group.version=1', 1, b"usage: kafka-storage format [-h] --config CONFIG > --cluster-id CLUSTER_ID\n > [--ignore-formatted]\nkafka-storage: error: unrecognized arguments: '-f'\n") > Traceback (most recent call last): > File > "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", > line 186, in _do_run > data = self.run_test() > File > "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", > line 246, in run_test > return self.test_context.function(self.test) > File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line > 433, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File "/opt/kafka-dev/tests/kafkatest/tests/core/kraft_upgrade_test.py", > line 132, in test_isolated_mode_upgrade > 
self.run_upgrade(from_kafka_version, group_protocol) > File "/opt/kafka-dev/tests/kafkatest/tests/core/kraft_upgrade_test.py", > line 96, in run_upgrade > self.kafka.start() > File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 669, in > start > self.isolated_controller_quorum.start() > File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 671, in > start > Service.start(self, **kwargs) > File "/usr/local/lib/python3.9/dist-packages/ducktape/services/service.py", > line 265, in start >
[jira] [Created] (KAFKA-17050) Revert group.version for 3.8 and 3.9
Justine Olshan created KAFKA-17050: -- Summary: Revert group.version for 3.8 and 3.9 Key: KAFKA-17050 URL: https://issues.apache.org/jira/browse/KAFKA-17050 Project: Kafka Issue Type: Task Affects Versions: 3.8.0, 3.9.0 Reporter: Justine Olshan Assignee: Justine Olshan After much discussion on KAFKA-17011, we decided it would be best to just remove the group version feature for 3.8. As for 3.9, [~dajac] said it would be easier for EA users of the group coordinator to have a single way to configure it. For 4.0 we can reintroduce it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (KAFKA-16990) Unrecognised flag passed to kafka-storage.sh in system test
[ https://issues.apache.org/jira/browse/KAFKA-16990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan reopened KAFKA-16990: > Unrecognised flag passed to kafka-storage.sh in system test > --- > > Key: KAFKA-16990 > URL: https://issues.apache.org/jira/browse/KAFKA-16990 > Project: Kafka > Issue Type: Test >Affects Versions: 3.8.0 >Reporter: Gaurav Narula > Assignee: Justine Olshan >Priority: Blocker > Fix For: 3.8.0 > > > Running > {{TC_PATHS="tests/kafkatest/tests/core/kraft_upgrade_test.py::TestKRaftUpgrade" > bash tests/docker/run_tests.sh}} on trunk (c4a3d2475f) fails with the > following: > {code:java} > [INFO:2024-06-18 09:16:03,139]: Triggering test 2 of 32... > [INFO:2024-06-18 09:16:03,147]: RunnerClient: Loading test {'directory': > '/opt/kafka-dev/tests/kafkatest/tests/core', 'file_name': > 'kraft_upgrade_test.py', 'cls_name': 'TestKRaftUpgrade', 'method_name': > 'test_isolated_mode_upgrade', 'injected_args': {'from_kafka_version': > '3.1.2', 'use_new_coordinator': True, 'metadata_quorum': 'ISOLATED_KRAFT'}} > [INFO:2024-06-18 09:16:03,151]: RunnerClient: > kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: > on run 1/1 > [INFO:2024-06-18 09:16:03,153]: RunnerClient: > kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: > Setting up... > [INFO:2024-06-18 09:16:03,153]: RunnerClient: > kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: > Running... > [INFO:2024-06-18 09:16:05,999]: RunnerClient: > kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: > Tearing down... 
> [INFO:2024-06-18 09:16:12,366]: RunnerClient: > kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: > FAIL: RemoteCommandError({'ssh_config': {'host': 'ducker10', 'hostname': > 'ducker10', 'user': 'ducker', 'port': 22, 'password': '', 'identityfile': > '/home/ducker/.ssh/id_rsa', 'connecttimeout': None}, 'hostname': 'ducker10', > 'ssh_hostname': 'ducker10', 'user': 'ducker', 'externally_routable_ip': > 'ducker10', '_logger': kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT-2 > (DEBUG)>, 'os': 'linux', '_ssh_client': 0x85bccc70>, '_sftp_client': 0x85bccdf0>, '_custom_ssh_exception_checks': None}, > '/opt/kafka-3.1.2/bin/kafka-storage.sh format --ignore-formatted --config > /mnt/kafka/kafka.properties --cluster-id I2eXt9rvSnyhct8BYmW6-w -f > group.version=1', 1, b"usage: kafka-storage format [-h] --config CONFIG > --cluster-id CLUSTER_ID\n > [--ignore-formatted]\nkafka-storage: error: unrecognized arguments: '-f'\n") > Traceback (most recent call last): > File > "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", > line 186, in _do_run > data = self.run_test() > File > "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", > line 246, in run_test > return self.test_context.function(self.test) > File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line > 433, in wrapper > return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs) > File "/opt/kafka-dev/tests/kafkatest/tests/core/kraft_upgrade_test.py", > line 132, in test_isolated_mode_upgrade > self.run_upgrade(from_kafka_version, group_protocol) > File "/opt/kafka-dev/tests/kafkatest/tests/core/kraft_upgrade_test.py", > line 96, in run_upgrade > self.kafka.start() > File 
"/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 669, in > start > self.isolated_controller_quorum.start() > File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 671, in > start > Service.start(self, **kwargs) > File "/usr/local/lib/python3.9/dist-packages/ducktape/services/service.py", > line 265, in start > self.start_node(node, **kwargs) > File "/op
[jira] [Resolved] (KAFKA-16990) Unrecognised flag passed to kafka-storage.sh in system test
[ https://issues.apache.org/jira/browse/KAFKA-16990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justine Olshan resolved KAFKA-16990.
    Resolution: Fixed

> Unrecognised flag passed to kafka-storage.sh in system test
> -----------------------------------------------------------
>
>                 Key: KAFKA-16990
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16990
>             Project: Kafka
>          Issue Type: Test
>    Affects Versions: 3.8.0
>            Reporter: Gaurav Narula
>            Assignee: Justine Olshan
>            Priority: Blocker
>             Fix For: 3.8.0
>
> Running {{TC_PATHS="tests/kafkatest/tests/core/kraft_upgrade_test.py::TestKRaftUpgrade" bash tests/docker/run_tests.sh}} on trunk (c4a3d2475f) fails with the following:
> {code:java}
> [INFO:2024-06-18 09:16:03,139]: Triggering test 2 of 32...
> [INFO:2024-06-18 09:16:03,147]: RunnerClient: Loading test {'directory': '/opt/kafka-dev/tests/kafkatest/tests/core', 'file_name': 'kraft_upgrade_test.py', 'cls_name': 'TestKRaftUpgrade', 'method_name': 'test_isolated_mode_upgrade', 'injected_args': {'from_kafka_version': '3.1.2', 'use_new_coordinator': True, 'metadata_quorum': 'ISOLATED_KRAFT'}}
> [INFO:2024-06-18 09:16:03,151]: RunnerClient: kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: on run 1/1
> [INFO:2024-06-18 09:16:03,153]: RunnerClient: kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: Setting up...
> [INFO:2024-06-18 09:16:03,153]: RunnerClient: kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: Running...
> [INFO:2024-06-18 09:16:05,999]: RunnerClient: kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: Tearing down...
> [INFO:2024-06-18 09:16:12,366]: RunnerClient: kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT: FAIL: RemoteCommandError({'ssh_config': {'host': 'ducker10', 'hostname': 'ducker10', 'user': 'ducker', 'port': 22, 'password': '', 'identityfile': '/home/ducker/.ssh/id_rsa', 'connecttimeout': None}, 'hostname': 'ducker10', 'ssh_hostname': 'ducker10', 'user': 'ducker', 'externally_routable_ip': 'ducker10', '_logger': kafkatest.tests.core.kraft_upgrade_test.TestKRaftUpgrade.test_isolated_mode_upgrade.from_kafka_version=3.1.2.use_new_coordinator=True.metadata_quorum=ISOLATED_KRAFT-2 (DEBUG)>, 'os': 'linux', '_ssh_client': 0x85bccc70>, '_sftp_client': 0x85bccdf0>, '_custom_ssh_exception_checks': None}, '/opt/kafka-3.1.2/bin/kafka-storage.sh format --ignore-formatted --config /mnt/kafka/kafka.properties --cluster-id I2eXt9rvSnyhct8BYmW6-w -f group.version=1', 1, b"usage: kafka-storage format [-h] --config CONFIG --cluster-id CLUSTER_ID\n                            [--ignore-formatted]\nkafka-storage: error: unrecognized arguments: '-f'\n")
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 186, in _do_run
>     data = self.run_test()
>   File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 246, in run_test
>     return self.test_context.function(self.test)
>   File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 433, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File "/opt/kafka-dev/tests/kafkatest/tests/core/kraft_upgrade_test.py", line 132, in test_isolated_mode_upgrade
>     self.run_upgrade(from_kafka_version, group_protocol)
>   File "/opt/kafka-dev/tests/kafkatest/tests/core/kraft_upgrade_test.py", line 96, in run_upgrade
>     self.kafka.start()
>   File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 669, in start
>     self.isolated_controller_quorum.start()
>   File "/opt/kafka-dev/tests/kafkatest/services/kafka/kafka.py", line 671, in start
>     Service.start(self, **kwargs)
>   File "/usr/local/lib/python3.9/dist-packages/ducktape/services/service.py", line 265, in start
>     self.start_node(node, **kwargs)
> {code}
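The root cause in KAFKA-16990 is that the test passed the newer `-f group.version=1` feature flag to the `kafka-storage.sh` shipped with 3.1.2, which predates feature flags entirely. A minimal sketch of the fix pattern — gating the flag on the version being formatted — where the helper name and the `(3, 8)` cutoff are illustrative, not the actual kafkatest code:

```python
# Sketch only: build the kafka-storage.sh format command line, adding the
# feature flag only for versions whose tool accepts "-f". Older tools
# (e.g. 3.1.2) fail with "unrecognized arguments: '-f'".

def storage_format_args(kafka_home, config_path, cluster_id, version,
                        use_new_coordinator):
    """Return the format command for a node running `version` (a tuple)."""
    args = [
        "%s/bin/kafka-storage.sh" % kafka_home, "format",
        "--ignore-formatted",
        "--config", config_path,
        "--cluster-id", cluster_id,
    ]
    # Illustrative cutoff: group.version only exists in recent releases.
    if use_new_coordinator and version >= (3, 8):
        args += ["-f", "group.version=1"]
    return args
```

With this guard, formatting an old broker's storage directory simply omits the flag instead of crashing the whole upgrade test during setup.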
[jira] [Resolved] (KAFKA-17011) SupportedFeatures.MinVersion incorrectly blocks v0
[ https://issues.apache.org/jira/browse/KAFKA-17011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justine Olshan resolved KAFKA-17011.
    Resolution: Fixed

> SupportedFeatures.MinVersion incorrectly blocks v0
> --------------------------------------------------
>
>                 Key: KAFKA-17011
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17011
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.8.0
>            Reporter: Colin McCabe
>            Assignee: Colin McCabe
>            Priority: Critical
>             Fix For: 3.9.0
>
> SupportedFeatures.MinVersion incorrectly blocks v0

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
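The bug class in KAFKA-17011 is easy to illustrate: if a supported-version range treats 1 as its floor, then level 0 — conventionally the "feature disabled" level — is rejected as unsupported. A toy model of the check (this is not Kafka's actual `SupportedVersionRange` code; names and logic are schematic):

```python
# Toy model of a feature's supported-version range check.

class SupportedVersionRange:
    def __init__(self, min_version, max_version):
        self.min_version = min_version
        self.max_version = max_version

    def is_incompatible_with(self, level):
        # Level 0 means "feature disabled" and must always be permitted;
        # checking it against a min_version of 1 is the bug described above.
        if level == 0:
            return False
        return not (self.min_version <= level <= self.max_version)
```

Without the explicit `level == 0` carve-out, a range of `[1, 3]` would report level 0 as incompatible, blocking clusters from running with the feature turned off.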
[jira] [Created] (KAFKA-17250) Many system tests failing with org.apache.kafka.common.errors.UnsupportedVersionException: Attempted to write a non-default replicaDirectoryId at version 13
Justine Olshan created KAFKA-17250:
--------------------------------------

             Summary: Many system tests failing with org.apache.kafka.common.errors.UnsupportedVersionException: Attempted to write a non-default replicaDirectoryId at version 13
                 Key: KAFKA-17250
                 URL: https://issues.apache.org/jira/browse/KAFKA-17250
             Project: Kafka
          Issue Type: Task
    Affects Versions: 3.9.0
            Reporter: Justine Olshan


I see a lot of KRaft system tests that test different versions failing with this error.
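The exception text points at a common serialization guard: a writer refusing to emit a non-default value for a field at an RPC version that cannot carry it. A schematic sketch of that guard — the function, the all-zero default, and the version-14 cutoff are all illustrative here, not Kafka's generated message code:

```python
# Illustrative only: a message writer rejects non-default values for fields
# the negotiated version does not support, rather than silently dropping them.

ZERO_UUID = "00000000-0000-0000-0000-000000000000"  # assumed default directory id

def check_replica_directory_id(version, replica_directory_id):
    """Raise if replicaDirectoryId cannot be written at this RPC version."""
    # Hypothetical cutoff: suppose the field only exists from version 14 on.
    if version < 14 and replica_directory_id != ZERO_UUID:
        raise ValueError(
            "Attempted to write a non-default replicaDirectoryId "
            "at version %d" % version)
```

Under this reading, the flood of failures would come from newer brokers populating a real directory id while older test versions negotiate an RPC version below the field's cutoff, so the fix is to leave the field at its default when the version is too low.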