[jira] [Updated] (KAFKA-16041) Replace Afterburn module with Blackbird
[ https://issues.apache.org/jira/browse/KAFKA-16041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-16041:
    Component/s: KafkaConnect
        (was: connect)

> Replace Afterburn module with Blackbird
> ---------------------------------------
>
> Key: KAFKA-16041
> URL: https://issues.apache.org/jira/browse/KAFKA-16041
> Project: Kafka
> Issue Type: Task
> Components: KafkaConnect
> Reporter: Mario Fiore Vitale
> Priority: Major
> Fix For: 4.0.0
>
> [Blackbird|https://github.com/FasterXML/jackson-modules-base/blob/master/blackbird/README.md] is the Afterburner replacement for Java 11+.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
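Since the ticket itself is just a pointer to the Blackbird README, a sketch of what the swap usually involves may be useful. The Gradle syntax and version number below are illustrative assumptions, not taken from the Kafka build:

```
// build.gradle (sketch): swap the Jackson Afterburner module for Blackbird.
// Coordinates follow the FasterXML jackson-modules-base project; the
// version shown is illustrative.
implementation 'com.fasterxml.jackson.module:jackson-module-blackbird:2.16.0'
// was: implementation 'com.fasterxml.jackson.module:jackson-module-afterburner:2.16.0'

// In code, registration changes from AfterburnerModule to BlackbirdModule:
//   ObjectMapper mapper = JsonMapper.builder()
//       .addModule(new com.fasterxml.jackson.module.blackbird.BlackbirdModule())
//       .build();
```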
[jira] [Updated] (KAFKA-15976) KIP-995: Allow users to specify initial offsets while creating connectors
[ https://issues.apache.org/jira/browse/KAFKA-15976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15976:
    Component/s: KafkaConnect
        (was: connect)

> KIP-995: Allow users to specify initial offsets while creating connectors
> -------------------------------------------------------------------------
>
> Key: KAFKA-15976
> URL: https://issues.apache.org/jira/browse/KAFKA-15976
> Project: Kafka
> Issue Type: Improvement
> Components: KafkaConnect
> Reporter: Ashwin Pankaj
> Assignee: Ashwin Pankaj
> Priority: Major
>
> Allow setting the initial offset for a connector in the connector creation API.
[jira] [Updated] (KAFKA-16051) Deadlock on connector initialization
[ https://issues.apache.org/jira/browse/KAFKA-16051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-16051:
    Component/s: KafkaConnect
        (was: connect)

> Deadlock on connector initialization
> ------------------------------------
>
> Key: KAFKA-16051
> URL: https://issues.apache.org/jira/browse/KAFKA-16051
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 2.6.3, 3.6.1
> Reporter: Octavian Ciubotaru
> Priority: Major
>
> Tested with Kafka 3.6.1 and 2.6.3.
> The only plugin installed is confluentinc-kafka-connect-jdbc-10.7.4.
> Stack trace for Kafka 3.6.1:
> {noformat}
> Found one Java-level deadlock:
> =============================
> "pool-3-thread-1":
>   waiting to lock monitor 0x7fbc88006300 (object 0x91002aa0, a org.apache.kafka.connect.runtime.standalone.StandaloneHerder),
>   which is held by "Thread-9"
> "Thread-9":
>   waiting to lock monitor 0x7fbc88008800 (object 0x9101ccd8, a org.apache.kafka.connect.storage.MemoryConfigBackingStore),
>   which is held by "pool-3-thread-1"
>
> Java stack information for the threads listed above:
> ===================================================
> "pool-3-thread-1":
>   at org.apache.kafka.connect.runtime.standalone.StandaloneHerder$ConfigUpdateListener.onTaskConfigUpdate(StandaloneHerder.java:516)
>   - waiting to lock <0x91002aa0> (a org.apache.kafka.connect.runtime.standalone.StandaloneHerder)
>   at org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:137)
>   - locked <0x9101ccd8> (a org.apache.kafka.connect.storage.MemoryConfigBackingStore)
>   at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483)
>   at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.lambda$null$2(StandaloneHerder.java:229)
>   at org.apache.kafka.connect.runtime.standalone.StandaloneHerder$$Lambda$692/0x000840557440.run(Unknown Source)
>   at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.21/Executors.java:515)
>   at java.util.concurrent.FutureTask.run(java.base@11.0.21/FutureTask.java:264)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.21/ScheduledThreadPoolExecutor.java:304)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.21/ThreadPoolExecutor.java:1128)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.21/ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
> "Thread-9":
>   at org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:129)
>   - waiting to lock <0x9101ccd8> (a org.apache.kafka.connect.storage.MemoryConfigBackingStore)
>   at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483)
>   at org.apache.kafka.connect.runtime.standalone.StandaloneHerder.requestTaskReconfiguration(StandaloneHerder.java:255)
>   - locked <0x91002aa0> (a org.apache.kafka.connect.runtime.standalone.StandaloneHerder)
>   at org.apache.kafka.connect.runtime.HerderConnectorContext.requestTaskReconfiguration(HerderConnectorContext.java:50)
>   at org.apache.kafka.connect.runtime.WorkerConnector$WorkerConnectorContext.requestTaskReconfiguration(WorkerConnector.java:548)
>   at io.confluent.connect.jdbc.source.TableMonitorThread.run(TableMonitorThread.java:86)
>
> Found 1 deadlock.
> {noformat}
> The JDBC source connector loads the table list from the database and updates the connector configuration once the list is available. The deadlock is very consistent in my environment, probably because the database is on the same machine.
> Maybe it is possible to avoid this situation by always locking the herder first and the config backing store second. From what I see, updateConnectorTasks is sometimes called with the herder lock already held and other times without it.
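The lock-ordering fix suggested at the end of the description can be illustrated outside of Kafka. This is a minimal sketch with illustrative names (it is not the real StandaloneHerder code): when every path acquires the "herder" monitor before the "config store" monitor, the cyclic wait shown in the thread dump cannot form.

```java
// Minimal sketch, not Kafka code: two threads, two monitors, one global
// lock order ("herder" first, "config store" second) on every path.
public class LockOrderingDemo {
    static final Object herder = new Object();
    static final Object configStore = new Object();
    static int taskConfigUpdates = 0;

    // Path 1: analogous to updateConnectorTasks -> putTaskConfigs
    static void updateConnectorTasks() {
        synchronized (herder) {          // herder first
            synchronized (configStore) { // store second
                taskConfigUpdates++;
            }
        }
    }

    // Path 2: analogous to requestTaskReconfiguration; same order, so no cycle
    static void requestTaskReconfiguration() {
        synchronized (herder) {
            synchronized (configStore) {
                taskConfigUpdates++;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10_000; i++) updateConnectorTasks(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10_000; i++) requestTaskReconfiguration(); });
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Both threads finish: no deadlock is possible with a single lock order.
        System.out.println(taskConfigUpdates);
    }
}
```

The reported deadlock arises because one path locks the store first and then waits on the herder while the other does the reverse; imposing a single global order on the two monitors removes the cycle.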
[jira] [Updated] (KAFKA-15888) DistributedHerder log context should not use the same client ID for each Connect worker by default
[ https://issues.apache.org/jira/browse/KAFKA-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15888:
    Component/s: (was: connect)

> DistributedHerder log context should not use the same client ID for each Connect worker by default
> --------------------------------------------------------------------------------------------------
>
> Key: KAFKA-15888
> URL: https://issues.apache.org/jira/browse/KAFKA-15888
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Reporter: Yash Mayya
> Assignee: Yash Mayya
> Priority: Minor
> Fix For: 3.7.0
>
> By default, if there is no {{client.id}} configured on a Connect worker running in distributed mode, the same client ID ("connect-1") will be used in the log context for the DistributedHerder class in every single worker in the Connect cluster. This default is quite confusing and not very useful. Further, based on how this default is configured ([ref|https://github.com/apache/kafka/blob/150b0e8290cda57df668ba89f6b422719866de5a/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L299]), it seems like this might have been an unintentional bug. We could simply use the workerId (the advertised host name and port of the worker) by default instead, which should be unique for each worker in a cluster.
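Until a default like the one proposed above lands, the ambiguity can be avoided by giving each worker an explicit {{client.id}} in its distributed worker properties. A sketch; the file name follows the usual Connect convention and the value is purely illustrative:

```
# connect-distributed.properties on one worker (value is illustrative;
# pick something unique per worker, e.g. its advertised host:port)
client.id=connect-worker-host1:8083
```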
[jira] [Updated] (KAFKA-15841) Add Support for Topic-Level Partitioning in Kafka Connect
[ https://issues.apache.org/jira/browse/KAFKA-15841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15841:
    Component/s: KafkaConnect
        (was: connect)

> Add Support for Topic-Level Partitioning in Kafka Connect
> ---------------------------------------------------------
>
> Key: KAFKA-15841
> URL: https://issues.apache.org/jira/browse/KAFKA-15841
> Project: Kafka
> Issue Type: Improvement
> Components: KafkaConnect
> Reporter: Henrique Mota
> Priority: Trivial
>
> In our organization, we utilize JDBC sink connectors to consume data from various topics, where each topic is dedicated to a specific tenant with a single partition. Recently, we developed a custom sink based on the standard JDBC sink, enabling us to pause consumption of a topic when encountering problematic records.
> However, we face limitations within Kafka Connect, as it doesn't allow for appropriate partitioning of topics among workers. We attempted a workaround by breaking down the topics list within the 'topics' parameter. Unfortunately, Kafka Connect overrides this parameter after invoking the {{taskConfigs(int maxTasks)}} method from the {{org.apache.kafka.connect.connector.Connector}} class.
> We request the addition of support in Kafka Connect to enable the partitioning of topics among workers without requiring a fork. This enhancement would facilitate better load distribution and allow for more flexible configurations, particularly in scenarios where topics are dedicated to different tenants.
[jira] [Updated] (KAFKA-15524) Flaky test org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.testResetSinkConnectorOffsetsZombieSinkTasks
[ https://issues.apache.org/jira/browse/KAFKA-15524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15524:
    Component/s: KafkaConnect
        (was: connect)

> Flaky test org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.testResetSinkConnectorOffsetsZombieSinkTasks
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-15524
> URL: https://issues.apache.org/jira/browse/KAFKA-15524
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 3.6.0, 3.5.1
> Reporter: Josep Prat
> Priority: Major
> Labels: flaky, flaky-test
>
> Last seen: [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14458/3/testReport/junit/org.apache.kafka.connect.integration/OffsetsApiIntegrationTest/Build___JDK_17_and_Scala_2_13___testResetSinkConnectorOffsetsZombieSinkTasks/]
>
> h3. Error Message
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: The request timed out.{code}
> h3. Stacktrace
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: The request timed out.
>     at org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:427)
>     at org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:401)
>     at org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:392)
>     at org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.testResetSinkConnectorOffsetsZombieSinkTasks(OffsetsApiIntegrationTest.java:763)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>     at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>     at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>     at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>     at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
>     at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:112)
>     at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
>     at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:40)
>     at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:60)
>     at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:52)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
>     at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
[jira] [Updated] (KAFKA-15570) Add unit tests for MemoryConfigBackingStore
[ https://issues.apache.org/jira/browse/KAFKA-15570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15570:
    Component/s: (was: connect)

> Add unit tests for MemoryConfigBackingStore
> -------------------------------------------
>
> Key: KAFKA-15570
> URL: https://issues.apache.org/jira/browse/KAFKA-15570
> Project: Kafka
> Issue Type: Test
> Components: KafkaConnect
> Reporter: Yash Mayya
> Assignee: Yash Mayya
> Priority: Minor
>
> Currently, the [MemoryConfigBackingStore|https://github.com/apache/kafka/blob/6e164bb9ace3ea7a1a9542904d1a01c9fd3a1b48/connect/runtime/src/main/java/org/apache/kafka/connect/storage/MemoryConfigBackingStore.java#L37] class doesn't have any unit tests for its functionality. While most of its functionality is fairly lightweight today, changes will be introduced with [KIP-980|https://cwiki.apache.org/confluence/display/KAFKA/KIP-980%3A+Allow+creating+connectors+in+a+stopped+state] (potentially [KIP-976|https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect] as well) and it would be good to have a test setup in place before those changes are made.
[jira] [Updated] (KAFKA-15407) Not able to connect to kafka from the Private NLB from outside the VPC account
[ https://issues.apache.org/jira/browse/KAFKA-15407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15407:
    Component/s: KafkaConnect
        (was: connect)

> Not able to connect to kafka from the Private NLB from outside the VPC account
> ------------------------------------------------------------------------------
>
> Key: KAFKA-15407
> URL: https://issues.apache.org/jira/browse/KAFKA-15407
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer, KafkaConnect, producer, protocol
> Environment: Staging, PROD
> Reporter: Shivakumar
> Priority: Blocker
> Attachments: image-2023-08-28-12-37-33-100.png
>
> !image-2023-08-28-12-37-33-100.png|width=768,height=223!
> Problem statement:
> We are trying to connect to Kafka from another account/VPC.
> Our Kafka runs in an EKS cluster, with a service pointing to the pods for connections.
> We tried to create a private link endpoint from Account B to connect to our NLB, in order to reach Kafka in Account A.
> We see connection resets from both the client and the target (Kafka) in the NLB monitoring tab of AWS.
> We tried various combinations of listeners and advertised listeners, which did not help.
> We assume we are missing some combination of listener and network-level configs with which this connection can be made.
> Can you please guide us, as we are blocked on a major migration?
[jira] [Updated] (KAFKA-15470) Allow creating connectors in a stopped state
[ https://issues.apache.org/jira/browse/KAFKA-15470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15470:
    Component/s: (was: connect)

> Allow creating connectors in a stopped state
> --------------------------------------------
>
> Key: KAFKA-15470
> URL: https://issues.apache.org/jira/browse/KAFKA-15470
> Project: Kafka
> Issue Type: New Feature
> Components: KafkaConnect
> Reporter: Yash Mayya
> Assignee: Yash Mayya
> Priority: Major
> Labels: connect, kafka-connect, kip-required
> Fix For: 3.7.0
>
> [KIP-875: First-class offsets support in Kafka Connect|https://cwiki.apache.org/confluence/display/KAFKA/KIP-875%3A+First-class+offsets+support+in+Kafka+Connect] introduced a new {{STOPPED}} state for connectors along with some REST API endpoints to retrieve and modify offsets for connectors. Currently, only connectors that already exist can be stopped, and any newly created connector will always be in the {{RUNNING}} state initially. Allowing the creation of connectors in a {{STOPPED}} state will facilitate multiple new use cases. One interesting use case would be migrating connectors from one Kafka Connect cluster to another. Individual connector migration would be useful in a number of scenarios, such as breaking a large cluster into multiple smaller clusters (or vice versa), or moving a connector from a cluster running in one data center to another.
>
> A connector migration could be achieved using the following sequence of steps:
> # Stop the running connector on the original Kafka Connect cluster
> # Retrieve the offsets for the connector via the {{GET /connectors/\{connector}/offsets}} endpoint
> # Create the connector in a stopped state, using the same configuration, on the new Kafka Connect cluster
> # Alter the offsets for the connector on the new cluster via the {{PATCH /connectors/\{connector}/offsets}} endpoint (using the offsets obtained from the original cluster)
> # Resume the connector on the new cluster and delete it on the original cluster
>
> Another use case for creating connectors in a stopped state could be deploying connectors as part of a larger data pipeline before the source/sink data system has been created or is ready for data transfer.
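The migration steps above can be sketched as REST calls against the two Connect clusters. The stop/resume and offsets endpoints come from KIP-875; the {{initial_state}} field in step 3 is the mechanism this ticket proposes (per KIP-980) and did not exist yet when this was filed. Host names, ports, and the connector name are placeholders:

```
# 1. Stop the connector on the original cluster
PUT http://old-connect:8083/connectors/my-connector/stop

# 2. Retrieve its committed offsets
GET http://old-connect:8083/connectors/my-connector/offsets

# 3. Create it in a STOPPED state on the new cluster (proposed KIP-980 field)
POST http://new-connect:8083/connectors
{ "name": "my-connector", "config": { ... }, "initial_state": "STOPPED" }

# 4. Seed the offsets retrieved in step 2
PATCH http://new-connect:8083/connectors/my-connector/offsets
{ "offsets": [ ... ] }

# 5. Resume on the new cluster, then delete on the original
PUT http://new-connect:8083/connectors/my-connector/resume
DELETE http://old-connect:8083/connectors/my-connector
```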
[jira] [Updated] (KAFKA-15339) Transient I/O error happening in appending records could lead to the halt of whole cluster
[ https://issues.apache.org/jira/browse/KAFKA-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15339:
    Component/s: KafkaConnect
        (was: connect)

> Transient I/O error happening in appending records could lead to the halt of whole cluster
> ------------------------------------------------------------------------------------------
>
> Key: KAFKA-15339
> URL: https://issues.apache.org/jira/browse/KAFKA-15339
> Project: Kafka
> Issue Type: Improvement
> Components: KafkaConnect, producer
> Affects Versions: 3.5.0
> Reporter: Haoze Wu
> Priority: Major
>
> We are running an integration test in which we start an embedded Connect cluster from the active 3.5 branch. Because of a transient disk error, we may encounter an IOException while appending records to one topic, as shown in the stack trace:
> {code:java}
> [2023-08-13 16:53:51,016] ERROR Error while appending records to connect-config-topic-connect-cluster-0 in dir /tmp/EmbeddedKafkaCluster8003464883598783225 (org.apache.kafka.storage.internals.log.LogDirFailureChannel:61)
> java.io.IOException:
>     at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:92)
>     at org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188)
>     at kafka.log.LogSegment.append(LogSegment.scala:161)
>     at kafka.log.LocalLog.append(LocalLog.scala:436)
>     at kafka.log.UnifiedLog.append(UnifiedLog.scala:853)
>     at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:664)
>     at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1281)
>     at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1269)
>     at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:977)
>     at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
>     at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
>     at scala.collection.mutable.HashMap.map(HashMap.scala:35)
>     at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:965)
>     at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:623)
>     at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:680)
>     at kafka.server.KafkaApis.handle(KafkaApis.scala:180)
>     at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:76)
>     at java.lang.Thread.run(Thread.java:748){code}
> However, merely because appending records to one partition failed, the fetchers for all the other partitions are removed, the broker is shut down, and finally the embedded Connect cluster is killed as a whole:
> {code:java}
> [2023-08-13 17:35:37,966] WARN Stopping serving logs in dir /tmp/EmbeddedKafkaCluster6777164631574762227 (kafka.log.LogManager:70)
> [2023-08-13 17:35:37,968] ERROR Shutdown broker because all log dirs in /tmp/EmbeddedKafkaCluster6777164631574762227 have failed (kafka.log.LogManager:143)
> [2023-08-13 17:35:37,968] WARN Abrupt service halt with code 1 and message null (org.apache.kafka.connect.util.clusters.EmbeddedConnectCluster:130)
> [2023-08-13 17:35:37,968] ERROR [LogDirFailureHandler]: Error due to (kafka.server.ReplicaManager$LogDirFailureHandler:135)
> org.apache.kafka.connect.util.clusters.UngracefulShutdownException: Abrupt service halt with code 1 and message null{code}
> I am wondering if we could add a configurable retry around the root cause to tolerate possible I/O faults, so that if the retry succeeds, the embedded Connect cluster can keep operating.
> Any comments and suggestions would be appreciated.
[jira] [Updated] (KAFKA-15680) Partition-Count is not getting updated Correctly in the Incremental Co-operative Rebalancing(ICR) Mode of Rebalancing
[ https://issues.apache.org/jira/browse/KAFKA-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15680:
    Component/s: KafkaConnect
        (was: connect)

> Partition-Count is not getting updated Correctly in the Incremental Co-operative Rebalancing(ICR) Mode of Rebalancing
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-15680
> URL: https://issues.apache.org/jira/browse/KAFKA-15680
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 3.0.1
> Reporter: Pritam Kumar
> Assignee: Pritam Kumar
> Priority: Minor
> Fix For: 3.7.0, 3.6.1
>
> * In ICR (Incremental Cooperative Rebalancing) mode, whenever a new worker, say Worker 3, joins, a new global assignment is computed by the leader, say Worker 1, that results in the revocation of some tasks from each existing worker, i.e. Worker 1 and Worker 2.
> * Once the new member join is completed, {{ConsumerCoordinator.onJoinComplete()}} is called, which primarily computes all the newly assigned partitions and the revoked partitions, and updates the subscription object.
> * If there was a revocation, which we check via the {{partitionsRevoked}} list, we call {{invokePartitionsRevoked()}}, which internally calls {{updatePartitionCount()}}, which fetches the partitions from the *assignment* object, which is not yet updated with the new assignment.
> * It is only just before calling {{invokePartitionsAssigned()}} that we update the *assignment* by invoking {{subscriptions.assignFromSubscribed(assignedPartitions);}}
[jira] [Updated] (KAFKA-15855) RFC 9266: Channel Bindings for TLS 1.3 support | SCRAM-SHA-*-PLUS variants
[ https://issues.apache.org/jira/browse/KAFKA-15855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15855:
    Component/s: KafkaConnect
        (was: connect)

> RFC 9266: Channel Bindings for TLS 1.3 support | SCRAM-SHA-*-PLUS variants
> --------------------------------------------------------------------------
>
> Key: KAFKA-15855
> URL: https://issues.apache.org/jira/browse/KAFKA-15855
> Project: Kafka
> Issue Type: Bug
> Components: core, KafkaConnect, security
> Reporter: Neustradamus
> Priority: Critical
> Labels: security
>
> Dear Apache and Kafka teams,
> Can you add support for RFC 9266: Channel Bindings for TLS 1.3?
> - [https://datatracker.ietf.org/doc/html/rfc9266]
> In short, the channel-binding types are:
> - tls-unique for TLS <= 1.2
> - tls-server-end-point
> - tls-exporter for TLS 1.3
> This is needed for the SCRAM-SHA-*-PLUS variants.
> Note: Some SCRAM-SHA mechanisms are already supported.
> I think you have seen the jabber.ru MITM, and channel binding is the solution:
> - [https://notes.valdikss.org.ru/jabber.ru-mitm/]
> - [https://snikket.org/blog/on-the-jabber-ru-mitm/]
> - [https://www.devever.net/~hl/xmpp-incident]
> - [https://blog.jmp.chat/b/certwatch]
> IETF links:
> SCRAM-SHA-1(-PLUS):
> - RFC 5802: Salted Challenge Response Authentication Mechanism (SCRAM) SASL and GSS-API Mechanisms: [https://tools.ietf.org/html/rfc5802] // July 2010
> - RFC 6120: Extensible Messaging and Presence Protocol (XMPP): Core: [https://tools.ietf.org/html/rfc6120] // March 2011
> SCRAM-SHA-256(-PLUS):
> - RFC 7677: SCRAM-SHA-256 and SCRAM-SHA-256-PLUS Simple Authentication and Security Layer (SASL) Mechanisms: [https://tools.ietf.org/html/rfc7677] // 2015-11-02
> - RFC 8600: Using Extensible Messaging and Presence Protocol (XMPP) for Security Information Exchange: [https://tools.ietf.org/html/rfc8600] // 2019-06-21: [https://mailarchive.ietf.org/arch/msg/ietf-announce/suJMmeMhuAOmGn_PJYgX5Vm8lNA]
> SCRAM-SHA-512(-PLUS):
> - [https://tools.ietf.org/html/draft-melnikov-scram-sha-512]
> SCRAM-SHA3-512(-PLUS):
> - [https://tools.ietf.org/html/draft-melnikov-scram-sha3-512]
> SCRAM BIS: Salted Challenge Response Authentication Mechanism (SCRAM) SASL and GSS-API Mechanisms:
> - [https://tools.ietf.org/html/draft-melnikov-scram-bis]
> -PLUS variants:
> - RFC 5056: On the Use of Channel Bindings to Secure Channels: [https://tools.ietf.org/html/rfc5056] // November 2007
> - RFC 5929: Channel Bindings for TLS: [https://tools.ietf.org/html/rfc5929] // July 2010
> - Channel-Binding Types: [https://www.iana.org/assignments/channel-binding-types/channel-binding-types.xhtml]
> - RFC 9266: Channel Bindings for TLS 1.3: [https://tools.ietf.org/html/rfc9266] // July 2022
> IMAP:
> - RFC 9051: Internet Message Access Protocol (IMAP) - Version 4rev2: [https://tools.ietf.org/html/rfc9051] // August 2021
> LDAP:
> - RFC 5803: Lightweight Directory Access Protocol (LDAP) Schema for Storing Salted Challenge Response Authentication Mechanism (SCRAM) Secrets: [https://tools.ietf.org/html/rfc5803] // July 2010
> HTTP:
> - RFC 7804: Salted Challenge Response HTTP Authentication Mechanism: [https://tools.ietf.org/html/rfc7804] // March 2016
> JMAP:
> - RFC 8621: The JSON Meta Application Protocol (JMAP) for Mail: [https://tools.ietf.org/html/rfc8621] // August 2019
> 2FA:
> - Extensions to Salted Challenge Response (SCRAM) for 2 factor authentication: [https://tools.ietf.org/html/draft-ietf-kitten-scram-2fa]
> Thanks in advance.
> Linked to:
> - [https://github.com/scram-sasl/info/issues/1]
> Note: This ticket can be for other Apache projects too.
[jira] [Updated] (KAFKA-15335) Support custom SSL configuration for Kafka Connect RestServer
[ https://issues.apache.org/jira/browse/KAFKA-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15335:
    Component/s: KafkaConnect
        (was: connect)

> Support custom SSL configuration for Kafka Connect RestServer
> -------------------------------------------------------------
>
> Key: KAFKA-15335
> URL: https://issues.apache.org/jira/browse/KAFKA-15335
> Project: Kafka
> Issue Type: New Feature
> Components: KafkaConnect
> Reporter: Taras Ledkov
> Priority: Major
>
> Root issue to track [KIP-967|https://cwiki.apache.org/confluence/display/KAFKA/KIP-967%3A+Support+custom+SSL+configuration+for+Kafka+Connect+RestServer]
[jira] [Updated] (KAFKA-15310) Add timezone configuration option in TimestampConverter from connectors
[ https://issues.apache.org/jira/browse/KAFKA-15310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15310:
    Component/s: KafkaConnect
        (was: connect)

> Add timezone configuration option in TimestampConverter from connectors
> -----------------------------------------------------------------------
>
> Key: KAFKA-15310
> URL: https://issues.apache.org/jira/browse/KAFKA-15310
> Project: Kafka
> Issue Type: New Feature
> Components: config, KafkaConnect
> Reporter: Romulo Souza
> Priority: Minor
> Labels: needs-kip
> Attachments: Captura de tela de 2023-08-05 09-43-54-1.png, Captura de tela de 2023-08-05 09-44-25-1.png
>
> In some scenarios where TimestampConverter is used, it is useful to be able to specify a timezone other than the hardcoded UTC. E.g., there are use cases where a sink connector sends data to a database and this same data is used in an analysis tool without formatting and transformation options.
> A new optional Kafka Connect configuration should be added to set the desired timezone, with a fallback to UTC when not informed.
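For context, a typical TimestampConverter SMT configuration today looks like the following; the commented-out {{timezone}} property is the hypothetical addition this ticket asks for (the property name and field values are assumptions, since no KIP exists yet):

```
# Sink connector config fragment (field name and format are illustrative)
transforms=ts
transforms.ts.type=org.apache.kafka.connect.transforms.TimestampConverter$Value
transforms.ts.target.type=string
transforms.ts.field=event_time
transforms.ts.format=yyyy-MM-dd HH:mm:ss
# Proposed (hypothetical property name): override the hardcoded UTC,
# falling back to UTC when unset.
# transforms.ts.timezone=America/Sao_Paulo
```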
[jira] [Updated] (KAFKA-15208) Upgrade Jackson dependencies to version 2.16.0
[ https://issues.apache.org/jira/browse/KAFKA-15208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15208:
    Component/s: KafkaConnect
        (was: connect)

> Upgrade Jackson dependencies to version 2.16.0
> ----------------------------------------------
>
> Key: KAFKA-15208
> URL: https://issues.apache.org/jira/browse/KAFKA-15208
> Project: Kafka
> Issue Type: Improvement
> Components: KafkaConnect, kraft
> Reporter: Said BOUDJELDA
> Assignee: Said BOUDJELDA
> Priority: Major
> Labels: dependencies
> Fix For: 3.7.0
>
> Upgrading the Jackson dependencies to the latest stable version, 2.16.0, brings many bug fixes, security fixes, and performance improvements.
>
> Release notes back to the current version:
> [https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.16.0]
> [https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.14]
[jira] [Updated] (KAFKA-15203) Remove dependency on Reflections
[ https://issues.apache.org/jira/browse/KAFKA-15203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15203:
    Component/s: KafkaConnect
        (was: connect)

> Remove dependency on Reflections
> --------------------------------
>
> Key: KAFKA-15203
> URL: https://issues.apache.org/jira/browse/KAFKA-15203
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Reporter: Divij Vaidya
> Priority: Major
> Labels: newbie
>
> We currently depend on the Reflections library, which is EOL. Quoting from the GitHub site:
> _Please note: Reflections library is currently NOT under active development or maintenance_
>
> This poses a supply-chain risk for our project, where security fixes and other major bugs in the underlying dependency may not be addressed in a timely manner. Hence, we should plan to remove this dependency.
[jira] [Updated] (KAFKA-15333) Flaky build failure throwing Connect Exception: Could not connect to server....
[ https://issues.apache.org/jira/browse/KAFKA-15333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Harris updated KAFKA-15333:
    Component/s: KafkaConnect
        (was: connect)

> Flaky build failure throwing Connect Exception: Could not connect to server....
> -------------------------------------------------------------------------------
>
> Key: KAFKA-15333
> URL: https://issues.apache.org/jira/browse/KAFKA-15333
> Project: Kafka
> Issue Type: Test
> Components: KafkaConnect, unit tests
> Reporter: Philip Nee
> Priority: Major
>
> We frequently observe flaky build failures with the following message. This is from the most recent PR post 3.5.0:
> {code:java}
> > Task :generator:testClasses UP-TO-DATE
> Unexpected exception thrown.
> org.gradle.internal.remote.internal.MessageIOException: Could not read message from '/127.0.0.1:38354'.
>     at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:94)
>     at org.gradle.internal.remote.internal.hub.MessageHub$ConnectionReceive.run(MessageHub.java:270)
>     at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
>     at org.gradle.internal.concurrent.AbstractManagedExecutor$1.run(AbstractManagedExecutor.java:47)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.lang.IllegalArgumentException
>     at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:72)
>     at org.gradle.internal.remote.internal.hub.InterHubMessageSerializer$MessageReader.read(InterHubMessageSerializer.java:52)
>     at org.gradle.internal.remote.internal.inet.SocketConnection.receive(SocketConnection.java:81)
>     ... 6 more
>
> > Task :streams:upgrade-system-tests-26:unitTest
> org.gradle.internal.remote.internal.ConnectException: Could not connect to server [3156f144-9a89-4c47-91ad-88a8378ec726 port:37889, addresses:[/127.0.0.1]]. Tried addresses: [/127.0.0.1].
>     at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:67)
>     at org.gradle.internal.remote.internal.hub.MessageHubBackedClient.getConnection(MessageHubBackedClient.java:36)
>     at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:103)
>     at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:65)
>     at worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
>     at worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)
> Caused by: java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
>     at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:122)
>     at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.tryConnect(TcpOutgoingConnector.java:81)
>     at org.gradle.internal.remote.internal.inet.TcpOutgoingConnector.connect(TcpOutgoingConnector.java:54)
>     ... 5 more {code}
[jira] [Commented] (KAFKA-16051) Deadlock on connector initialization
[ https://issues.apache.org/jira/browse/KAFKA-16051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800522#comment-17800522 ] Greg Harris commented on KAFKA-16051: - Hi [~developster] Thanks for the report! that certainly looks like a deadlock that the standalone implementation should avoid. If you are interested, please assign this ticket to yourself and tag me to review your PR! If you're just looking for a workaround, I would suggest using the distributed mode. It shouldn't have the same deadlock you're seeing, and is more similar to typical production environments. > Deadlock on connector initialization > > > Key: KAFKA-16051 > URL: https://issues.apache.org/jira/browse/KAFKA-16051 > Project: Kafka > Issue Type: Bug > Components: connect >Affects Versions: 2.6.3, 3.6.1 >Reporter: Octavian Ciubotaru >Priority: Major > > > Tested with Kafka 3.6.1 and 2.6.3. > The only plugin installed is confluentinc-kafka-connect-jdbc-10.7.4. > Stack trace for Kafka 3.6.1: > {noformat} > Found one Java-level deadlock: > = > "pool-3-thread-1": > waiting to lock monitor 0x7fbc88006300 (object 0x91002aa0, a > org.apache.kafka.connect.runtime.standalone.StandaloneHerder), > which is held by "Thread-9" > "Thread-9": > waiting to lock monitor 0x7fbc88008800 (object 0x9101ccd8, a > org.apache.kafka.connect.storage.MemoryConfigBackingStore), > which is held by "pool-3-thread-1"Java stack information for the threads > listed above: > === > "pool-3-thread-1": > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder$ConfigUpdateListener.onTaskConfigUpdate(StandaloneHerder.java:516) > - waiting to lock <0x91002aa0> (a > org.apache.kafka.connect.runtime.standalone.StandaloneHerder) > at > org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:137) > - locked <0x9101ccd8> (a > org.apache.kafka.connect.storage.MemoryConfigBackingStore) > at > 
org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.lambda$null$2(StandaloneHerder.java:229) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder$$Lambda$692/0x000840557440.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.21/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.21/FutureTask.java:264) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.21/ScheduledThreadPoolExecutor.java:304) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.21/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.21/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.21/Thread.java:829) > "Thread-9": > at > org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:129) > - waiting to lock <0x9101ccd8> (a > org.apache.kafka.connect.storage.MemoryConfigBackingStore) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.requestTaskReconfiguration(StandaloneHerder.java:255) > - locked <0x91002aa0> (a > org.apache.kafka.connect.runtime.standalone.StandaloneHerder) > at > org.apache.kafka.connect.runtime.HerderConnectorContext.requestTaskReconfiguration(HerderConnectorContext.java:50) > at > org.apache.kafka.connect.runtime.WorkerConnector$WorkerConnectorContext.requestTaskReconfiguration(WorkerConnector.java:548) > at > io.confluent.connect.jdbc.source.TableMonitorThread.run(TableMonitorThread.java:86) > Found 1 deadlock. > {noformat} > The jdbc source connector is loading tables from the database and updates the > configuration once the list is available. 
The deadlock is very consistent in > my environment, probably because the database is on the same machine. > Maybe it is possible to avoid this situation by always locking the herder > first and the config backing store second. From what I see, > updateConnectorTasks is sometimes called with the herder lock already held and > other times it is not. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
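The lock-ordering fix suggested in the report can be sketched as follows. The class and method names below are illustrative stand-ins, not the real StandaloneHerder/MemoryConfigBackingStore code; the point is only that once both call paths acquire the herder lock before the config-store lock, the circular wait in the reported stack traces cannot occur:

```java
// Minimal sketch, assuming a global lock order of "herder first,
// config backing store second". Both entry points obey that order.
public class LockOrderingSketch {
    private final Object herderLock = new Object();
    private final Object configStoreLock = new Object();
    int taskConfigUpdates = 0;

    // Path corresponding to requestTaskReconfiguration: already holds the
    // herder lock before touching the config store.
    void requestTaskReconfiguration() {
        synchronized (herderLock) {
            putTaskConfigs();
        }
    }

    // Path that previously locked only the config store: it now takes the
    // herder lock up front so its ordering matches the path above.
    void updateConnectorTasks() {
        synchronized (herderLock) {
            putTaskConfigs();
        }
    }

    private void putTaskConfigs() {
        synchronized (configStoreLock) {
            taskConfigUpdates++; // listener callbacks would run here
        }
    }
}
```

With a single consistent order, a thread holding the config-store lock can never be waiting for the herder lock, which is exactly the wait-cycle shown in the thread dump.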
[jira] [Commented] (KAFKA-16047) Source connector with EOS enabled have some InitProducerId requests timing out, effectively failing all the tasks & the whole connector
[ https://issues.apache.org/jira/browse/KAFKA-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800520#comment-17800520 ] Greg Harris commented on KAFKA-16047: - It appears that the transaction timeout is used as the timeout to produce records to the __transaction_state topic in TransactionStateManager#appendTransactionToLog [https://github.com/apache/kafka/blob/d582d5aff517879b150bc2739bad99df07e15e2b/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L769-L770] which is in turn used during init, end, and add-partitions in TransactionCoordinator: [https://github.com/apache/kafka/blob/d582d5aff517879b150bc2739bad99df07e15e2b/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L199] I see that after expiration, TransactionStateManager#writeTombstonesForExpiredTransactionalIds uses the request.timeout.ms: [https://github.com/apache/kafka/blob/d582d5aff517879b150bc2739bad99df07e15e2b/core/src/main/scala/kafka/coordinator/transaction/TransactionStateManager.scala#L283-L284] Should the TransactionStateManager be using the transaction timeout ms for a single-operation, or should the request timeout be used throughout? Using the transaction timeout is certainly shorter than we expect, but I wonder if that's also longer than people expect in other situations. When the producer makes these requests, it only waits for max.block.ms (default 60s) for them to complete. After that point, the producer times out the request, while the broker may be left waiting for the transaction timeout (default 60s) to expire. * We can fix this on the broker side by changing the produce timeout to the value of "max(transaction timeout, request timeout)". Someone may have increased their max.block.ms & transaction timeout while keeping the request timeout short, and would still desire that the init call block for up to max.block.ms. 
* We can fix this on the client side by changing the 1ms hardcoded value to use the same timeout as the overall fenceProducers request, which is either the default request timeout (configurable) or specified via the Options argument. > Source connector with EOS enabled have some InitProducerId requests timing > out, effectively failing all the tasks & the whole connector > --- > > Key: KAFKA-16047 > URL: https://issues.apache.org/jira/browse/KAFKA-16047 > Project: Kafka > Issue Type: Bug > Components: connect, mirrormaker >Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2, 3.4.1, 3.6.0, 3.5.1, 3.5.2, > 3.6.1 >Reporter: Angelos Kaltsikis >Priority: Major > > Source connectors with 'exactly.once.support = required' may have some of > their tasks' InitProducerId requests from the admin client time out. > In the case of MirrorSourceConnector, which is the source connector in which I > found the bug, the bug effectively made all the tasks become "FAILED". As soon as one of the tasks gets FAILED due to the > 'COORDINATOR_NOT_AVAILABLE' messages (due to timeouts), no matter how many > restarts I did to the connector/tasks, I couldn't get the > MirrorSourceConnector in a healthy RUNNING state again. > Due to the low timeout that has been [hard-coded in the > code|https://github.com/apache/kafka/blob/3.6.1/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L87] > (1ms), there is a chance that the `InitProducerId` requests time out in the case > of "slower-than-expected" Kafka brokers (that do not process & respond to the > above request in <= 1ms). (feel free to read more information about the issue > in the "More Context" section below) > [~ChrisEgerton] I would appreciate it if you could respond to the following > questions: > - How and why was the 1ms magic number chosen for the transaction timeout? 
> - Is there any specific reason to believe it can be guaranteed that the > `InitProducerId` request can be processed in such a small time window? > - I have tried the above in multiple different Kafka clusters that are hosted > on different underlying datacenter hosts, and I don't believe that those > brokers are "slow" for some reason. If you feel that the brokers are slower > than expected, I would appreciate any pointers on how I could find out what > the bottleneck is. > h3. Temporary Mitigation > I have increased the timeout to 1000ms (randomly picked this number, just > wanted to give enough time to brokers to always complete those types of > requests). The fix can be found in my fork > https://github.com/akaltsikis/kafka/commit/8a47992e7dc63954f9d9ac54e8ed1f5a7737c97f > > h3. Final solution > The temporary mitigation is not ideal, as it still
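The broker-side option from the comment above amounts to never letting the produce timeout for the __transaction_state append drop below the request timeout, even when a client configured a short transaction timeout. A minimal sketch (the class and method names are hypothetical, not the actual TransactionStateManager API):

```java
// Sketch of the proposed broker-side fix: use
// max(transaction timeout, request timeout) when producing to
// __transaction_state, so a short transaction timeout cannot starve
// init/end/add-partitions appends. Names are illustrative only.
public class ProduceTimeoutSketch {
    static long effectiveProduceTimeoutMs(long transactionTimeoutMs, long requestTimeoutMs) {
        return Math.max(transactionTimeoutMs, requestTimeoutMs);
    }
}
```

Under this rule, the 1ms transaction timeout from the fenceProducers path would still yield at least the broker's request timeout for the append, while users who deliberately raised their transaction timeout keep the longer bound.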
[jira] [Resolved] (KAFKA-15111) Correction kafka examples
[ https://issues.apache.org/jira/browse/KAFKA-15111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris resolved KAFKA-15111. - Resolution: Duplicate > Correction kafka examples > - > > Key: KAFKA-15111 > URL: https://issues.apache.org/jira/browse/KAFKA-15111 > Project: Kafka > Issue Type: Task >Reporter: Dmitry >Priority: Minor > Fix For: 3.6.0 > > > Need to set TOPIC_NAME = topic1 in the KafkaConsumerProducerDemo class and remove > the unused TOPIC field from KafkaProperties. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15111) Correction kafka examples
[ https://issues.apache.org/jira/browse/KAFKA-15111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15111: Fix Version/s: 3.6.0 (was: 3.7.0) > Correction kafka examples > - > > Key: KAFKA-15111 > URL: https://issues.apache.org/jira/browse/KAFKA-15111 > Project: Kafka > Issue Type: Task >Reporter: Dmitry >Priority: Minor > Fix For: 3.6.0 > > > Need to set TOPIC_NAME = topic1 in the KafkaConsumerProducerDemo class and remove > the unused TOPIC field from KafkaProperties. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795958#comment-17795958 ] Greg Harris commented on KAFKA-15372: - This is now set to release for 3.7 and 3.6, but I had some issues with the 3.5 backport that I had to revert. In particular, the DedicatedMirrorTest has this persistent failure: {noformat} org.apache.kafka.test.NoRetryException at app//org.apache.kafka.connect.mirror.integration.DedicatedMirrorIntegrationTest.lambda$awaitTaskConfigurations$8(DedicatedMirrorIntegrationTest.java:363) at app//org.apache.kafka.test.TestUtils.lambda$waitForCondition$4(TestUtils.java:337) at app//org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385) at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:334) at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:318) at app//org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:308) at app//org.apache.kafka.connect.mirror.integration.DedicatedMirrorIntegrationTest.awaitTaskConfigurations(DedicatedMirrorIntegrationTest.java:353) at app//org.apache.kafka.connect.mirror.integration.DedicatedMirrorIntegrationTest.testMultiNodeCluster(DedicatedMirrorIntegrationTest.java:301) Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.connect.runtime.distributed.RebalanceNeededException: Request cannot be completed because a rebalance is expected at org.apache.kafka.connect.util.ConvertingFutureCallback.result(ConvertingFutureCallback.java:123) at org.apache.kafka.connect.util.ConvertingFutureCallback.get(ConvertingFutureCallback.java:115) at org.apache.kafka.connect.mirror.integration.DedicatedMirrorIntegrationTest.lambda$awaitTaskConfigurations$8(DedicatedMirrorIntegrationTest.java:357) ... 
7 more Caused by: org.apache.kafka.connect.runtime.distributed.RebalanceNeededException: Request cannot be completed because a rebalance is expected{noformat} > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Assignee: Greg Harris >Priority: Major > Fix For: 3.7.0, 3.6.2 > > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15372: Fix Version/s: 3.6.2 > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Assignee: Greg Harris >Priority: Major > Fix For: 3.7.0, 3.6.2 > > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-15906) Emit offset syncs more often than offset.lag.max for low-throughput/finite partitions
[ https://issues.apache.org/jira/browse/KAFKA-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris reassigned KAFKA-15906: --- Assignee: Greg Harris > Emit offset syncs more often than offset.lag.max for low-throughput/finite > partitions > - > > Key: KAFKA-15906 > URL: https://issues.apache.org/jira/browse/KAFKA-15906 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > > Right now, the offset.lag.max configuration limits the number of offset syncs > are emitted by the MirrorSourceTask, along with a fair rate-limiting > semaphore. After 100 records have been emitted for a partition, _and_ the > semaphore is available, an offset sync can be emitted. > For low-volume topics, the `offset.lag.max` default of 100 is much more > restrictive than the rate-limiting semaphore. For example, a topic which > mirrors at the rate of 1 record/sec may take 100 seconds to receive an offset > sync. If the topic is actually finite, the last offset sync will never > arrive, and the translation will have a persistent lag. > Instead, we can periodically flush the offset syncs for partitions that are > under the offset.lag.max limit, but have not received an offset sync > recently. This could be a new configuration, be a hard-coded time, or be > based on the existing emit.checkpoints.interval.seconds and > sync.group.offsets.interval.seconds configurations. > > Alternatively, we could decrease the default `offset.lag.max` value to 0, and > rely on the fair semaphore to limit the number of syncs emitted for > high-throughput partitions. The semaphore is not currently configurable, so > users wanting lower throughput on the offset-syncs topic will still need an > offset.lag.max > 0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
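The emission policy described in the ticket above, plus the proposed periodic flush for quiet partitions, can be sketched as a small per-partition policy object. The time-based threshold (`maxQuietMs`) is the hypothetical new knob; none of this is the actual MirrorSourceTask code, and the real rate-limiting semaphore is only released after the sync send completes:

```java
import java.util.concurrent.Semaphore;

// Sketch: emit an offset sync either when offset.lag.max records have
// accumulated (existing rule) or when the partition has been quiet past
// maxQuietMs (proposed rule), subject to a fair rate-limiting semaphore.
public class OffsetSyncPolicy {
    private final long offsetLagMax;
    private final long maxQuietMs;                  // hypothetical new setting
    private final Semaphore rateLimit = new Semaphore(1);
    private long recordsSinceSync = 0;
    private long lastSyncAtMs;

    public OffsetSyncPolicy(long offsetLagMax, long maxQuietMs, long nowMs) {
        this.offsetLagMax = offsetLagMax;
        this.maxQuietMs = maxQuietMs;
        this.lastSyncAtMs = nowMs;
    }

    // Called once per mirrored record; returns true when a sync should be emitted.
    public boolean onRecord(long nowMs) {
        recordsSinceSync++;
        boolean due = recordsSinceSync >= offsetLagMax        // existing lag rule
                || nowMs - lastSyncAtMs >= maxQuietMs;        // proposed time rule
        if (due && rateLimit.tryAcquire()) {
            recordsSinceSync = 0;
            lastSyncAtMs = nowMs;
            rateLimit.release(); // real code releases after the send completes
            return true;
        }
        return false;
    }
}
```

For the 1 record/sec example in the ticket, the time rule would bound translation lag by roughly `maxQuietMs` instead of the 100 seconds implied by the default `offset.lag.max`.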
[jira] [Commented] (KAFKA-15985) Mirrormaker 2 offset sync is incomplete
[ https://issues.apache.org/jira/browse/KAFKA-15985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17794433#comment-17794433 ] Greg Harris commented on KAFKA-15985: - Hi [~Reamer] The default value of `offset.lag.max` is 100, which could cause the 31 < 100 lag you're seeing. You can work-around this by setting `offset.lag.max` to `0`. We're exploring a fix to this in KAFKA-15906. > Mirrormaker 2 offset sync is incomplete > --- > > Key: KAFKA-15985 > URL: https://issues.apache.org/jira/browse/KAFKA-15985 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.5.1 >Reporter: Philipp Dallig >Priority: Major > > We are currently trying to migrate between two Kafka clusters using > Mirrormaker2 > new kafka cluster version: 7.5.2-ccs > old kafka cluster version: kafka_2.13-2.8.0 > The Mirrormaker 2 process runs on the new cluster (target cluster). > My main problem: The lag in the target cluster is not the same as in the > source cluster. > I have set up a producer and consumer against the old Kafka cluster. If I > stop both. The lag in the old Kafka cluster is 0, while it is > 0 in the new > Kafka cluster. > target cluster > {code} > GROUP TOPICPARTITION CURRENT-OFFSET > LOG-END-OFFSET LAG CONSUMER-ID HOSTCLIENT-ID > test-sync-5 kafka-replication-test-5 0 36373668 > 31 - - - > {code} > source cluster > {code} > GROUP TOPICPARTITION CURRENT-OFFSET > LOG-END-OFFSET LAG CONSUMER-ID HOSTCLIENT-ID > test-sync-5 kafka-replication-test-5 0 36683668 > 0 - - - > {code} > MM2 configuration without connection properties. 
> {code} > t-kafka->t-extkafka.enabled = true > t-kafka->t-extkafka.topics = ops_filebeat, kafka-replication-.* > t-kafka->t-extkafka.sync.topic.acls.enabled = false > t-kafka->t-extkafka.sync.group.offsets.enabled = true > t-kafka->t-extkafka.sync.group.offsets.interval.seconds = 30 > t-kafka->t-extkafka.refresh.groups.interval.seconds = 30 > t-kafka->t-extkafka.offset-syncs.topic.location = target > t-kafka->t-extkafka.emit.checkpoints.interval.seconds = 30 > t-kafka->t-extkafka.replication.policy.class = > org.apache.kafka.connect.mirror.IdentityReplicationPolicy > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
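The workaround from the comment (setting `offset.lag.max` to 0) applied to the configuration quoted in the report would be one additional line; the flow name is taken from the report, and this is a sketch of the suggested setting rather than a guaranteed fix:

```
# Emit an offset sync for every replicated record rather than every 100,
# so low-throughput partitions can be translated without a persistent lag.
t-kafka->t-extkafka.offset.lag.max = 0
```

The setting can also be applied globally (without the `source->target` prefix) if all flows should translate with minimal lag.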
[jira] [Updated] (KAFKA-15816) Typos in tests leak network sockets
[ https://issues.apache.org/jira/browse/KAFKA-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15816: Description: There are a few tests which leak network sockets due to small typos in the tests themselves. Clients: [https://github.com/apache/kafka/pull/14750] (DONE) * NioEchoServer * KafkaConsumerTest * KafkaProducerTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest * SaslAuthenticatorTest Core: [https://github.com/apache/kafka/pull/14754] * MiniKdc * GssapiAuthenticationTest * MirrorMakerIntegrationTest * SocketServerTest * EpochDrivenReplicationProtocolAcceptanceTest * LeaderEpochIntegrationTest Trogdor: [https://github.com/apache/kafka/pull/14771] * AgentTest Mirror: [https://github.com/apache/kafka/pull/14761] (DONE) * DedicatedMirrorIntegrationTest * MirrorConnectorsIntegrationTest * MirrorConnectorsWithCustomForwardingAdminIntegrationTest Runtime: [https://github.com/apache/kafka/pull/14764] * ConnectorTopicsIntegrationTest * ExactlyOnceSourceIntegrationTest * WorkerTest * WorkerGroupMemberTest Streams: [https://github.com/apache/kafka/pull/14769] (DONE) * IQv2IntegrationTest * MetricsReporterIntegrationTest * NamedTopologyIntegrationTest * PurgeRepartitionTopicIntegrationTest These can be addressed by just fixing the tests. was: There are a few tests which leak network sockets due to small typos in the tests themselves. 
Clients: [https://github.com/apache/kafka/pull/14750] (DONE) * NioEchoServer * KafkaConsumerTest * KafkaProducerTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest * SaslAuthenticatorTest Core: [https://github.com/apache/kafka/pull/14754] * MiniKdc * GssapiAuthenticationTest * MirrorMakerIntegrationTest * SocketServerTest * EpochDrivenReplicationProtocolAcceptanceTest * LeaderEpochIntegrationTest Trogdor: [https://github.com/apache/kafka/pull/14771] * AgentTest Mirror: [https://github.com/apache/kafka/pull/14761] * DedicatedMirrorIntegrationTest * MirrorConnectorsIntegrationTest * MirrorConnectorsWithCustomForwardingAdminIntegrationTest Runtime: [https://github.com/apache/kafka/pull/14764] * ConnectorTopicsIntegrationTest * ExactlyOnceSourceIntegrationTest * WorkerTest * WorkerGroupMemberTest Streams: [https://github.com/apache/kafka/pull/14769] (DONE) * IQv2IntegrationTest * MetricsReporterIntegrationTest * NamedTopologyIntegrationTest * PurgeRepartitionTopicIntegrationTest These can be addressed by just fixing the tests. > Typos in tests leak network sockets > --- > > Key: KAFKA-15816 > URL: https://issues.apache.org/jira/browse/KAFKA-15816 > Project: Kafka > Issue Type: Bug > Components: unit tests >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > > There are a few tests which leak network sockets due to small typos in the > tests themselves. 
> Clients: [https://github.com/apache/kafka/pull/14750] (DONE) > * NioEchoServer > * KafkaConsumerTest > * KafkaProducerTest > * SelectorTest > * SslTransportLayerTest > * SslTransportTls12Tls13Test > * SslVersionsTransportLayerTest > * SaslAuthenticatorTest > Core: [https://github.com/apache/kafka/pull/14754] > * MiniKdc > * GssapiAuthenticationTest > * MirrorMakerIntegrationTest > * SocketServerTest > * EpochDrivenReplicationProtocolAcceptanceTest > * LeaderEpochIntegrationTest > Trogdor: [https://github.com/apache/kafka/pull/14771] > * AgentTest > Mirror: [https://github.com/apache/kafka/pull/14761] (DONE) > * DedicatedMirrorIntegrationTest > * MirrorConnectorsIntegrationTest > * MirrorConnectorsWithCustomForwardingAdminIntegrationTest > Runtime: [https://github.com/apache/kafka/pull/14764] > * ConnectorTopicsIntegrationTest > * ExactlyOnceSourceIntegrationTest > * WorkerTest > * WorkerGroupMemberTest > Streams: [https://github.com/apache/kafka/pull/14769] (DONE) > * IQv2IntegrationTest > * MetricsReporterIntegrationTest > * NamedTopologyIntegrationTest > * PurgeRepartitionTopicIntegrationTest > These can be addressed by just fixing the tests. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15816) Typos in tests leak network sockets
[ https://issues.apache.org/jira/browse/KAFKA-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15816: Description: There are a few tests which leak network sockets due to small typos in the tests themselves. Clients: [https://github.com/apache/kafka/pull/14750] (DONE) * NioEchoServer * KafkaConsumerTest * KafkaProducerTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest * SaslAuthenticatorTest Core: [https://github.com/apache/kafka/pull/14754] * MiniKdc * GssapiAuthenticationTest * MirrorMakerIntegrationTest * SocketServerTest * EpochDrivenReplicationProtocolAcceptanceTest * LeaderEpochIntegrationTest Trogdor: [https://github.com/apache/kafka/pull/14771] * AgentTest Mirror: [https://github.com/apache/kafka/pull/14761] * DedicatedMirrorIntegrationTest * MirrorConnectorsIntegrationTest * MirrorConnectorsWithCustomForwardingAdminIntegrationTest Runtime: [https://github.com/apache/kafka/pull/14764] * ConnectorTopicsIntegrationTest * ExactlyOnceSourceIntegrationTest * WorkerTest * WorkerGroupMemberTest Streams: [https://github.com/apache/kafka/pull/14769] (DONE) * IQv2IntegrationTest * MetricsReporterIntegrationTest * NamedTopologyIntegrationTest * PurgeRepartitionTopicIntegrationTest These can be addressed by just fixing the tests. was: There are a few tests which leak network sockets due to small typos in the tests themselves. 
Clients: [https://github.com/apache/kafka/pull/14750] * NioEchoServer * KafkaConsumerTest * KafkaProducerTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest * SaslAuthenticatorTest Core: [https://github.com/apache/kafka/pull/14754] * MiniKdc * GssapiAuthenticationTest * MirrorMakerIntegrationTest * SocketServerTest * EpochDrivenReplicationProtocolAcceptanceTest * LeaderEpochIntegrationTest Trogdor: [https://github.com/apache/kafka/pull/14771] * AgentTest Mirror: [https://github.com/apache/kafka/pull/14761] * DedicatedMirrorIntegrationTest * MirrorConnectorsIntegrationTest * MirrorConnectorsWithCustomForwardingAdminIntegrationTest Runtime: [https://github.com/apache/kafka/pull/14764] * ConnectorTopicsIntegrationTest * ExactlyOnceSourceIntegrationTest * WorkerTest * WorkerGroupMemberTest Streams: [https://github.com/apache/kafka/pull/14769] (DONE) * IQv2IntegrationTest * MetricsReporterIntegrationTest * NamedTopologyIntegrationTest * PurgeRepartitionTopicIntegrationTest These can be addressed by just fixing the tests. > Typos in tests leak network sockets > --- > > Key: KAFKA-15816 > URL: https://issues.apache.org/jira/browse/KAFKA-15816 > Project: Kafka > Issue Type: Bug > Components: unit tests >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > > There are a few tests which leak network sockets due to small typos in the > tests themselves. 
> Clients: [https://github.com/apache/kafka/pull/14750] (DONE) > * NioEchoServer > * KafkaConsumerTest > * KafkaProducerTest > * SelectorTest > * SslTransportLayerTest > * SslTransportTls12Tls13Test > * SslVersionsTransportLayerTest > * SaslAuthenticatorTest > Core: [https://github.com/apache/kafka/pull/14754] > * MiniKdc > * GssapiAuthenticationTest > * MirrorMakerIntegrationTest > * SocketServerTest > * EpochDrivenReplicationProtocolAcceptanceTest > * LeaderEpochIntegrationTest > Trogdor: [https://github.com/apache/kafka/pull/14771] > * AgentTest > Mirror: [https://github.com/apache/kafka/pull/14761] > * DedicatedMirrorIntegrationTest > * MirrorConnectorsIntegrationTest > * MirrorConnectorsWithCustomForwardingAdminIntegrationTest > Runtime: [https://github.com/apache/kafka/pull/14764] > * ConnectorTopicsIntegrationTest > * ExactlyOnceSourceIntegrationTest > * WorkerTest > * WorkerGroupMemberTest > Streams: [https://github.com/apache/kafka/pull/14769] (DONE) > * IQv2IntegrationTest > * MetricsReporterIntegrationTest > * NamedTopologyIntegrationTest > * PurgeRepartitionTopicIntegrationTest > These can be addressed by just fixing the tests. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15816) Typos in tests leak network sockets
[ https://issues.apache.org/jira/browse/KAFKA-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15816: Description: There are a few tests which leak network sockets due to small typos in the tests themselves. Clients: [https://github.com/apache/kafka/pull/14750] * NioEchoServer * KafkaConsumerTest * KafkaProducerTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest * SaslAuthenticatorTest Core: [https://github.com/apache/kafka/pull/14754] * MiniKdc * GssapiAuthenticationTest * MirrorMakerIntegrationTest * SocketServerTest * EpochDrivenReplicationProtocolAcceptanceTest * LeaderEpochIntegrationTest Trogdor: [https://github.com/apache/kafka/pull/14771] * AgentTest Mirror: [https://github.com/apache/kafka/pull/14761] * DedicatedMirrorIntegrationTest * MirrorConnectorsIntegrationTest * MirrorConnectorsWithCustomForwardingAdminIntegrationTest Runtime: [https://github.com/apache/kafka/pull/14764] * ConnectorTopicsIntegrationTest * ExactlyOnceSourceIntegrationTest * WorkerTest * WorkerGroupMemberTest Streams: [https://github.com/apache/kafka/pull/14769] (DONE) * IQv2IntegrationTest * MetricsReporterIntegrationTest * NamedTopologyIntegrationTest * PurgeRepartitionTopicIntegrationTest These can be addressed by just fixing the tests. was: There are a few tests which leak network sockets due to small typos in the tests themselves. 
Clients: [https://github.com/apache/kafka/pull/14750] * NioEchoServer * KafkaConsumerTest * KafkaProducerTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest * SaslAuthenticatorTest Core: [https://github.com/apache/kafka/pull/14754] * MiniKdc * GssapiAuthenticationTest * MirrorMakerIntegrationTest * SocketServerTest * EpochDrivenReplicationProtocolAcceptanceTest * LeaderEpochIntegrationTest Trogdor: [https://github.com/apache/kafka/pull/14771] * AgentTest Mirror: [https://github.com/apache/kafka/pull/14761] * DedicatedMirrorIntegrationTest * MirrorConnectorsIntegrationTest * MirrorConnectorsWithCustomForwardingAdminIntegrationTest Runtime: [https://github.com/apache/kafka/pull/14764] * ConnectorTopicsIntegrationTest * ExactlyOnceSourceIntegrationTest * WorkerTest * WorkerGroupMemberTest Streams: [https://github.com/apache/kafka/pull/14769] * IQv2IntegrationTest * MetricsReporterIntegrationTest * NamedTopologyIntegrationTest * PurgeRepartitionTopicIntegrationTest These can be addressed by just fixing the tests. > Typos in tests leak network sockets > --- > > Key: KAFKA-15816 > URL: https://issues.apache.org/jira/browse/KAFKA-15816 > Project: Kafka > Issue Type: Bug > Components: unit tests >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > > There are a few tests which leak network sockets due to small typos in the > tests themselves. 
> Clients: [https://github.com/apache/kafka/pull/14750] > * NioEchoServer > * KafkaConsumerTest > * KafkaProducerTest > * SelectorTest > * SslTransportLayerTest > * SslTransportTls12Tls13Test > * SslVersionsTransportLayerTest > * SaslAuthenticatorTest > Core: [https://github.com/apache/kafka/pull/14754] > * MiniKdc > * GssapiAuthenticationTest > * MirrorMakerIntegrationTest > * SocketServerTest > * EpochDrivenReplicationProtocolAcceptanceTest > * LeaderEpochIntegrationTest > Trogdor: [https://github.com/apache/kafka/pull/14771] > * AgentTest > Mirror: [https://github.com/apache/kafka/pull/14761] > * DedicatedMirrorIntegrationTest > * MirrorConnectorsIntegrationTest > * MirrorConnectorsWithCustomForwardingAdminIntegrationTest > Runtime: [https://github.com/apache/kafka/pull/14764] > * ConnectorTopicsIntegrationTest > * ExactlyOnceSourceIntegrationTest > * WorkerTest > * WorkerGroupMemberTest > Streams: [https://github.com/apache/kafka/pull/14769] (DONE) > * IQv2IntegrationTest > * MetricsReporterIntegrationTest > * NamedTopologyIntegrationTest > * PurgeRepartitionTopicIntegrationTest > These can be addressed by just fixing the tests. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15912) Parallelize conversion and transformation steps in Connect
[ https://issues.apache.org/jira/browse/KAFKA-15912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17790672#comment-17790672 ] Greg Harris commented on KAFKA-15912: - Hey [~mimaison] thanks for the ticket. I've only thought briefly about this and haven't found any obvious blockers, but there are some design restrictions: # Since the javadocs for Transformation and Predicate don't mention thread-safety, I think we have to assume that they are not thread-safe # There is room in the API for a Transformation to be stateful and order-sensitive (such as packing records together), so I think we would be unable to instantiate multiple copies of a single transform stage, and all records would have to pass serially through a stage. If Transformations and Predicates could declare themselves thread-safe, then we would be able to do some finer-grained parallelism, or fall back to actor-style parallelism (a single thread with message queue input). I think it would be ineffective/undesirable for Transformations to take this performance optimization burden upon themselves completely like the Task implementations do, so we should certainly improve the framework in this area. > Parallelize conversion and transformation steps in Connect > -- > > Key: KAFKA-15912 > URL: https://issues.apache.org/jira/browse/KAFKA-15912 > Project: Kafka > Issue Type: Improvement > Components: connect >Reporter: Mickael Maison >Priority: Major > > In busy Connect pipelines, the conversion and transformation steps can > sometimes have a very significant impact on performance. This is especially > true with large records with complex schemas, for example with CDC connectors > like Debezium. > Today, in order to always preserve ordering, converters and transformations > are called on one record at a time in a single thread in the Connect worker. 
> As Connect usually handles records in batches (up to max.poll.records in sink > pipelines; for source pipelines it really depends on the connector, but > most connectors I've seen still tend to return multiple records each loop), > it could be highly beneficial to attempt running the converters and > transformation chain in parallel on a pool of processing threads. > It should be possible to do some of these steps in parallel and still keep > exact ordering. I'm even considering whether an option to lose ordering but > allow even faster processing would make sense. -- This message was sent by Atlassian Jira (v8.20.10#820010)
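One way to parallelize the transformation chain while keeping exact ordering, as the ticket suggests, is to fan a batch out to a worker pool but collect results in submission order. The sketch below is hypothetical (it is not Connect's actual code, and the class and method names are invented); it also assumes the transform is thread-safe, which, as noted in the comment above, Transformations cannot currently declare.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class OrderedParallelTransform {
    // Runs the transform for each record on a worker pool, then collects the
    // results in submission order, so output order always matches input order
    // regardless of which worker thread finishes first.
    public static <R> List<R> transformAll(List<R> batch, Function<R, R> transform, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>(batch.size());
            for (R record : batch) {
                Callable<R> task = () -> transform.apply(record);
                futures.add(pool.submit(task));
            }
            List<R> out = new ArrayList<>(batch.size());
            for (Future<R> f : futures) {
                out.add(f.get()); // blocks until this record's transform is done
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A stateful, order-sensitive transform stage would still need to run serially; this pattern only helps for stages that are independent per record.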
[jira] [Updated] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15372: Fix Version/s: 3.7.0 > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Assignee: Greg Harris >Priority: Major > Fix For: 3.7.0 > > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15906) Emit offset syncs more often than offset.lag.max for low-throughput/finite partitions
[ https://issues.apache.org/jira/browse/KAFKA-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15906: Description: Right now, the offset.lag.max configuration limits the number of offset syncs that are emitted by the MirrorSourceTask, along with a fair rate-limiting semaphore. After 100 records have been emitted for a partition, _and_ the semaphore is available, an offset sync can be emitted. For low-volume topics, the `offset.lag.max` default of 100 is much more restrictive than the rate-limiting semaphore. For example, a topic which mirrors at the rate of 1 record/sec may take 100 seconds to receive an offset sync. If the topic is actually finite, the last offset sync will never arrive, and the translation will have a persistent lag. Instead, we can periodically flush the offset syncs for partitions that are under the offset.lag.max limit, but have not received an offset sync recently. This could be a new configuration, be a hard-coded time, or be based on the existing emit.checkpoints.interval.seconds and sync.group.offsets.interval.seconds configurations. Alternatively, we could decrease the default `offset.lag.max` value to 0, and rely on the fair semaphore to limit the number of syncs emitted for high-throughput partitions. The semaphore is not currently configurable, so users wanting lower throughput on the offset-syncs topic will still need an offset.lag.max > 0. was: Right now, the offset.lag.max configuration limits the number of offset syncs that are emitted by the MirrorSourceTask, along with a fair rate-limiting semaphore. After 100 records have been emitted for a partition, _and_ the semaphore is available, an offset sync can be emitted. For low-volume topics, the `offset.lag.max` default of 100 is much more restrictive than the rate-limiting semaphore. For example, a topic which mirrors at the rate of 1 record/sec may take 100 seconds to receive an offset sync. 
If the topic is actually finite, the last offset sync will never arrive, and the translation will have a persistent lag. Instead, we can periodically flush the offset syncs for partitions that are under the offset.lag.max limit, but have not received an offset sync recently. This could be a new configuration, be a hard-coded time, or be based on the existing emit.checkpoints.interval.seconds and sync.group.offsets.interval.seconds configurations. > Emit offset syncs more often than offset.lag.max for low-throughput/finite > partitions > - > > Key: KAFKA-15906 > URL: https://issues.apache.org/jira/browse/KAFKA-15906 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Reporter: Greg Harris >Priority: Minor > > Right now, the offset.lag.max configuration limits the number of offset syncs > that are emitted by the MirrorSourceTask, along with a fair rate-limiting > semaphore. After 100 records have been emitted for a partition, _and_ the > semaphore is available, an offset sync can be emitted. > For low-volume topics, the `offset.lag.max` default of 100 is much more > restrictive than the rate-limiting semaphore. For example, a topic which > mirrors at the rate of 1 record/sec may take 100 seconds to receive an offset > sync. If the topic is actually finite, the last offset sync will never > arrive, and the translation will have a persistent lag. > Instead, we can periodically flush the offset syncs for partitions that are > under the offset.lag.max limit, but have not received an offset sync > recently. This could be a new configuration, be a hard-coded time, or be > based on the existing emit.checkpoints.interval.seconds and > sync.group.offsets.interval.seconds configurations. > > Alternatively, we could decrease the default `offset.lag.max` value to 0, and > rely on the fair semaphore to limit the number of syncs emitted for > high-throughput partitions. 
The semaphore is not currently configurable, so > users wanting lower throughput on the offset-syncs topic will still need an > offset.lag.max > 0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15906) Emit offset syncs more often than offset.lag.max for low-throughput/finite partitions
Greg Harris created KAFKA-15906: --- Summary: Emit offset syncs more often than offset.lag.max for low-throughput/finite partitions Key: KAFKA-15906 URL: https://issues.apache.org/jira/browse/KAFKA-15906 Project: Kafka Issue Type: Improvement Components: mirrormaker Reporter: Greg Harris Right now, the offset.lag.max configuration limits the number of offset syncs that are emitted by the MirrorSourceTask, along with a fair rate-limiting semaphore. After 100 records have been emitted for a partition, _and_ the semaphore is available, an offset sync can be emitted. For low-volume topics, the `offset.lag.max` default of 100 is much more restrictive than the rate-limiting semaphore. For example, a topic which mirrors at the rate of 1 record/sec may take 100 seconds to receive an offset sync. If the topic is actually finite, the last offset sync will never arrive, and the translation will have a persistent lag. Instead, we can periodically flush the offset syncs for partitions that are under the offset.lag.max limit, but have not received an offset sync recently. This could be a new configuration, be a hard-coded time, or be based on the existing emit.checkpoints.interval.seconds and sync.group.offsets.interval.seconds configurations. -- This message was sent by Atlassian Jira (v8.20.10#820010)
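The emission policy proposed in this ticket can be sketched as a single predicate. This is a hypothetical illustration, not MirrorSourceTask's actual code: the class name and the time-based `flushIntervalMs` parameter (standing in for the new configuration the ticket proposes) are invented.

```java
public class OffsetSyncPolicy {
    private final long offsetLagMax;    // record-count threshold (existing offset.lag.max)
    private final long flushIntervalMs; // assumed new time-based flush interval

    public OffsetSyncPolicy(long offsetLagMax, long flushIntervalMs) {
        this.offsetLagMax = offsetLagMax;
        this.flushIntervalMs = flushIntervalMs;
    }

    public boolean shouldEmit(long recordsSinceLastSync, long msSinceLastSync,
                              boolean semaphoreAvailable) {
        if (!semaphoreAvailable) return false;      // rate limiting still applies
        if (recordsSinceLastSync == 0) return false; // nothing new to translate
        // Existing path: enough records have accumulated since the last sync.
        if (recordsSinceLastSync >= offsetLagMax) return true;
        // Proposed path: a low-volume or finite partition has been quiet too
        // long, so flush a sync rather than waiting for the record threshold.
        return msSinceLastSync >= flushIntervalMs;
    }
}
```

With the time-based branch, a finite topic's final records eventually get a sync, closing the persistent translation lag described above.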
[jira] [Created] (KAFKA-15905) Restarts of MirrorCheckpointTask should not permanently interrupt offset translation
Greg Harris created KAFKA-15905: --- Summary: Restarts of MirrorCheckpointTask should not permanently interrupt offset translation Key: KAFKA-15905 URL: https://issues.apache.org/jira/browse/KAFKA-15905 Project: Kafka Issue Type: Improvement Components: mirrormaker Affects Versions: 3.6.0 Reporter: Greg Harris Executive summary: When the MirrorCheckpointTask restarts, it loses the state of checkpointsPerConsumerGroup, which limits offset translation to records mirrored after the latest restart. For example, if 1000 records are mirrored and the OffsetSyncs are read by MirrorCheckpointTask, the emitted checkpoints are cached, and translation can happen at the ~500th record. If MirrorCheckpointTask restarts, and 1000 more records are mirrored, translation can happen at the ~1500th record, but no longer at the ~500th record. Context: Before KAFKA-13659, MM2 made translation decisions based on the incompletely-initialized OffsetSyncStore, and the checkpoint could appear to go backwards temporarily during restarts. To fix this, we forced the OffsetSyncStore to initialize completely before translation could take place, ensuring that the latest OffsetSync had been read, and thus providing the most accurate translation. Before KAFKA-14666, MM2 translated offsets only off of the latest OffsetSync. Afterwards, an in-memory sparse cache of historical OffsetSyncs was kept, to allow for translation of earlier offsets. This came with the caveat that the cache's sparseness allowed translations to go backwards permanently. To prevent this behavior, a cache of the latest Checkpoints was kept in the MirrorCheckpointTask#checkpointsPerConsumerGroup variable, and offset translation remained restricted to the fully-initialized OffsetSyncStore. Effectively, the MirrorCheckpointTask ensures that it translates based on an OffsetSync emitted during its lifetime, to ensure that no previous MirrorCheckpointTask emitted a later sync. 
If we can read the checkpoints emitted by previous generations of MirrorCheckpointTask, we can still ensure that checkpoints are monotonic, while allowing translation further back in history. -- This message was sent by Atlassian Jira (v8.20.10#820010)
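The monotonicity guard this ticket proposes can be sketched as a small map seeded from checkpoints read back at startup. The class and method names below are hypothetical, not MirrorCheckpointTask's actual API:

```java
import java.util.HashMap;
import java.util.Map;

public class MonotonicCheckpoints {
    // Last emitted downstream offset per (group, partition) key.
    private final Map<String, Long> lastEmitted = new HashMap<>();

    // Seed from checkpoints emitted by previous generations of the task,
    // read back from the checkpoints topic during initialization.
    public void restore(String key, long downstreamOffset) {
        lastEmitted.merge(key, downstreamOffset, Math::max);
    }

    // Returns true and records the offset if emitting this checkpoint would
    // not move translation backwards; returns false otherwise.
    public boolean tryEmit(String key, long downstreamOffset) {
        long prev = lastEmitted.getOrDefault(key, -1L);
        if (downstreamOffset < prev) return false;
        lastEmitted.put(key, downstreamOffset);
        return true;
    }
}
```

Because the guard is restored from durable state rather than kept only in memory, a restarted task can translate offsets from before its own lifetime while still never emitting a checkpoint older than one already published.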
[jira] [Created] (KAFKA-15862) Remove SecurityManager Support
Greg Harris created KAFKA-15862: --- Summary: Remove SecurityManager Support Key: KAFKA-15862 URL: https://issues.apache.org/jira/browse/KAFKA-15862 Project: Kafka Issue Type: New Feature Components: clients, KafkaConnect, Tiered-Storage Reporter: Greg Harris Assignee: Greg Harris https://cwiki.apache.org/confluence/display/KAFKA/KIP-1006%3A+Remove+SecurityManager+Support -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15816) Typos in tests leak network sockets
[ https://issues.apache.org/jira/browse/KAFKA-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15816: Description: There are a few tests which leak network sockets due to small typos in the tests themselves. Clients: [https://github.com/apache/kafka/pull/14750] * NioEchoServer * KafkaConsumerTest * KafkaProducerTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest * SaslAuthenticatorTest Core: [https://github.com/apache/kafka/pull/14754] * MiniKdc * GssapiAuthenticationTest * MirrorMakerIntegrationTest * SocketServerTest * EpochDrivenReplicationProtocolAcceptanceTest * LeaderEpochIntegrationTest Trogdor: [https://github.com/apache/kafka/pull/14771] * AgentTest Mirror: [https://github.com/apache/kafka/pull/14761] * DedicatedMirrorIntegrationTest * MirrorConnectorsIntegrationTest * MirrorConnectorsWithCustomForwardingAdminIntegrationTest Runtime: [https://github.com/apache/kafka/pull/14764] * ConnectorTopicsIntegrationTest * ExactlyOnceSourceIntegrationTest * WorkerTest * WorkerGroupMemberTest Streams: [https://github.com/apache/kafka/pull/14769] * IQv2IntegrationTest * MetricsReporterIntegrationTest * NamedTopologyIntegrationTest * PurgeRepartitionTopicIntegrationTest These can be addressed by just fixing the tests. was: There are a few tests which leak network sockets due to small typos in the tests themselves. Clients: https://github.com/apache/kafka/pull/14750 * KafkaConsumerTest * KafkaProducerTest * ConfigResourceTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest Core: * DescribeAuthorizedOperationsTest * SslGssapiSslEndToEndAuthorizationTest * SaslMultiMechanismConsumerTest * SaslPlaintextConsumerTest * SaslSslAdminIntegrationTest * SaslSslConsumerTest * MultipleListenersWithDefaultJaasContextTest * DescribeClusterRequestTest Trogdor: * AgentTest These can be addressed by just fixing the tests. 
> Typos in tests leak network sockets > --- > > Key: KAFKA-15816 > URL: https://issues.apache.org/jira/browse/KAFKA-15816 > Project: Kafka > Issue Type: Bug > Components: unit tests >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > > There are a few tests which leak network sockets due to small typos in the > tests themselves. > Clients: [https://github.com/apache/kafka/pull/14750] > * NioEchoServer > * KafkaConsumerTest > * KafkaProducerTest > * SelectorTest > * SslTransportLayerTest > * SslTransportTls12Tls13Test > * SslVersionsTransportLayerTest > * SaslAuthenticatorTest > Core: [https://github.com/apache/kafka/pull/14754] > * MiniKdc > * GssapiAuthenticationTest > * MirrorMakerIntegrationTest > * SocketServerTest > * EpochDrivenReplicationProtocolAcceptanceTest > * LeaderEpochIntegrationTest > Trogdor: [https://github.com/apache/kafka/pull/14771] > * AgentTest > Mirror: [https://github.com/apache/kafka/pull/14761] > * DedicatedMirrorIntegrationTest > * MirrorConnectorsIntegrationTest > * MirrorConnectorsWithCustomForwardingAdminIntegrationTest > Runtime: [https://github.com/apache/kafka/pull/14764] > * ConnectorTopicsIntegrationTest > * ExactlyOnceSourceIntegrationTest > * WorkerTest > * WorkerGroupMemberTest > Streams: [https://github.com/apache/kafka/pull/14769] > * IQv2IntegrationTest > * MetricsReporterIntegrationTest > * NamedTopologyIntegrationTest > * PurgeRepartitionTopicIntegrationTest > These can be addressed by just fixing the tests. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15845) Add Junit5 test extension which detects leaked Kafka clients and servers
Greg Harris created KAFKA-15845: --- Summary: Add Junit5 test extension which detects leaked Kafka clients and servers Key: KAFKA-15845 URL: https://issues.apache.org/jira/browse/KAFKA-15845 Project: Kafka Issue Type: Improvement Components: clients Reporter: Greg Harris Assignee: Greg Harris We have many tests which accidentally leak Kafka clients and servers. This contributes to test flakiness and build instability. We should use a test extension to make it easier to find these leaked clients and servers, and force test-implementors to resolve their resource leaks prior to merge. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15834) Subscribing to non-existent topic blocks StreamThread from stopping
Greg Harris created KAFKA-15834: --- Summary: Subscribing to non-existent topic blocks StreamThread from stopping Key: KAFKA-15834 URL: https://issues.apache.org/jira/browse/KAFKA-15834 Project: Kafka Issue Type: Bug Components: streams Affects Versions: 3.6.0 Reporter: Greg Harris In NamedTopologyIntegrationTest#shouldContinueProcessingOtherTopologiesWhenNewTopologyHasMissingInputTopics a topology is created which references an input topic which does not exist. The test as-written passes, but the KafkaStreams#close(Duration) at the end times out, and leaves StreamsThreads running. From some cursory investigation it appears that this is happening: 1. The consumer calls the StreamsPartitionAssignor, which calls TaskManager#handleRebalanceStart as a side-effect 2. handleRebalanceStart sets the rebalanceInProgress flag 3. This flag is checked by StreamThread.runLoop, and causes the loop to remain running. 4. The consumer never calls StreamsRebalanceListener#onPartitionsAssigned, because the topic does not exist 5. 
Because no partitions are ever assigned, the TaskManager#handleRebalanceComplete never clears the rebalanceInProgress flag This log message is printed in a tight loop while the close is ongoing and the consumer is being polled with zero duration: {noformat} [2023-11-15 11:42:43,661] WARN [Consumer clientId=NamedTopologyIntegrationTestshouldContinueProcessingOtherTopologiesWhenNewTopologyHasMissingInputTopics-942756f8-5213-4c44-bb6b-5f805884e026-StreamThread-1-consumer, groupId=NamedTopologyIntegrationTestshouldContinueProcessingOtherTopologiesWhenNewTopologyHasMissingInputTopics] Received unknown topic or partition error in fetch for partition unique_topic_prefix-topology-1-store-repartition-0 (org.apache.kafka.clients.consumer.internals.FetchCollector:321) {noformat} Practically, this means that this test leaks two StreamsThreads and the associated clients and sockets, and delays the completion of the test until the KafkaStreams#close(Duration) call times out. Either we should change the rebalanceInProgress flag to avoid getting stuck in this rebalance state, or figure out a way to shut down a StreamsThread that is in an extended rebalance state during shutdown. -- This message was sent by Atlassian Jira (v8.20.10#820010)
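The stuck condition in steps 3-5 reduces to a simple predicate. This is a simplified, hypothetical sketch of the loop check, not the actual StreamThread code:

```java
public class RunLoopSketch {
    // The run loop keeps going while the thread is running OR a rebalance is
    // marked in progress. When onPartitionsAssigned never fires (e.g. the
    // subscribed topic does not exist), the rebalance flag is never cleared,
    // so a shutdown request alone cannot stop the loop.
    public static boolean shouldContinue(boolean isRunning, boolean rebalanceInProgress) {
        return isRunning || rebalanceInProgress;
    }
}
```

Either fix direction in the ticket amounts to making the second operand eventually become false during shutdown: clear the flag when shutdown is requested, or stop consulting it once the thread is asked to die.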
[jira] [Created] (KAFKA-15827) KafkaBasedLog.withExistingClients leaks clients if start is not called
Greg Harris created KAFKA-15827: --- Summary: KafkaBasedLog.withExistingClients leaks clients if start is not called Key: KAFKA-15827 URL: https://issues.apache.org/jira/browse/KAFKA-15827 Project: Kafka Issue Type: Bug Components: KafkaConnect Affects Versions: 3.6.0 Reporter: Greg Harris Assignee: Greg Harris The KafkaBasedLog base implementation creates its own consumers and producers, and closes them when it is stopped. There are subclasses of the KafkaBasedLog which accept pre-created consumers and producers, and have the responsibility for closing the clients when the KafkaBasedLog is stopped. It appears that the KafkaBasedLog subclasses do not close the clients when start() is skipped and stop() is called directly. This happens in a few tests, and causes the passed-in clients to be leaked. -- This message was sent by Atlassian Jira (v8.20.10#820010)
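The fix direction implied above, closing owned clients in stop() unconditionally rather than only after start(), can be sketched as follows. The class is hypothetical and stands in for any wrapper that takes ownership of pre-created clients; it is not Connect's actual KafkaBasedLog API:

```java
public class OwnedClientsLog {
    private final AutoCloseable producer;
    private final AutoCloseable consumer;

    public OwnedClientsLog(AutoCloseable producer, AutoCloseable consumer) {
        this.producer = producer;
        this.consumer = consumer;
    }

    public void start() { /* begin reading the backing topic; omitted */ }

    // stop() closes the clients unconditionally, so a caller that never
    // invoked start() still releases them instead of leaking sockets.
    public void stop() {
        closeQuietly(producer);
        closeQuietly(consumer);
    }

    private static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception e) {
            // log and continue: a failed close must not prevent closing the other client
        }
    }
}
```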
[jira] [Created] (KAFKA-15826) WorkerSinkTask leaks Consumer if plugin start or stop blocks indefinitely
Greg Harris created KAFKA-15826: --- Summary: WorkerSinkTask leaks Consumer if plugin start or stop blocks indefinitely Key: KAFKA-15826 URL: https://issues.apache.org/jira/browse/KAFKA-15826 Project: Kafka Issue Type: Bug Components: KafkaConnect Affects Versions: 3.6.0 Reporter: Greg Harris Assignee: Greg Harris The WorkerSourceTask cancel() method closes the Producer, releasing its resources. The WorkerSinkTask does not do the same for the Consumer, as it does not override the cancel() method. WorkerSinkTask should close the consumer if the task is cancelled, as progress for a cancelled task will be discarded anyway. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15819) KafkaServer leaks KafkaRaftManager when ZK migration enabled
[ https://issues.apache.org/jira/browse/KAFKA-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15819: Priority: Minor (was: Major) > KafkaServer leaks KafkaRaftManager when ZK migration enabled > > > Key: KAFKA-15819 > URL: https://issues.apache.org/jira/browse/KAFKA-15819 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > > In SharedServer, TestRaftServer, and MetadataShell, the KafkaRaftManager is > maintained as an instance variable, and shutdown when the outer instance is > shutdown. However, in the KafkaServer, the KafkaRaftManager is instantiated > and started, but then the reference is lost. > [https://github.com/apache/kafka/blob/49d3122d425171b6a59a2b6f02d3fe63d3ac2397/core/src/main/scala/kafka/server/KafkaServer.scala#L416-L442] > Instead, the KafkaServer should behave like the other call-sites of > KafkaRaftManager, and shutdown the KafkaRaftManager during shutdown. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15819) KafkaServer leaks KafkaRaftManager when ZK migration enabled
Greg Harris created KAFKA-15819: --- Summary: KafkaServer leaks KafkaRaftManager when ZK migration enabled Key: KAFKA-15819 URL: https://issues.apache.org/jira/browse/KAFKA-15819 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.6.0 Reporter: Greg Harris Assignee: Greg Harris In SharedServer, TestRaftServer, and MetadataShell, the KafkaRaftManager is maintained as an instance variable, and shut down when the outer instance is shut down. However, in the KafkaServer, the KafkaRaftManager is instantiated and started, but then the reference is lost. [https://github.com/apache/kafka/blob/49d3122d425171b6a59a2b6f02d3fe63d3ac2397/core/src/main/scala/kafka/server/KafkaServer.scala#L416-L442] Instead, the KafkaServer should behave like the other call-sites of KafkaRaftManager, and shut down the KafkaRaftManager during shutdown. -- This message was sent by Atlassian Jira (v8.20.10#820010)
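The lifecycle pattern the ticket asks KafkaServer to follow, retaining the manager as a field so shutdown can reach it, can be sketched generically. The class and interface below are hypothetical illustrations, not Kafka's actual Scala code:

```java
public class ServerLifecycleSketch {
    public interface Managed {
        void startup();
        void shutdown();
    }

    // Retained as a field, not a local variable that is dropped after startup.
    private Managed raftManager;

    public void startup(Managed manager) {
        this.raftManager = manager;
        manager.startup();
    }

    public void shutdown() {
        if (raftManager != null) {
            // The step the ticket says KafkaServer currently skips: without a
            // retained reference there is nothing to call shutdown() on.
            raftManager.shutdown();
        }
    }
}
```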
[jira] [Updated] (KAFKA-15816) Typos in tests leak network sockets
[ https://issues.apache.org/jira/browse/KAFKA-15816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15816: Description: There are a few tests which leak network sockets due to small typos in the tests themselves. Clients: https://github.com/apache/kafka/pull/14750 * KafkaConsumerTest * KafkaProducerTest * ConfigResourceTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest Core: * DescribeAuthorizedOperationsTest * SslGssapiSslEndToEndAuthorizationTest * SaslMultiMechanismConsumerTest * SaslPlaintextConsumerTest * SaslSslAdminIntegrationTest * SaslSslConsumerTest * MultipleListenersWithDefaultJaasContextTest * DescribeClusterRequestTest Trogdor: * AgentTest These can be addressed by just fixing the tests. was: There are a few tests which leak network sockets due to small typos in the tests themselves. Clients: * KafkaConsumerTest * KafkaProducerTest * ConfigResourceTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest Core: * DescribeAuthorizedOperationsTest * SslGssapiSslEndToEndAuthorizationTest * SaslMultiMechanismConsumerTest * SaslPlaintextConsumerTest * SaslSslAdminIntegrationTest * SaslSslConsumerTest * MultipleListenersWithDefaultJaasContextTest * DescribeClusterRequestTest Trogdor: * AgentTest These can be addressed by just fixing the tests. > Typos in tests leak network sockets > --- > > Key: KAFKA-15816 > URL: https://issues.apache.org/jira/browse/KAFKA-15816 > Project: Kafka > Issue Type: Bug > Components: unit tests >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > > There are a few tests which leak network sockets due to small typos in the > tests themselves. 
> Clients: https://github.com/apache/kafka/pull/14750 > * KafkaConsumerTest > * KafkaProducerTest > * ConfigResourceTest > * SelectorTest > * SslTransportLayerTest > * SslTransportTls12Tls13Test > * SslVersionsTransportLayerTest > Core: > * DescribeAuthorizedOperationsTest > * SslGssapiSslEndToEndAuthorizationTest > * SaslMultiMechanismConsumerTest > * SaslPlaintextConsumerTest > * SaslSslAdminIntegrationTest > * SaslSslConsumerTest > * MultipleListenersWithDefaultJaasContextTest > * DescribeClusterRequestTest > Trogdor: > * AgentTest > These can be addressed by just fixing the tests. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15816) Typos in tests leak network sockets
Greg Harris created KAFKA-15816: --- Summary: Typos in tests leak network sockets Key: KAFKA-15816 URL: https://issues.apache.org/jira/browse/KAFKA-15816 Project: Kafka Issue Type: Bug Components: unit tests Affects Versions: 3.6.0 Reporter: Greg Harris Assignee: Greg Harris There are a few tests which leak network sockets due to small typos in the tests themselves. Clients: * KafkaConsumerTest * KafkaProducerTest * ConfigResourceTest * SelectorTest * SslTransportLayerTest * SslTransportTls12Tls13Test * SslVersionsTransportLayerTest Core: * DescribeAuthorizedOperationsTest * SslGssapiSslEndToEndAuthorizationTest * SaslMultiMechanismConsumerTest * SaslPlaintextConsumerTest * SaslSslAdminIntegrationTest * SaslSslConsumerTest * MultipleListenersWithDefaultJaasContextTest * DescribeClusterRequestTest Trogdor: * AgentTest These can be addressed by just fixing the tests. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15815) JsonRestServer leaks sockets via HttpURLConnection when keep-alive enabled
Greg Harris created KAFKA-15815: --- Summary: JsonRestServer leaks sockets via HttpURLConnection when keep-alive enabled Key: KAFKA-15815 URL: https://issues.apache.org/jira/browse/KAFKA-15815 Project: Kafka Issue Type: Bug Affects Versions: 3.6.0 Reporter: Greg Harris By default HttpURLConnection has keep-alive enabled, which allows a single HttpURLConnection to be left open in order to be re-used for later requests. This means that despite JsonRestServer calling `close()` on the relevant InputStream, and calling `disconnect()` on the connection itself, the HttpURLConnection does not call `close()` on the underlying socket. This affects the Trogdor AgentTest and CoordinatorTest suites, where most of the methods make HTTP requests using the JsonRestServer. The effect is that ~32 sockets are leaked per test run, all remaining in the CLOSE_WAIT state (half closed) after the test. This is because the JettyServer has correctly closed the connections, but the HttpURLConnection has not. There does not appear to be a way to locally override the HttpURLConnection's behavior in this case, and only disabling keep-alive overall (via the system property `http.keepAlive=false`) seems to resolve the socket leaks. To prevent the leaks, we can move JsonRestServer to an alternative HTTP implementation, perhaps the jetty-client that Connect uses, or disable keepAlive during tests. -- This message was sent by Atlassian Jira (v8.20.10#820010)
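The keep-alive workaround mentioned in the ticket is a one-line JVM setting. `http.keepAlive` is a real JDK networking property, but whether this suffices depends on it being set before the first HttpURLConnection is created, since the JDK HTTP client may read it only once:

```java
public class KeepAliveWorkaround {
    // Disable connection reuse globally so each HttpURLConnection's underlying
    // socket is fully closed on disconnect(), instead of lingering in the
    // keep-alive cache (and showing up as CLOSE_WAIT in leak detectors).
    public static void disableKeepAlive() {
        System.setProperty("http.keepAlive", "false");
    }
}
```

In a test suite this would typically run in a static initializer or a before-all hook; the longer-term fix suggested above, moving to a client like jetty-client with an explicit lifecycle, avoids relying on a process-wide property.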
[jira] [Commented] (KAFKA-15804) Broker leaks ServerSocketChannel when exception is thrown from ZkConfigManager during startup
[ https://issues.apache.org/jira/browse/KAFKA-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784630#comment-17784630 ] Greg Harris commented on KAFKA-15804: - I believe what is happening is: 1. The SocketServer is created 2. The exception is thrown from the dynamicConfigManager 3. SocketServer.enableRequestProcessing is never called, so the SocketServer is never started 4. Because the SocketServer is never started, the SocketServer main runnable never exits, and so the SocketServerChannel is never closed. I think that when the SocketServer beginShutdown()/shutdown() is called without calling enableRequestProcessing() first, the SocketServer should still close these resources. > Broker leaks ServerSocketChannel when exception is thrown from > ZkConfigManager during startup > - > > Key: KAFKA-15804 > URL: https://issues.apache.org/jira/browse/KAFKA-15804 > Project: Kafka > Issue Type: Bug > Components: core, unit tests >Affects Versions: 3.6.0 >Reporter: Greg Harris >Priority: Minor > > This exception is thrown during the > RemoteTopicCrudTest.testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic > test in zk mode: > {noformat} > org.apache.kafka.common.config.ConfigException: You have to delete all topics > with the property remote.storage.enable=true before disabling tiered storage > cluster-wide > at > org.apache.kafka.storage.internals.log.LogConfig.validateRemoteStorageOnlyIfSystemEnabled(LogConfig.java:566) > at kafka.log.LogManager.updateTopicConfig(LogManager.scala:956) > at > kafka.server.TopicConfigHandler.updateLogConfig(ConfigHandler.scala:73) > at > kafka.server.TopicConfigHandler.processConfigChanges(ConfigHandler.scala:94) > at > kafka.server.ZkConfigManager.$anonfun$startup$4(ZkConfigManager.scala:176) > at > kafka.server.ZkConfigManager.$anonfun$startup$4$adapted(ZkConfigManager.scala:175) > at scala.collection.immutable.Map$Map2.foreach(Map.scala:360) > at > 
kafka.server.ZkConfigManager.$anonfun$startup$1(ZkConfigManager.scala:175) > at > kafka.server.ZkConfigManager.$anonfun$startup$1$adapted(ZkConfigManager.scala:166) > at scala.collection.immutable.HashMap.foreach(HashMap.scala:1115) > at kafka.server.ZkConfigManager.startup(ZkConfigManager.scala:166) > at kafka.server.KafkaServer.startup(KafkaServer.scala:575) > at > kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1(KafkaServerTestHarness.scala:356) > at > kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1$adapted(KafkaServerTestHarness.scala:352) > at scala.collection.immutable.List.foreach(List.scala:333) > at > kafka.integration.KafkaServerTestHarness.createBrokers(KafkaServerTestHarness.scala:352) > at > kafka.integration.KafkaServerTestHarness.recreateBrokers(KafkaServerTestHarness.scala:146) > at > kafka.admin.RemoteTopicCrudTest.$anonfun$testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic$1(RemoteTopicCrudTest.scala:319) > at > org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:53) > at > org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35) > at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111) > at > kafka.admin.RemoteTopicCrudTest.testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic(RemoteTopicCrudTest.scala:319){noformat} > This leak only occurs for this one test in the RemoteTopicCrudTest; all other > tests including the kraft-mode version do not exhibit a leaked socket. 
> Here is where the ServerSocket is instantiated: > {noformat} > at > java.base/java.nio.channels.ServerSocketChannel.open(ServerSocketChannel.java:113) > at kafka.network.Acceptor.openServerSocket(SocketServer.scala:724) > at kafka.network.Acceptor.(SocketServer.scala:608) > at kafka.network.DataPlaneAcceptor.(SocketServer.scala:454) > at > kafka.network.SocketServer.createDataPlaneAcceptor(SocketServer.scala:270) > at > kafka.network.SocketServer.createDataPlaneAcceptorAndProcessors(SocketServer.scala:249) > at kafka.network.SocketServer.$anonfun$new$31(SocketServer.scala:175) > at > kafka.network.SocketServer.$anonfun$new$31$adapted(SocketServer.scala:175) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > at scala.collection.AbstractIterable.foreach(Iterable.scala:933) > at kafka.network.SocketServer.(SocketServer.scala:175) > at kafka.server.KafkaServer.startup(KafkaServer.scala:344) >
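The fix proposed in the comment above — closing the listening socket during beginShutdown()/shutdown() even when enableRequestProcessing() was never called — can be sketched as follows. This is a simplified, hypothetical model, not the actual Kafka SocketServer code; the class and method names only loosely mirror the stack traces.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class SocketServerSketch {
    private final ServerSocketChannel serverChannel;
    private volatile boolean started = false;

    public SocketServerSketch(int port) throws IOException {
        // The socket is opened eagerly at construction time, mirroring
        // Acceptor.openServerSocket in the stack trace above.
        serverChannel = ServerSocketChannel.open();
        serverChannel.bind(new InetSocketAddress(port));
    }

    public void enableRequestProcessing() {
        // Would start the acceptor thread, which closes the socket on exit.
        started = true;
    }

    public void shutdown() {
        // The sketch of the fix: close the channel unconditionally, instead
        // of relying on the (possibly never-started) acceptor thread to do it.
        try {
            if (serverChannel.isOpen()) {
                serverChannel.close();
            }
        } catch (IOException e) {
            // best-effort close during shutdown
        }
    }

    public boolean isOpen() {
        return serverChannel.isOpen();
    }
}
```

With this shape, the leak scenario (constructed but never started, then shut down) still releases the socket.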
[jira] [Updated] (KAFKA-15804) Broker leaks ServerSocketChannel when exception is thrown from ZkConfigManager during startup
[ https://issues.apache.org/jira/browse/KAFKA-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15804: Component/s: (was: Tiered-Storage) > Broker leaks ServerSocketChannel when exception is thrown from > ZkConfigManager during startup > - > > Key: KAFKA-15804 > URL: https://issues.apache.org/jira/browse/KAFKA-15804 > Project: Kafka > Issue Type: Bug > Components: core, unit tests >Affects Versions: 3.6.0 >Reporter: Greg Harris >Priority: Minor > > This exception is thrown during the > RemoteTopicCrudTest.testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic > test in zk mode: > {noformat} > org.apache.kafka.common.config.ConfigException: You have to delete all topics > with the property remote.storage.enable=true before disabling tiered storage > cluster-wide > at > org.apache.kafka.storage.internals.log.LogConfig.validateRemoteStorageOnlyIfSystemEnabled(LogConfig.java:566) > at kafka.log.LogManager.updateTopicConfig(LogManager.scala:956) > at > kafka.server.TopicConfigHandler.updateLogConfig(ConfigHandler.scala:73) > at > kafka.server.TopicConfigHandler.processConfigChanges(ConfigHandler.scala:94) > at > kafka.server.ZkConfigManager.$anonfun$startup$4(ZkConfigManager.scala:176) > at > kafka.server.ZkConfigManager.$anonfun$startup$4$adapted(ZkConfigManager.scala:175) > at scala.collection.immutable.Map$Map2.foreach(Map.scala:360) > at > kafka.server.ZkConfigManager.$anonfun$startup$1(ZkConfigManager.scala:175) > at > kafka.server.ZkConfigManager.$anonfun$startup$1$adapted(ZkConfigManager.scala:166) > at scala.collection.immutable.HashMap.foreach(HashMap.scala:1115) > at kafka.server.ZkConfigManager.startup(ZkConfigManager.scala:166) > at kafka.server.KafkaServer.startup(KafkaServer.scala:575) > at > kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1(KafkaServerTestHarness.scala:356) > at > 
kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1$adapted(KafkaServerTestHarness.scala:352) > at scala.collection.immutable.List.foreach(List.scala:333) > at > kafka.integration.KafkaServerTestHarness.createBrokers(KafkaServerTestHarness.scala:352) > at > kafka.integration.KafkaServerTestHarness.recreateBrokers(KafkaServerTestHarness.scala:146) > at > kafka.admin.RemoteTopicCrudTest.$anonfun$testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic$1(RemoteTopicCrudTest.scala:319) > at > org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:53) > at > org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35) > at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111) > at > kafka.admin.RemoteTopicCrudTest.testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic(RemoteTopicCrudTest.scala:319){noformat} > This leak only occurs for this one test in the RemoteTopicCrudTest; all other > tests including the kraft-mode version do not exhibit a leaked socket. 
> Here is where the ServerSocket is instantiated: > {noformat} > at > java.base/java.nio.channels.ServerSocketChannel.open(ServerSocketChannel.java:113) > at kafka.network.Acceptor.openServerSocket(SocketServer.scala:724) > at kafka.network.Acceptor.(SocketServer.scala:608) > at kafka.network.DataPlaneAcceptor.(SocketServer.scala:454) > at > kafka.network.SocketServer.createDataPlaneAcceptor(SocketServer.scala:270) > at > kafka.network.SocketServer.createDataPlaneAcceptorAndProcessors(SocketServer.scala:249) > at kafka.network.SocketServer.$anonfun$new$31(SocketServer.scala:175) > at > kafka.network.SocketServer.$anonfun$new$31$adapted(SocketServer.scala:175) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > at scala.collection.AbstractIterable.foreach(Iterable.scala:933) > at kafka.network.SocketServer.(SocketServer.scala:175) > at kafka.server.KafkaServer.startup(KafkaServer.scala:344) > at > kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1(KafkaServerTestHarness.scala:356) > at > kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1$adapted(KafkaServerTestHarness.scala:352) > at scala.collection.immutable.List.foreach(List.scala:333) > at > kafka.integration.KafkaServerTestHarness.createBrokers(KafkaServerTestHarness.scala:352) > at > kafka.integration.KafkaServerTestHarness.recreateBrokers(KafkaServerTestHarness.scala:146) > at >
[jira] [Updated] (KAFKA-15804) Broker leaks ServerSocketChannel when exception is thrown from ZkConfigManager during startup
[ https://issues.apache.org/jira/browse/KAFKA-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15804: Description: This exception is thrown during the RemoteTopicCrudTest.testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic test in zk mode: {noformat} org.apache.kafka.common.config.ConfigException: You have to delete all topics with the property remote.storage.enable=true before disabling tiered storage cluster-wide at org.apache.kafka.storage.internals.log.LogConfig.validateRemoteStorageOnlyIfSystemEnabled(LogConfig.java:566) at kafka.log.LogManager.updateTopicConfig(LogManager.scala:956) at kafka.server.TopicConfigHandler.updateLogConfig(ConfigHandler.scala:73) at kafka.server.TopicConfigHandler.processConfigChanges(ConfigHandler.scala:94) at kafka.server.ZkConfigManager.$anonfun$startup$4(ZkConfigManager.scala:176) at kafka.server.ZkConfigManager.$anonfun$startup$4$adapted(ZkConfigManager.scala:175) at scala.collection.immutable.Map$Map2.foreach(Map.scala:360) at kafka.server.ZkConfigManager.$anonfun$startup$1(ZkConfigManager.scala:175) at kafka.server.ZkConfigManager.$anonfun$startup$1$adapted(ZkConfigManager.scala:166) at scala.collection.immutable.HashMap.foreach(HashMap.scala:1115) at kafka.server.ZkConfigManager.startup(ZkConfigManager.scala:166) at kafka.server.KafkaServer.startup(KafkaServer.scala:575) at kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1(KafkaServerTestHarness.scala:356) at kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1$adapted(KafkaServerTestHarness.scala:352) at scala.collection.immutable.List.foreach(List.scala:333) at kafka.integration.KafkaServerTestHarness.createBrokers(KafkaServerTestHarness.scala:352) at kafka.integration.KafkaServerTestHarness.recreateBrokers(KafkaServerTestHarness.scala:146) at kafka.admin.RemoteTopicCrudTest.$anonfun$testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic$1(RemoteTopicCrudTest.scala:319) at 
org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:53) at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35) at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111) at kafka.admin.RemoteTopicCrudTest.testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic(RemoteTopicCrudTest.scala:319){noformat} This leak only occurs for this one test in the RemoteTopicCrudTest; all other tests including the kraft-mode version do not exhibit a leaked socket. Here is where the ServerSocket is instantiated: {noformat} at java.base/java.nio.channels.ServerSocketChannel.open(ServerSocketChannel.java:113) at kafka.network.Acceptor.openServerSocket(SocketServer.scala:724) at kafka.network.Acceptor.(SocketServer.scala:608) at kafka.network.DataPlaneAcceptor.(SocketServer.scala:454) at kafka.network.SocketServer.createDataPlaneAcceptor(SocketServer.scala:270) at kafka.network.SocketServer.createDataPlaneAcceptorAndProcessors(SocketServer.scala:249) at kafka.network.SocketServer.$anonfun$new$31(SocketServer.scala:175) at kafka.network.SocketServer.$anonfun$new$31$adapted(SocketServer.scala:175) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) at scala.collection.AbstractIterable.foreach(Iterable.scala:933) at kafka.network.SocketServer.(SocketServer.scala:175) at kafka.server.KafkaServer.startup(KafkaServer.scala:344) at kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1(KafkaServerTestHarness.scala:356) at kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1$adapted(KafkaServerTestHarness.scala:352) at scala.collection.immutable.List.foreach(List.scala:333) at kafka.integration.KafkaServerTestHarness.createBrokers(KafkaServerTestHarness.scala:352) at kafka.integration.KafkaServerTestHarness.recreateBrokers(KafkaServerTestHarness.scala:146) at 
kafka.admin.RemoteTopicCrudTest.$anonfun$testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic$1(RemoteTopicCrudTest.scala:319) at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:53) at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35) at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111) at kafka.admin.RemoteTopicCrudTest.testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic(RemoteTopicCrudTest.scala:319){noformat} And the associated DataPlaneAcceptor: {noformat} at java.base/java.nio.channels.Selector.open(Selector.java:295) at
[jira] [Created] (KAFKA-15804) Broker leaks ServerSocketChannel when exception is thrown from ZkConfigManager during startup
Greg Harris created KAFKA-15804: --- Summary: Broker leaks ServerSocketChannel when exception is thrown from ZkConfigManager during startup Key: KAFKA-15804 URL: https://issues.apache.org/jira/browse/KAFKA-15804 Project: Kafka Issue Type: Bug Components: core, Tiered-Storage, unit tests Affects Versions: 3.6.0 Reporter: Greg Harris This exception is thrown during the RemoteTopicCrudTest.testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic test in zk mode: {noformat} org.apache.kafka.common.config.ConfigException: You have to delete all topics with the property remote.storage.enable=true before disabling tiered storage cluster-wide org.apache.kafka.storage.internals.log.LogConfig.validateRemoteStorageOnlyIfSystemEnabled(LogConfig.java:566) kafka.log.LogManager.updateTopicConfig(LogManager.scala:956) kafka.server.TopicConfigHandler.updateLogConfig(ConfigHandler.scala:73) kafka.server.TopicConfigHandler.processConfigChanges(ConfigHandler.scala:94) kafka.server.ZkConfigManager.$anonfun$startup$4(ZkConfigManager.scala:176) kafka.server.ZkConfigManager.$anonfun$startup$4$adapted(ZkConfigManager.scala:175) scala.collection.immutable.Map$Map2.foreach(Map.scala:360) kafka.server.ZkConfigManager.$anonfun$startup$1(ZkConfigManager.scala:175) kafka.server.ZkConfigManager.$anonfun$startup$1$adapted(ZkConfigManager.scala:166) scala.collection.immutable.HashMap.foreach(HashMap.scala:1115) kafka.server.ZkConfigManager.startup(ZkConfigManager.scala:166) kafka.server.KafkaServer.startup(KafkaServer.scala:575) kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1(KafkaServerTestHarness.scala:356) kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1$adapted(KafkaServerTestHarness.scala:352) scala.collection.immutable.List.foreach(List.scala:333) kafka.integration.KafkaServerTestHarness.createBrokers(KafkaServerTestHarness.scala:352) kafka.integration.KafkaServerTestHarness.recreateBrokers(KafkaServerTestHarness.scala:146) 
kafka.admin.RemoteTopicCrudTest.$anonfun$testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic$1(RemoteTopicCrudTest.scala:319) org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:53) org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35) org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111) kafka.admin.RemoteTopicCrudTest.testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic(RemoteTopicCrudTest.scala:319){noformat} This leak only occurs for this one test in the RemoteTopicCrudTest; all other tests including the kraft-mode version do not exhibit a leaked socket. Here is where the ServerSocket is instantiated: {noformat} at java.base/java.nio.channels.ServerSocketChannel.open(ServerSocketChannel.java:113) at kafka.network.Acceptor.openServerSocket(SocketServer.scala:724) at kafka.network.Acceptor.(SocketServer.scala:608) at kafka.network.DataPlaneAcceptor.(SocketServer.scala:454) at kafka.network.SocketServer.createDataPlaneAcceptor(SocketServer.scala:270) at kafka.network.SocketServer.createDataPlaneAcceptorAndProcessors(SocketServer.scala:249) at kafka.network.SocketServer.$anonfun$new$31(SocketServer.scala:175) at kafka.network.SocketServer.$anonfun$new$31$adapted(SocketServer.scala:175) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) at scala.collection.AbstractIterable.foreach(Iterable.scala:933) at kafka.network.SocketServer.(SocketServer.scala:175) at kafka.server.KafkaServer.startup(KafkaServer.scala:344) at kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1(KafkaServerTestHarness.scala:356) at kafka.integration.KafkaServerTestHarness.$anonfun$createBrokers$1$adapted(KafkaServerTestHarness.scala:352) at scala.collection.immutable.List.foreach(List.scala:333) at kafka.integration.KafkaServerTestHarness.createBrokers(KafkaServerTestHarness.scala:352) at 
kafka.integration.KafkaServerTestHarness.recreateBrokers(KafkaServerTestHarness.scala:146) at kafka.admin.RemoteTopicCrudTest.$anonfun$testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic$1(RemoteTopicCrudTest.scala:319) at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:53) at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35) at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111) at kafka.admin.RemoteTopicCrudTest.testClusterWideDisablementOfTieredStorageWithEnabledTieredTopic(RemoteTopicCrudTest.scala:319){noformat} And the associated DataPlaneAcceptor: {noformat} at java.base/java.nio.channels.Selector.open(Selector.java:295) at
[jira] [Commented] (KAFKA-15800) Malformed connect source offsets corrupt other partitions with DataException
[ https://issues.apache.org/jira/browse/KAFKA-15800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784555#comment-17784555 ] Greg Harris commented on KAFKA-15800: - [~showuon] Sorry for not bringing this up in the release thread, that slipped my mind. We should be able to merge this before Tuesday 14th. I'll work with the reviewers to get this moving quickly. > Malformed connect source offsets corrupt other partitions with DataException > > > Key: KAFKA-15800 > URL: https://issues.apache.org/jira/browse/KAFKA-15800 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 3.5.0, 3.6.0, 3.5.1 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Blocker > Fix For: 3.5.2, 3.7.0, 3.6.1 > > > The KafkaOffsetBackingStore consumer callback was recently augmented with a > call to OffsetUtils.processPartitionKey: > [https://github.com/apache/kafka/blob/f1e58a35d7aebbe72844faf3e5019d9aa7a85e4a/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaOffsetBackingStore.java#L323] > This function deserializes the offset key, which may be malformed in the > topic: > [https://github.com/apache/kafka/blob/f1e58a35d7aebbe72844faf3e5019d9aa7a85e4a/connect/runtime/src/main/java/org/apache/kafka/connect/storage/OffsetUtils.java#L92] > When this happens, a DataException is thrown, and propagates to the > KafkaBasedLog try-catch surrounding the batch processing of the records: > [https://github.com/apache/kafka/blob/f1e58a35d7aebbe72844faf3e5019d9aa7a85e4a/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L445-L454] > For example: > {noformat} > ERROR Error polling: org.apache.kafka.connect.errors.DataException: > Converting byte[] to Kafka Connect data failed due to serialization error: > (org.apache.kafka.connect.util.KafkaBasedLog:453){noformat} > This means that one DataException for a malformed record may cause the > remainder of the batch to be dropped, corrupting the in-memory state of 
the > KafkaOffsetBackingStore. This prevents tasks using the > KafkaOffsetBackingStore from seeing all of the offsets in the topics, and can > cause duplicate records to be emitted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
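The failure mode described above — one malformed record's DataException aborting the whole consumed batch — suggests moving the try/catch inside the per-record loop. A minimal sketch of that pattern, with hypothetical names (this is not the actual KafkaBasedLog code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PerRecordErrorSketch {
    /**
     * Applies {@code parse} to each raw record and folds the result into an
     * in-memory state map. A record whose parse throws is skipped, so the
     * rest of the batch is still applied, instead of the whole batch being
     * dropped by an outer try/catch.
     */
    public static Map<String, String> applyBatch(List<String> rawRecords,
                                                 Function<String, String[]> parse) {
        Map<String, String> state = new HashMap<>();
        for (String raw : rawRecords) {
            try {
                String[] kv = parse.apply(raw); // may throw on malformed input
                state.put(kv[0], kv[1]);
            } catch (RuntimeException e) {
                // Log-and-skip: a malformed record affects only its own entry.
            }
        }
        return state;
    }
}
```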
[jira] [Created] (KAFKA-15800) Malformed connect source offsets corrupt other partitions with DataException
Greg Harris created KAFKA-15800: --- Summary: Malformed connect source offsets corrupt other partitions with DataException Key: KAFKA-15800 URL: https://issues.apache.org/jira/browse/KAFKA-15800 Project: Kafka Issue Type: Bug Components: KafkaConnect Affects Versions: 3.5.1, 3.6.0, 3.5.0 Reporter: Greg Harris Assignee: Greg Harris Fix For: 3.5.2, 3.7.0, 3.6.1 The KafkaOffsetBackingStore consumer callback was recently augmented with a call to OffsetUtils.processPartitionKey: [https://github.com/apache/kafka/blob/f1e58a35d7aebbe72844faf3e5019d9aa7a85e4a/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaOffsetBackingStore.java#L323] This function deserializes the offset key, which may be malformed in the topic: [https://github.com/apache/kafka/blob/f1e58a35d7aebbe72844faf3e5019d9aa7a85e4a/connect/runtime/src/main/java/org/apache/kafka/connect/storage/OffsetUtils.java#L92] When this happens, a DataException is thrown, and propagates to the KafkaBasedLog try-catch surrounding the batch processing of the records: [https://github.com/apache/kafka/blob/f1e58a35d7aebbe72844faf3e5019d9aa7a85e4a/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L445-L454] For example: {noformat} ERROR Error polling: org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed due to serialization error: (org.apache.kafka.connect.util.KafkaBasedLog:453){noformat} This means that one DataException for a malformed record may cause the remainder of the batch to be dropped, corrupting the in-memory state of the KafkaOffsetBackingStore. This prevents tasks using the KafkaOffsetBackingStore from seeing all of the offsets in the topics, and can cause duplicate records to be emitted. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15564) Kafka 3.5.1- Mirror Maker 2 replicating the wrong consumer group offsets in destination
[ https://issues.apache.org/jira/browse/KAFKA-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779568#comment-17779568 ] Greg Harris commented on KAFKA-15564: - [~hemanthsavasere] Were you able to work around this problem with `offset.lag.max`? > Kafka 3.5.1- Mirror Maker 2 replicating the wrong consumer group offsets in > destination > > > Key: KAFKA-15564 > URL: https://issues.apache.org/jira/browse/KAFKA-15564 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Affects Versions: 3.5.1 >Reporter: Hemanth Savasere >Priority: Major > Attachments: Destination Consumer Group Offsets Describe.txt, MM2 > 3.5.1 Logs.txt, Source Consumer Group Offsets Describe.txt, > docker-compose.yml, mm2.properties > > > Issue : Mirror Maker 2 not replicating the proper consumer group offsets when > replication happens from source to destination kafka brokers > Steps to Reproduce : > # Created 2 Kafka clusters locally using the attached docker-compose.yml > file. Was using the confluent platform 7.2.1 docker images > # Had added the below section in docker file to create 20 topics and produce > randomly between 1000 to 2000 messages in each topic, and then consume the > same messages using consumer-groups. > {code:java} > /etc/confluent/docker/run & > sleep 20 > for i in {1..20}; do > kafka-topics --create --bootstrap-server kafka-1:9092 > --replication-factor 1 --partitions 1 --topic topic$$i > done > for i in {1..20}; do > num_msgs=$$(shuf -i 1000-2000 -n 1) > seq 1 $$num_msgs | kafka-console-producer --broker-list > kafka-1:9092 --topic topic$$i > done > for i in {1..20}; do > timeout 10 kafka-console-consumer --bootstrap-server kafka-1:9092 > --topic topic$$i --group consumer-group$$i > done > wait > {code} > # Ran the Mirror Maker 2 using the connect-mirror-maker.sh script after > downloading the Kafka 3.5.1 release binaries. Verified the 3.5.1 version was > running and commitID 2c6fb6c54472e90a was shown in attached MM2 logs file. 
> NOTE : Have attached the Kafka Docker file, the mirror maker 2 logs and > mirror maker 2 properties file. Also, have attached the describe command > output of all the consumer groups present in source and destination. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14767) Gradle build fails with missing commitId after git gc
[ https://issues.apache.org/jira/browse/KAFKA-14767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-14767: Fix Version/s: 3.4.2 3.5.2 3.6.1 > Gradle build fails with missing commitId after git gc > - > > Key: KAFKA-14767 > URL: https://issues.apache.org/jira/browse/KAFKA-14767 > Project: Kafka > Issue Type: Bug > Components: build >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > Fix For: 3.4.2, 3.5.2, 3.7.0, 3.6.1 > > > Reproduction steps: > 1. `git gc` > 2. `./gradlew jar` > Expected behavior: build completes successfully (or shows other build errors) > Actual behavior: > {noformat} > Task failed with an exception. > --- > * What went wrong: > A problem was found with the configuration of task > ':storage:createVersionFile' (type 'DefaultTask'). > - Property 'commitId' doesn't have a configured value. > > Reason: This property isn't marked as optional and no value has been > configured. > > Possible solutions: > 1. Assign a value to 'commitId'. > 2. Mark property 'commitId' as optional. > > Please refer to > https://docs.gradle.org/7.6/userguide/validation_problems.html#value_not_set > for more details about this problem.{noformat} > This appears to be due to the fact that the build.gradle determineCommitId() > function is unable to read the git commit hash for the current HEAD. This > appears to happen after a `git gc` takes place, which causes the > `.git/refs/heads/*` files to be moved to `.git/packed-refs`. > The determineCommitId() should be patched to also try reading from the > packed-refs to determine the commit hash. -- This message was sent by Atlassian Jira (v8.20.10#820010)
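The suggested patch — falling back to `.git/packed-refs` when the loose ref file under `.git/refs/heads/` is absent — might look roughly like this. This is a standalone sketch, not the actual build.gradle determineCommitId() logic:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CommitIdSketch {
    /** Resolves the commit hash for a branch, trying the loose ref file
     *  first and falling back to packed-refs (where `git gc` moves refs). */
    public static String commitId(Path gitDir, String branch) throws IOException {
        Path looseRef = gitDir.resolve("refs/heads/" + branch);
        if (Files.exists(looseRef)) {
            return Files.readString(looseRef).trim();
        }
        Path packedRefs = gitDir.resolve("packed-refs");
        if (Files.exists(packedRefs)) {
            for (String line : Files.readAllLines(packedRefs)) {
                // Lines look like "<hash> refs/heads/<branch>"; '#' lines are comments.
                if (!line.startsWith("#") && line.endsWith(" refs/heads/" + branch)) {
                    return line.split(" ")[0];
                }
            }
        }
        return "unknown";
    }
}
```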
[jira] [Commented] (KAFKA-10339) MirrorMaker2 Exactly-once Semantics
[ https://issues.apache.org/jira/browse/KAFKA-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776850#comment-17776850 ] Greg Harris commented on KAFKA-10339: - [~funkerman] Hello! In your example, no, because auto-synced offsets are at-least-once. If Consumer A reads and commits A, B, C, they may be re-read by Consumer B. The guarantee that "MM2 EOS" provides is that Consumer B will read mirrored messages at most once, regardless of what Consumer A does. > MirrorMaker2 Exactly-once Semantics > --- > > Key: KAFKA-10339 > URL: https://issues.apache.org/jira/browse/KAFKA-10339 > Project: Kafka > Issue Type: New Feature > Components: KafkaConnect, mirrormaker >Reporter: Ning Zhang >Assignee: Ning Zhang >Priority: Major > Labels: needs-kip > Fix For: 3.5.0 > > Attachments: image-2023-10-17-15-33-57-983.png > > > MirrorMaker2 is currently implemented on Kafka Connect Framework, more > specifically the Source Connector / Task, which do not provide exactly-once > semantics (EOS) out-of-the-box, as discussed in > https://github.com/confluentinc/kafka-connect-jdbc/issues/461, > https://github.com/apache/kafka/pull/5553, > https://issues.apache.org/jira/browse/KAFKA-6080 and > https://issues.apache.org/jira/browse/KAFKA-3821. Therefore MirrorMaker2 > currently does not provide EOS. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15611) Use virtual threads in the Connect framework
Greg Harris created KAFKA-15611: --- Summary: Use virtual threads in the Connect framework Key: KAFKA-15611 URL: https://issues.apache.org/jira/browse/KAFKA-15611 Project: Kafka Issue Type: Improvement Components: KafkaConnect Reporter: Greg Harris Virtual threads have been finalized in JDK21, so we may include optional support for them in Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15575) Prevent Connectors from exceeding tasks.max configuration
Greg Harris created KAFKA-15575: --- Summary: Prevent Connectors from exceeding tasks.max configuration Key: KAFKA-15575 URL: https://issues.apache.org/jira/browse/KAFKA-15575 Project: Kafka Issue Type: Task Components: KafkaConnect Reporter: Greg Harris The Connector::taskConfigs(int maxTasks) function is used by Connectors to enumerate task configurations. This takes an argument which comes from the tasks.max connector config. This is the Javadoc for that method: {noformat} /** * Returns a set of configurations for Tasks based on the current configuration, * producing at most {@code maxTasks} configurations. * * @param maxTasks maximum number of configurations to generate * @return configurations for Tasks */ public abstract List<Map<String, String>> taskConfigs(int maxTasks); {noformat} This includes the constraint that the number of tasks is at most maxTasks, but this constraint is not enforced by the framework. We should begin enforcing this constraint by dropping configs that exceed the limit, and logging a warning. For sink connectors this should harmlessly rebalance the consumer subscriptions onto the remaining tasks. For source connectors that distribute their work via task configs, this may result in an interruption in data transfer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
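The proposed enforcement — drop task configs beyond the tasks.max limit and log a warning — is simple in spirit. A hypothetical sketch (not the actual Connect framework code):

```java
import java.util.ArrayList;
import java.util.List;

public class TasksMaxSketch {
    /** Truncates the connector-generated task configs to at most maxTasks,
     *  mirroring the enforcement described in the ticket. */
    public static <T> List<T> enforceTasksMax(List<T> taskConfigs, int maxTasks) {
        if (taskConfigs.size() <= maxTasks) {
            return taskConfigs;
        }
        System.err.printf(
            "Connector generated %d task configs, but tasks.max is %d; dropping the excess%n",
            taskConfigs.size(), maxTasks);
        return new ArrayList<>(taskConfigs.subList(0, maxTasks));
    }
}
```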
[jira] [Commented] (KAFKA-15564) Kafka 3.5.1- Mirror Maker 2 replicating the wrong consumer group offsets in destination
[ https://issues.apache.org/jira/browse/KAFKA-15564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772847#comment-17772847 ] Greg Harris commented on KAFKA-15564: - Hi [~hemanthsavasere] the offsets translation behavior was changed in KAFKA-12468 to avoid a data loss bug. You will need to lower the `offset.lag.max` if you want more precise consumer offset translation at the end of the topic. > Kafka 3.5.1- Mirror Maker 2 replicating the wrong consumer group offsets in > destination > > > Key: KAFKA-15564 > URL: https://issues.apache.org/jira/browse/KAFKA-15564 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Affects Versions: 3.5.1 >Reporter: Hemanth Savasere >Priority: Major > Attachments: Destination Consumer Group Offsets Describe.txt, MM2 > 3.5.1 Logs.txt, Source Consumer Group Offsets Describe.txt, > docker-compose.yml, mm2.properties > > > Issue : Mirror Maker 2 not replicating the proper consumer group offsets when > replication happens from source to destination kafka brokers > Steps to Reproduce : > # Created 2 Kafka clusters locally using the attached docker-compose.yml > file. Was using the confluent platform 7.2.1 docker images > # Had added the below section in docker file to create 20 topics and produce > randomly between 1000 to 2000 messages in each topic, and then consume the > same messages using consumer-groups. 
> {code:java} > /etc/confluent/docker/run & > sleep 20 > for i in {1..20}; do > kafka-topics --create --bootstrap-server kafka-1:9092 > --replication-factor 1 --partitions 1 --topic topic$$i > done > for i in {1..20}; do > num_msgs=$$(shuf -i 1000-2000 -n 1) > seq 1 $$num_msgs | kafka-console-producer --broker-list > kafka-1:9092 --topic topic$$i > done > for i in {1..20}; do > timeout 10 kafka-console-consumer --bootstrap-server kafka-1:9092 > --topic topic$$i --group consumer-group$$i > done > wait > {code} > # Ran the Mirror Maker 2 using the connect-mirror-maker.sh script after > downloading the Kafka 3.5.1 release binaries. Verified the 3.5.1 version was > running and commitID 2c6fb6c54472e90a was shown in attached MM2 logs file. > NOTE : Have attached the Kafka Docker file, the mirror maker 2 logs and > mirror maker 2 properties file. Also, have attached the describe command > output of all the consumer groups present in source and destination. -- This message was sent by Atlassian Jira (v8.20.10#820010)
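Concretely, the workaround described in the comment above could look like the following mm2.properties fragment. The flow name is a placeholder; offset.lag.max defaults to 100, and lowering it makes offset syncs (and therefore checkpoint translation) more precise at the cost of more traffic on the offset-syncs topic:

```properties
# Illustrative fragment; "A" and "B" are placeholder cluster aliases.
A->B.enabled = true
A->B.offset.lag.max = 0
# Periodically apply translated offsets to consumer groups on the target:
A->B.sync.group.offsets.enabled = true
```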
[jira] [Created] (KAFKA-15559) KIP-987: Connect Static Assignments
Greg Harris created KAFKA-15559: --- Summary: KIP-987: Connect Static Assignments Key: KAFKA-15559 URL: https://issues.apache.org/jira/browse/KAFKA-15559 Project: Kafka Issue Type: New Feature Components: KafkaConnect Reporter: Greg Harris Assignee: Greg Harris https://cwiki.apache.org/confluence/display/KAFKA/KIP-987%3A+Connect+Static+Assignments -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15528) KIP-986: Cross-Cluster Replication
Greg Harris created KAFKA-15528: --- Summary: KIP-986: Cross-Cluster Replication Key: KAFKA-15528 URL: https://issues.apache.org/jira/browse/KAFKA-15528 Project: Kafka Issue Type: New Feature Reporter: Greg Harris https://cwiki.apache.org/confluence/display/KAFKA/KIP-986%3A+Cross-Cluster+Replication -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-10339) MirrorMaker2 Exactly-once Semantics
[ https://issues.apache.org/jira/browse/KAFKA-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767334#comment-17767334 ] Greg Harris commented on KAFKA-10339: - Please note that this KAFKA ticket only applies to data replication being exactly once. This means that if you replicate data between two topics, and consume from the target topic from the beginning with read_committed, you will read each record in the source topic exactly once. It does _not_ apply to the MirrorCheckpointTask, offset translation, or consumer group auto sync. These processes are _not_ exactly once, they are at-least-once. This means that if you commit offsets on your upstream topic and then start consuming from the target topic with the translated offsets, you may read the same record more than once. You should read every record at least once. > MirrorMaker2 Exactly-once Semantics > --- > > Key: KAFKA-10339 > URL: https://issues.apache.org/jira/browse/KAFKA-10339 > Project: Kafka > Issue Type: New Feature > Components: KafkaConnect, mirrormaker >Reporter: Ning Zhang >Assignee: Ning Zhang >Priority: Major > Labels: needs-kip > Fix For: 3.5.0 > > > MirrorMaker2 is currently implemented on Kafka Connect Framework, more > specifically the Source Connector / Task, which do not provide exactly-once > semantics (EOS) out-of-the-box, as discussed in > https://github.com/confluentinc/kafka-connect-jdbc/issues/461, > https://github.com/apache/kafka/pull/5553, > https://issues.apache.org/jira/browse/KAFKA-6080 and > https://issues.apache.org/jira/browse/KAFKA-3821. Therefore MirrorMaker2 > currently does not provide EOS. -- This message was sent by Atlassian Jira (v8.20.10#820010)
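The read_committed consumption described above is a consumer-side setting. A minimal illustration using plain string config keys (the bootstrap address and group id are placeholders):

```java
import java.util.Properties;

public class ReadCommittedSketch {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "target-cluster:9092"); // placeholder
        props.put("group.id", "verify-mirrored-topic");        // placeholder
        props.put("auto.offset.reset", "earliest");            // read from the beginning
        // Only records from committed transactions are visible, which is what
        // makes mirrored data appear exactly once under MM2 EOS:
        props.put("isolation.level", "read_committed");
        return props;
    }
}
```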
[jira] [Updated] (KAFKA-15473) Connect connector-plugins endpoint shows duplicate plugins
[ https://issues.apache.org/jira/browse/KAFKA-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15473: Description: In <3.6.0-rc0, duplicates of a plugin would be shown if it subclassed multiple interfaces. For example: {noformat} { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" },{noformat} In 3.6.0-rc0, there are many more listings for the same plugin. For example: {noformat} { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter", "version": "3.6.0" },{noformat} These duplicates appear to happen when a plugin with the same class name appears in multiple locations/classloaders. When interpreting a connector configuration, only one of these plugins will be chosen, so only one is relevant to show to users. The REST API should only display the plugins which are eligible to be loaded, and hide the duplicates. was: In <3.6.0-rc0, only one copy of each plugin would be shown. For example: {noformat} { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" },{noformat} In 3.6.0-rc0, there are multiple listings for the same plugin. 
For example: {noformat} { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter", "version": "3.6.0" },{noformat} These duplicates appear to happen when a plugin with the same class name appears in multiple locations/classloaders. When interpreting a connector configuration, only one of these plugins will be chosen, so only one is relevant to show to users. The REST API should only display the plugins which are eligible to be loaded, and hide the duplicates. > Connect connector-plugins endpoint shows duplicate plugins > -- > > Key: KAFKA-15473 > URL: https://issues.apache.org/jira/browse/KAFKA-15473 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > Fix For: 3.6.0 > > > In <3.6.0-rc0, duplicates of a plugin would be shown if it subclassed > multiple interfaces. For example: > {noformat} > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > },{noformat} > In 3.6.0-rc0, there are many more listings for the same plugin. 
For example: > {noformat} > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter", > "version": "3.6.0" > },{noformat} > These duplicates appear to happen when a plugin with the same class name > appears in multiple locations/classloaders. > When interpreting a connector configuration, only one of these plugins will > be chosen, so only one is relevant to show to users. The REST API should only > display the plugins which are eligible to be loaded, and hide the duplicates. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15473) Connect connector-plugins endpoint shows duplicate plugins
[ https://issues.apache.org/jira/browse/KAFKA-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766595#comment-17766595 ] Greg Harris commented on KAFKA-15473: - I've opened [https://github.com/apache/kafka/pull/14398] with strategy (3) from above. We can always implement (1) in the future and change the PluginInfo::equals implementation to show these duplicates, so we can hide them for now. I think (2) removes functionality from the API and would count as a regression itself. > Connect connector-plugins endpoint shows duplicate plugins > -- > > Key: KAFKA-15473 > URL: https://issues.apache.org/jira/browse/KAFKA-15473 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > Fix For: 3.6.0 > > > In <3.6.0-rc0, only one copy of each plugin would be shown. For example: > {noformat} > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > },{noformat} > In 3.6.0-rc0, there are multiple listings for the same plugin. For example: > > {noformat} > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter", > "version": "3.6.0" > },{noformat} > These duplicates appear to happen when a plugin with the same class name > appears in multiple locations/classloaders. > When interpreting a connector configuration, only one of these plugins will > be chosen, so only one is relevant to show to users. 
The REST API should only > display the plugins which are eligible to be loaded, and hide the duplicates.
[jira] [Commented] (KAFKA-15473) Connect connector-plugins endpoint shows duplicate plugins
[ https://issues.apache.org/jira/browse/KAFKA-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766592#comment-17766592 ] Greg Harris commented on KAFKA-15473: - Plugins could also appear multiple times <3.6.0-rc0 if multiple versions were on the plugin path concurrently. The DelegatingClassLoader would prefer the one with the latest version, but all of the different versions would be visible in the REST API. It also treated the undefined version as distinct from defined versions. Since 3.6.0 is also the first version with KAFKA-15291 and there are some public plugins which package the AK transforms, there are now going to be both un-versioned and versioned copies of the transforms classes. I think there are multiple ways to address this: 1. Show the source location of the plugins in the REST API to distinguish between apparently equivalent entries 2. Use the DelegatingClassLoader logic for choosing among multiple similarly-named plugins to only show the entry which the DelegatingClassLoader would use 3. Deduplicate the PluginInfos to avoid obvious repetition, but allow multiple versions to still be shown. (1) would require a KIP, and might be scope creep for the REST API. Theoretically the API client shouldn't care about the on-disk layout of plugins. (2) hides the duplicates introduced by KAFKA-15244 and KAFKA-15291, but also hides all but the latest version of each plugin. If someone has multiple versions of a plugin installed, they can currently diagnose that via REST API. After implementing solution (2), they would need to look at the worker logs. (3) hides the duplicates introduced by KAFKA-15244, but leaves the duplicates introduced by KAFKA-15291, and allows users to diagnose when multiple versions are installed. I'm not sure which of these to implement. 
In <3.6.0-rc0 it was not possible to diagnose installations with multiple copies of the same plugin when they had the same (or undefined) version, and in 3.6.0-rc0 that is now possible. If we implement solution (2) or (3), we would take away that capability. We might just be able to leave this as-is, and try to implement some form of solution (1) to make these duplicates more descriptive. > Connect connector-plugins endpoint shows duplicate plugins > -- > > Key: KAFKA-15473 > URL: https://issues.apache.org/jira/browse/KAFKA-15473 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > Fix For: 3.6.0 > > > In <3.6.0-rc0, only one copy of each plugin would be shown. For example: > {noformat} > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > },{noformat} > In 3.6.0-rc0, there are multiple listings for the same plugin. For example: > > {noformat} > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter", > "version": "3.6.0" > },{noformat} > These duplicates appear to happen when a plugin with the same class name > appears in multiple locations/classloaders. > When interpreting a connector configuration, only one of these plugins will > be chosen, so only one is relevant to show to users. The REST API should only > display the plugins which are eligible to be loaded, and hide the duplicates. 
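Strategy (3) above amounts to deduplicating the REST listing by the (class, type, version) triple, so byte-identical entries collapse while distinct versions stay visible. A toy sketch, assuming a stand-in PluginInfo record rather than Connect's actual REST entity:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class PluginInfoDedup {
    // Stand-in for Connect's REST entity; not the real class. A record's
    // generated equals() compares all components null-safely, so entries with
    // an undefined (null) version compare equal to each other but not to a
    // versioned entry for the same class.
    record PluginInfo(String className, String type, String version) {}

    static List<PluginInfo> dedupe(List<PluginInfo> scanned) {
        // LinkedHashSet drops equals()-duplicates while preserving scan order.
        return new ArrayList<>(new LinkedHashSet<>(scanned));
    }
}
```

Under this rule the six StringConverter listings in the description would collapse to two: one un-versioned entry and one "3.6.0" entry, which is exactly the "allow multiple versions to still be shown" behavior.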
[jira] [Commented] (KAFKA-15473) Connect connector-plugins endpoint shows duplicate plugins
[ https://issues.apache.org/jira/browse/KAFKA-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766578#comment-17766578 ] Greg Harris commented on KAFKA-15473: - it appears that the bug which prompted the fix in KAFKA-15244 (wrong PluginType being inferred) also could cause duplicates. For example: {noformat} { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" },{noformat} Here, the second entry should have been "header_converter". So while there are more duplicates in 3.6.0-rc0 than there were in <3.6.0-rc0, the presence of duplicates is not new. We wouldn't be breaking any third-party clients, because they would already have to handle these duplicates. > Connect connector-plugins endpoint shows duplicate plugins > -- > > Key: KAFKA-15473 > URL: https://issues.apache.org/jira/browse/KAFKA-15473 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Blocker > Fix For: 3.6.0 > > > In <3.6.0-rc0, only one copy of each plugin would be shown. For example: > {noformat} > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > },{noformat} > In 3.6.0-rc0, there are multiple listings for the same plugin. 
For example: > > {noformat} > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter", > "version": "3.6.0" > },{noformat} > These duplicates appear to happen when a plugin with the same class name > appears in multiple locations/classloaders. > When interpreting a connector configuration, only one of these plugins will > be chosen, so only one is relevant to show to users. The REST API should only > display the plugins which are eligible to be loaded, and hide the duplicates. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15473) Connect connector-plugins endpoint shows duplicate plugins
[ https://issues.apache.org/jira/browse/KAFKA-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15473: Priority: Major (was: Blocker) > Connect connector-plugins endpoint shows duplicate plugins > -- > > Key: KAFKA-15473 > URL: https://issues.apache.org/jira/browse/KAFKA-15473 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > Fix For: 3.6.0 > > > In <3.6.0-rc0, only one copy of each plugin would be shown. For example: > {noformat} > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > },{noformat} > In 3.6.0-rc0, there are multiple listings for the same plugin. For example: > > {noformat} > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter" > }, > { > "class": "org.apache.kafka.connect.storage.StringConverter", > "type": "converter", > "version": "3.6.0" > },{noformat} > These duplicates appear to happen when a plugin with the same class name > appears in multiple locations/classloaders. > When interpreting a connector configuration, only one of these plugins will > be chosen, so only one is relevant to show to users. The REST API should only > display the plugins which are eligible to be loaded, and hide the duplicates. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15473) Connect connector-plugins endpoint shows duplicate plugins
Greg Harris created KAFKA-15473: --- Summary: Connect connector-plugins endpoint shows duplicate plugins Key: KAFKA-15473 URL: https://issues.apache.org/jira/browse/KAFKA-15473 Project: Kafka Issue Type: Bug Components: KafkaConnect Affects Versions: 3.6.0 Reporter: Greg Harris Assignee: Greg Harris Fix For: 3.6.0 In <3.6.0-rc0, only one copy of each plugin would be shown. For example: {noformat} { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" },{noformat} In 3.6.0-rc0, there are multiple listings for the same plugin. For example: {noformat} { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter" }, { "class": "org.apache.kafka.connect.storage.StringConverter", "type": "converter", "version": "3.6.0" },{noformat} These duplicates appear to happen when a plugin with the same class name appears in multiple locations/classloaders. When interpreting a connector configuration, only one of these plugins will be chosen, so only one is relevant to show to users. The REST API should only display the plugins which are eligible to be loaded, and hide the duplicates. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14595) Move ReassignPartitionsCommand to tools
[ https://issues.apache.org/jira/browse/KAFKA-14595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762822#comment-17762822 ] Greg Harris commented on KAFKA-14595: - I merged [https://github.com/apache/kafka/pull/14217] but I am leaving this open as the other PRs have not landed yet. > Move ReassignPartitionsCommand to tools > --- > > Key: KAFKA-14595 > URL: https://issues.apache.org/jira/browse/KAFKA-14595 > Project: Kafka > Issue Type: Sub-task >Reporter: Mickael Maison >Assignee: Nikolay Izhikov >Priority: Major >
[jira] [Updated] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15372: Fix Version/s: (was: 3.6.0) > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Assignee: Greg Harris >Priority: Major > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
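The four-step scenario in the description reduces to a toy model: each restarted instance makes exactly one attempt to push the new connector configuration, and the push only succeeds on the current group leader. With an unlucky leadership sequence, no attempt ever lands on the leader and the change is silently dropped. Instance names and the leadership sequence below are illustrative:

```java
public class RollingRestartSketch {
    // Toy model of the MM2 rolling-restart race, not MirrorMaker code.
    static String appliedConfig = "old";
    static String leader = "instance-2"; // leader after instance-1 restarts

    // A restarting instance pushes the new config once; only the leader's
    // herder accepts it. A non-leader's attempt fails silently with no retry.
    static void restart(String instance, String newConfig) {
        if (instance.equals(leader)) {
            appliedConfig = newConfig;
        }
    }
}
```

Replaying the ticket's sequence (instance 1 restarts while instance 2 leads, leadership then moves to instance 1 before instance 2 restarts) leaves appliedConfig at "old", which is the silent drop being reported.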
[jira] [Commented] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761381#comment-17761381 ] Greg Harris commented on KAFKA-15372: - I implemented the REST-less fix, but it's complex enough that I don't feel comfortable including it in the upcoming release. > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Assignee: Greg Harris >Priority: Major > Fix For: 3.6.0 > > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris reassigned KAFKA-15372: --- Assignee: Greg Harris > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Assignee: Greg Harris >Priority: Major > Fix For: 3.6.0 > > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15211) DistributedConfigTest#shouldFailWithInvalidKeySize fails when run after TestSslUtils#generate
[ https://issues.apache.org/jira/browse/KAFKA-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15211: Fix Version/s: 3.4.2 3.5.2 > DistributedConfigTest#shouldFailWithInvalidKeySize fails when run after > TestSslUtils#generate > - > > Key: KAFKA-15211 > URL: https://issues.apache.org/jira/browse/KAFKA-15211 > Project: Kafka > Issue Type: Test > Components: clients, KafkaConnect >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > Fix For: 3.6.0, 3.4.2, 3.5.2 > > > The DistributedConfigTest#shouldFailWithInvalidKeySize attempts to configure > a hashing algorithm with a key size of 0. When run alone, this test passes, > as the default Java hashing algorithm used rejects the key size. > However, when TestSslUtils#generate runs first, such as via the > RestForwardingIntegrationTest, the BouncyCastleProvider is loaded, which > provides an alternative hashing algorithm. This implementation does _not_ > reject the key size, causing the test to fail. > We should ether prevent TestSslUtils#generate from leaving the > BouncyCastleProvider loaded after use, or adjust the test to pass when the > BouncyCastleProvider is loaded. -- This message was sent by Atlassian Jira (v8.20.10#820010)
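The cross-test interference described above comes from JCA provider preference order: getInstance returns the implementation from the first registered provider that advertises the algorithm, so once another test registers BouncyCastle, the implementation (and its key-size validation) can silently change for every later test in the same JVM. A small stdlib-only sketch of that ordering rule:

```java
import java.security.NoSuchAlgorithmException;
import java.security.Provider;
import java.security.Security;
import javax.crypto.Mac;

public class ProviderOrdering {
    // Which provider Mac.getInstance would actually use for this algorithm.
    public static String providerFor(String algorithm) {
        try {
            return Mac.getInstance(algorithm).getProvider().getName();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("no provider for " + algorithm, e);
        }
    }

    // All providers advertising a Mac implementation for this algorithm,
    // in registration (i.e. preference) order. Security.addProvider appends,
    // so an extra provider registered by an earlier test joins this list.
    public static Provider[] candidates(String algorithm) {
        return Security.getProviders("Mac." + algorithm);
    }
}
```

getInstance always resolves to the first entry of candidates(), which is why either removing the extra provider after TestSslUtils#generate or making the assertion provider-agnostic fixes the flake.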
[jira] [Updated] (KAFKA-15377) GET /connectors/{connector}/tasks-config endpoint exposes externalized secret values
[ https://issues.apache.org/jira/browse/KAFKA-15377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15377: Fix Version/s: 3.6.0 3.4.2 3.5.2 > GET /connectors/{connector}/tasks-config endpoint exposes externalized secret > values > > > Key: KAFKA-15377 > URL: https://issues.apache.org/jira/browse/KAFKA-15377 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Reporter: Yash Mayya >Assignee: Yash Mayya >Priority: Major > Fix For: 3.6.0, 3.4.2, 3.5.2 > > > The {{GET /connectors/\{connector}/tasks-config}} endpoint added in > [https://cwiki.apache.org/confluence/display/KAFKA/KIP-661%3A+Expose+task+configurations+in+Connect+REST+API] > exposes externalized secret values in task configurations (see > [https://cwiki.apache.org/confluence/display/KAFKA/KIP-297%3A+Externalizing+Secrets+for+Connect+Configurations)]. > A similar bug was fixed in https://issues.apache.org/jira/browse/KAFKA-5117 > / [https://github.com/apache/kafka/pull/6129] for the {{GET > /connectors/\{connector}/tasks}} endpoint. The config provider placeholder > should be used instead of the resolved config value. -- This message was sent by Atlassian Jira (v8.20.10#820010)
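The fix described above is for the endpoint to return the KIP-297 config-provider placeholder rather than the resolved secret. A minimal sketch of the two shapes involved (the property name and secret value are invented for illustration; only the `${provider:path:key}` placeholder syntax comes from KIP-297):

```java
import java.util.Map;

public class TaskConfigRedaction {
    // What the user submitted, with the placeholder intact.
    // This is the form that is safe to expose over the REST API.
    public static Map<String, String> rawConfig() {
        return Map.of("connection.password", "${file:/etc/secrets.properties:db.password}");
    }

    // What the worker uses internally after ConfigProvider resolution.
    // This form must never leak through the tasks-config endpoint.
    public static Map<String, String> resolvedConfig() {
        return Map.of("connection.password", "hunter2"); // hypothetical secret
    }
}
```

The earlier KAFKA-5117 fix applied the same raw-over-resolved rule to GET /connectors/{connector}/tasks; this ticket extends it to tasks-config.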
[jira] [Commented] (KAFKA-14901) Flaky test ExactlyOnceSourceIntegrationTest.testConnectorReconfiguration
[ https://issues.apache.org/jira/browse/KAFKA-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758769#comment-17758769 ] Greg Harris commented on KAFKA-14901: - This flakiness is no longer present on trunk. I believe that another bug fix resolved the flakiness but I don't know which. > Flaky test ExactlyOnceSourceIntegrationTest.testConnectorReconfiguration > > > Key: KAFKA-14901 > URL: https://issues.apache.org/jira/browse/KAFKA-14901 > Project: Kafka > Issue Type: Test > Components: KafkaConnect >Reporter: Greg Harris >Priority: Major > Labels: flaky-test > Fix For: 3.6.0 > > Attachments: transaction-flake-trace-0.out, transaction-flake.out > > > The EOS Source test appears to be very rarely failing (<5% chance) with the > following error: > {noformat} > org.apache.kafka.common.KafkaException: Unexpected error in > InitProducerIdResponse; The server experienced an unexpected error when > processing the request. > at > app//org.apache.kafka.clients.producer.internals.TransactionManager$InitProducerIdHandler.handleResponse(TransactionManager.java:1303) > at > app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1207) > at > app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154) > at > app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:594) > at > app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:586) > at > app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:426) > at > app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:316) > at > app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:243) > at java.base@11.0.16.1/java.lang.Thread.run(Thread.java:829){noformat} > which appears to be triggered by the following failure inside the broker: > {noformat} > [2023-04-12 14:01:38,931] ERROR [KafkaApi-0] 
Unexpected error handling > request RequestHeader(apiKey=INIT_PRODUCER_ID, apiVersion=4, > clientId=simulated-task-producer-exactlyOnceQuestionMark-1, correlationId=5, > headerVersion=2) -- > InitProducerIdRequestData(transactionalId='exactly-once-source-integration-test-exactlyOnceQuestionMark-1', > transactionTimeoutMs=6, producerId=-1, producerEpoch=-1) with context > RequestContext(header=RequestHeader(apiKey=INIT_PRODUCER_ID, apiVersion=4, > clientId=simulated-task-producer-exactlyOnceQuestionMark-1, correlationId=5, > headerVersion=2), connectionId='127.0.0.1:54213-127.0.0.1:54367-46', > clientAddress=/127.0.0.1, principal=User:ANONYMOUS, > listenerName=ListenerName(PLAINTEXT), securityProtocol=PLAINTEXT, > clientInformation=ClientInformation(softwareName=apache-kafka-java, > softwareVersion=3.5.0-SNAPSHOT), fromPrivilegedListener=true, > principalSerde=Optional[org.apache.kafka.common.security.authenticator.DefaultKafkaPrincipalBuilder@615924cd]) > (kafka.server.KafkaApis:76) > java.lang.IllegalStateException: Preparing transaction state transition to > Empty while it already a pending state Ongoing > at > kafka.coordinator.transaction.TransactionMetadata.prepareTransitionTo(TransactionMetadata.scala:380) > at > kafka.coordinator.transaction.TransactionMetadata.prepareIncrementProducerEpoch(TransactionMetadata.scala:311) > at > kafka.coordinator.transaction.TransactionCoordinator.prepareInitProducerIdTransit(TransactionCoordinator.scala:240) > at > kafka.coordinator.transaction.TransactionCoordinator.$anonfun$handleInitProducerId$3(TransactionCoordinator.scala:151) > at > kafka.coordinator.transaction.TransactionMetadata.inLock(TransactionMetadata.scala:242) > at > kafka.coordinator.transaction.TransactionCoordinator.$anonfun$handleInitProducerId$2(TransactionCoordinator.scala:150) > at scala.util.Either.flatMap(Either.scala:352) > at > kafka.coordinator.transaction.TransactionCoordinator.handleInitProducerId(TransactionCoordinator.scala:145) > at > 
kafka.server.KafkaApis.handleInitProducerIdRequest(KafkaApis.scala:2236) > at kafka.server.KafkaApis.handle(KafkaApis.scala:202) > at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:76) > at java.base/java.lang.Thread.run(Thread.java:829){noformat}
[jira] [Resolved] (KAFKA-14901) Flaky test ExactlyOnceSourceIntegrationTest.testConnectorReconfiguration
[ https://issues.apache.org/jira/browse/KAFKA-14901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris resolved KAFKA-14901. - Resolution: Cannot Reproduce > Flaky test ExactlyOnceSourceIntegrationTest.testConnectorReconfiguration > > > Key: KAFKA-14901 > URL: https://issues.apache.org/jira/browse/KAFKA-14901 > Project: Kafka > Issue Type: Test > Components: KafkaConnect >Reporter: Greg Harris >Priority: Major > Labels: flaky-test > Fix For: 3.6.0 > > Attachments: transaction-flake-trace-0.out, transaction-flake.out > > > The EOS Source test appears to be very rarely failing (<5% chance) with the > following error: > {noformat} > org.apache.kafka.common.KafkaException: Unexpected error in > InitProducerIdResponse; The server experienced an unexpected error when > processing the request. > at > app//org.apache.kafka.clients.producer.internals.TransactionManager$InitProducerIdHandler.handleResponse(TransactionManager.java:1303) > at > app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1207) > at > app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154) > at > app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:594) > at > app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:586) > at > app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:426) > at > app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:316) > at > app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:243) > at java.base@11.0.16.1/java.lang.Thread.run(Thread.java:829){noformat} > which appears to be triggered by the following failure inside the broker: > {noformat} > [2023-04-12 14:01:38,931] ERROR [KafkaApi-0] Unexpected error handling > request RequestHeader(apiKey=INIT_PRODUCER_ID, apiVersion=4, > 
clientId=simulated-task-producer-exactlyOnceQuestionMark-1, correlationId=5, > headerVersion=2) -- > InitProducerIdRequestData(transactionalId='exactly-once-source-integration-test-exactlyOnceQuestionMark-1', > transactionTimeoutMs=6, producerId=-1, producerEpoch=-1) with context > RequestContext(header=RequestHeader(apiKey=INIT_PRODUCER_ID, apiVersion=4, > clientId=simulated-task-producer-exactlyOnceQuestionMark-1, correlationId=5, > headerVersion=2), connectionId='127.0.0.1:54213-127.0.0.1:54367-46', > clientAddress=/127.0.0.1, principal=User:ANONYMOUS, > listenerName=ListenerName(PLAINTEXT), securityProtocol=PLAINTEXT, > clientInformation=ClientInformation(softwareName=apache-kafka-java, > softwareVersion=3.5.0-SNAPSHOT), fromPrivilegedListener=true, > principalSerde=Optional[org.apache.kafka.common.security.authenticator.DefaultKafkaPrincipalBuilder@615924cd]) > (kafka.server.KafkaApis:76) > java.lang.IllegalStateException: Preparing transaction state transition to > Empty while it already a pending state Ongoing > at > kafka.coordinator.transaction.TransactionMetadata.prepareTransitionTo(TransactionMetadata.scala:380) > at > kafka.coordinator.transaction.TransactionMetadata.prepareIncrementProducerEpoch(TransactionMetadata.scala:311) > at > kafka.coordinator.transaction.TransactionCoordinator.prepareInitProducerIdTransit(TransactionCoordinator.scala:240) > at > kafka.coordinator.transaction.TransactionCoordinator.$anonfun$handleInitProducerId$3(TransactionCoordinator.scala:151) > at > kafka.coordinator.transaction.TransactionMetadata.inLock(TransactionMetadata.scala:242) > at > kafka.coordinator.transaction.TransactionCoordinator.$anonfun$handleInitProducerId$2(TransactionCoordinator.scala:150) > at scala.util.Either.flatMap(Either.scala:352) > at > kafka.coordinator.transaction.TransactionCoordinator.handleInitProducerId(TransactionCoordinator.scala:145) > at > kafka.server.KafkaApis.handleInitProducerIdRequest(KafkaApis.scala:2236) > at 
kafka.server.KafkaApis.handle(KafkaApis.scala:202) > at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:76) > at java.base/java.lang.Thread.run(Thread.java:829){noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15197) Flaky test MirrorConnectorsIntegrationExactlyOnceTest.testOffsetTranslationBehindReplicationFlow()
[ https://issues.apache.org/jira/browse/KAFKA-15197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758768#comment-17758768 ] Greg Harris commented on KAFKA-15197: - We merged KAFKA-15202 and the flakiness decreased substantially, with only 2 failures in the past week. I'll continue to monitor this but it shouldn't be a blocker for the upcoming release. > Flaky test > MirrorConnectorsIntegrationExactlyOnceTest.testOffsetTranslationBehindReplicationFlow() > -- > > Key: KAFKA-15197 > URL: https://issues.apache.org/jira/browse/KAFKA-15197 > Project: Kafka > Issue Type: Test > Components: mirrormaker >Reporter: Divij Vaidya >Priority: Major > Labels: flaky-test > Fix For: 3.6.0 > > > As of Jul 17th, this is the second most flaky test in our CI on trunk and > fails 46% of the time. > See: > [https://ge.apache.org/scans/tests?search.relativeStartTime=P28D=kafka=Europe/Berlin] > > Note that MirrorConnectorsIntegrationExactlyOnceTest has multiple tests but > testOffsetTranslationBehindReplicationFlow is the one responsible for > most failures. See: > [https://ge.apache.org/scans/tests?search.relativeStartTime=P28D=kafka=Europe/Berlin=org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationExactlyOnceTest] > > > Reason for failure is: > |org.opentest4j.AssertionFailedError: Condition not met within timeout 2. > Offsets for consumer group consumer-group-lagging-behind not translated from > primary for topic primary.test-topic-1 ==> expected: but was: | -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15398) Document Connect threat model
Greg Harris created KAFKA-15398: --- Summary: Document Connect threat model Key: KAFKA-15398 URL: https://issues.apache.org/jira/browse/KAFKA-15398 Project: Kafka Issue Type: Task Components: KafkaConnect Reporter: Greg Harris Kafka Connect is a plugin framework, regularly requiring the installation of third-party code. This poses a security hazard for operators, who may be compromised by actively malicious plugins or well-intentioned plugins which are exploitable. We should document the threat model that the Connect architecture uses, and make it clear to operators what trust and verification is required in order to operate Connect safely. At a high level, this documentation may include: # Plugins are arbitrary code with unrestricted access to the filesystem, secrets, and network resources of the hosting Connect worker # The filesystem of the worker is trusted # Connector configurations passed via REST API are trusted # Plugins may have exploits triggered by certain configurations, or by external connections. # Exploits may also be present in plugins/drivers/dependencies used by Connect plugins, such as JDBC drivers # The default installation without REST API security is exploitable when run on an untrusted network. Documenting this security model will also make it easier to discuss changing the model and improving the security architecture of Connect in the future. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15393) MirrorMaker2 integration tests are shutting down uncleanly
Greg Harris created KAFKA-15393: --- Summary: MirrorMaker2 integration tests are shutting down uncleanly Key: KAFKA-15393 URL: https://issues.apache.org/jira/browse/KAFKA-15393 Project: Kafka Issue Type: Test Components: mirrormaker Reporter: Greg Harris Assignee: Greg Harris The MirrorConnectorsBaseIntegrationTest and its derived test classes often shut down uncleanly. An unclean shutdown takes longer than a clean shutdown, because the shutdown must wait for timeouts to elapse. It appears that during the teardown, the MirrorConnectorsBaseIntegrationTest deletes all of the topics on the EmbeddedKafkaCluster. This causes extremely poor behavior for the connect worker (which is unable to access internal topics) and for the connectors (which get stuck on their final offset commit refreshing metadata and then get cancelled). The EmbeddedKafkaCluster is not reused for the next test, so cleaning the topics is unnecessary. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15292) Flaky test IdentityReplicationIntegrationTest#testReplicateSourceDefault()
[ https://issues.apache.org/jira/browse/KAFKA-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757734#comment-17757734 ] Greg Harris commented on KAFKA-15292: - The listed exception is shadowing the real exception, which is probably a shutdown timeout. I've opened KAFKA-15392 to resolve the exception shadowing which should hopefully make it easier to diagnose this flakiness. > Flaky test IdentityReplicationIntegrationTest#testReplicateSourceDefault() > -- > > Key: KAFKA-15292 > URL: https://issues.apache.org/jira/browse/KAFKA-15292 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Kirk True >Priority: Major > Labels: flaky-test, mirror-maker, mirrormaker > > The test testReplicateSourceDefault in > `org.apache.kafka.connect.mirror.integration.IdentityReplicationIntegrationTest > is flaky about 2% of the time as shown in [Gradle > Enterprise|[https://ge.apache.org/scans/tests?search.relativeStartTime=P90D=kafka=America/Los_Angeles=org.apache.kafka.connect.mirror.integration.IdentityReplicationIntegrationTest=FLAKY]]. > > {code:java} > java.lang.RuntimeException: Could not stop worker > at > o.a.k.connect.util.clusters.EmbeddedConnectCluster.stopWorker(EmbeddedConnectCluster.java:230) > at java.lang.Iterable.forEach(Iterable.java:75) > at > o.a.k.connect.util.clusters.EmbeddedConnectCluster.stop(EmbeddedConnectCluster.java:163) > at > o.a.k.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.shutdownClusters(MirrorConnectorsIntegrationBaseTest.java:267) > at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) > at > jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:568) > at > org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:727) > ... 
> at > worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74) > Caused by: java.lang.IllegalStateException: !STOPPED > at > org.eclipse.jetty.server.handler.HandlerWrapper.destroy(HandlerWrapper.java:140) > at o.a.k.connect.runtime.rest.RestServer.stop(RestServer.java:361) > at o.a.k.connect.runtime.Connect.stop(Connect.java:69) at > o.a.k.connect.util.clusters.WorkerHandle.stop(WorkerHandle.java:57) > at > o.a.k.connect.util.clusters.EmbeddedConnectCluster.stopWorker(EmbeddedConnectCluster.java:225) > ... 93 more{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15392) RestServer starts but does not stop ServletContextHandler
[ https://issues.apache.org/jira/browse/KAFKA-15392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15392: Description: Due to the initialization order of the connect RestServer and Herder, the jetty Server is started before the ServletContextHandler instances are installed. This causes jetty to consider them "unmanaged" and thus will not call the start() and stop() lifecycle on our behalf. RestServer#initializeResources already explicitly calls start() for these unmanaged resources, but there is no accompanying stop() call, so the resources never enter the STOPPED state. The jetty server has one more operation after stopping: destroy(), which asserts that resources are already stopped. If the jetty server is ever destroyed, this exception will be thrown: {noformat} java.lang.IllegalStateException: !STOPPED at org.eclipse.jetty.server.handler.HandlerWrapper.destroy(HandlerWrapper.java:140) at o.a.k.connect.runtime.rest.RestServer.stop(RestServer.java:361){noformat} Fortunately, destroy() is currently only called when an error has already occurred, so this IllegalStateException is never thrown on happy-path execution. Instead, if RestServer shutdown encounters an error (such as exceeding the GRACEFUL_SHUTDOWN_TIMEOUT and timing out) the other error will be shadowed by the IllegalStateException. Rather than only calling destroy() on failure and shadowing the error, destroy() should always be called and its errors reported separately. was: Due to the initialization order of the connect RestServer and Herder, the jetty Server is started before the ServletContextHandler instances are installed. This causes jetty to consider them "unmanaged" and thus will not call the start() and stop() lifecycle on our behalf. RestServer#initializeResources already explicitly calls start() for these unmanaged resources, but there is no accompanying stop() call, so the resources never enter the STOPPED state. 
The jetty server has one more operation after stopping: destroy(), which asserts that resources are already stopped. If the jetty server is ever destroyed, this exception will be thrown: java.lang.IllegalStateException: !STOPPED at org.eclipse.jetty.server.handler.HandlerWrapper.destroy(HandlerWrapper.java:140) at o.a.k.connect.runtime.rest.RestServer.stop(RestServer.java:361) Fortunately, destroy() is currently only called when an error has already occurred, so this IllegalStateException is never thrown on happy-path execution. Instead, if RestServer shutdown encounters an error (such as exceeding the GRACEFUL_SHUTDOWN_TIMEOUT and timing out) the other error will be shadowed by the IllegalStateException. Rather than only calling destroy() on failure and shadowing the error, destroy() should always be called and its errors reported separately. > RestServer starts but does not stop ServletContextHandler > - > > Key: KAFKA-15392 > URL: https://issues.apache.org/jira/browse/KAFKA-15392 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > > Due to the initialization order of the connect RestServer and Herder, the > jetty Server is started before the ServletContextHandler instances are > installed. This causes jetty to consider them "unmanaged" and thus will not > call the start() and stop() lifecycle on our behalf. > RestServer#initializeResources already explicitly calls start() for these > unmanaged resources, but there is no accompanying stop() call, so the > resources never enter the STOPPED state. > The jetty server has one more operation after stopping: destroy(), which > asserts that resources are already stopped. 
If the jetty server is ever > destroyed, this exception will be thrown: > {noformat} > java.lang.IllegalStateException: !STOPPED > at > org.eclipse.jetty.server.handler.HandlerWrapper.destroy(HandlerWrapper.java:140) > at o.a.k.connect.runtime.rest.RestServer.stop(RestServer.java:361){noformat} > Fortunately, destroy() is currently only called when an error has already > occurred, so this IllegalStateException is never thrown on happy-path > execution. Instead, if RestServer shutdown encounters an error (such as > exceeding the GRACEFUL_SHUTDOWN_TIMEOUT and timing out) the other error will > be shadowed by the IllegalStateException. > Rather than only calling destroy() on failure and shadowing the error, > destroy() should always be called and its errors reported separately. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15392) RestServer starts but does not stop ServletContextHandler
Greg Harris created KAFKA-15392: --- Summary: RestServer starts but does not stop ServletContextHandler Key: KAFKA-15392 URL: https://issues.apache.org/jira/browse/KAFKA-15392 Project: Kafka Issue Type: Bug Components: KafkaConnect Reporter: Greg Harris Assignee: Greg Harris Due to the initialization order of the connect RestServer and Herder, the jetty Server is started before the ServletContextHandler instances are installed. This causes jetty to consider them "unmanaged" and thus will not call the start() and stop() lifecycle on our behalf. RestServer#initializeResources already explicitly calls start() for these unmanaged resources, but there is no accompanying stop() call, so the resources never enter the STOPPED state. The jetty server has one more operation after stopping: destroy(), which asserts that resources are already stopped. If the jetty server is ever destroyed, this exception will be thrown: java.lang.IllegalStateException: !STOPPED at org.eclipse.jetty.server.handler.HandlerWrapper.destroy(HandlerWrapper.java:140) at o.a.k.connect.runtime.rest.RestServer.stop(RestServer.java:361) Fortunately, destroy() is currently only called when an error has already occurred, so this IllegalStateException is never thrown on happy-path execution. Instead, if RestServer shutdown encounters an error (such as exceeding the GRACEFUL_SHUTDOWN_TIMEOUT and timing out) the other error will be shadowed by the IllegalStateException. Rather than only calling destroy() on failure and shadowing the error, destroy() should always be called and its errors reported separately. -- This message was sent by Atlassian Jira (v8.20.10#820010)
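The fix KAFKA-15392 describes, always calling destroy() and reporting its errors separately rather than letting them shadow the stop() error, can be sketched with plain Java suppressed exceptions. The Runnable hooks below are hypothetical stand-ins, not the actual RestServer API:

```java
// Illustrative shutdown helper: run stop(), then always run destroy(), and
// attach any destroy() failure as a suppressed exception so it cannot shadow
// the original stop() error. The stop/destroy hooks are hypothetical.
public final class ShutdownSketch {
    public static Exception stopThenDestroy(Runnable stop, Runnable destroy) {
        Exception primary = null;
        try {
            stop.run();
        } catch (Exception e) {
            primary = e; // e.g. graceful-shutdown timeout
        }
        try {
            destroy.run(); // always invoked, even after a stop() failure
        } catch (Exception e) {
            if (primary == null) {
                primary = e;
            } else {
                primary.addSuppressed(e); // reported separately, not shadowed
            }
        }
        return primary; // null when both calls succeeded
    }
}
```

With this shape, a shutdown timeout surfaces as the primary exception and the Jetty "!STOPPED" IllegalStateException shows up in its suppressed list, instead of replacing it.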
[jira] [Commented] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757542#comment-17757542 ] Greg Harris commented on KAFKA-15372: - Reading through the KIP-710 thread, it appears that the public API was intentionally left out of the implementation to avoid security problems. In particular, the connector config endpoint security is typically managed by a rest extension and by default is unsecured, while the internal endpoints are already secured by default. If we were to keep the current security posture of KIP-710 (all endpoints secured) and add a new endpoint, we would need to secure it by default. This would be a divergence between MM2 dedicated mode and normal Connect Distributed mode that the KIP discussion wanted to avoid. Because of that, I don't think we can implement connector config forwarding in a bugfix, it's going to at least need a slightly longer discussion about securing the original endpoint vs duplicating the endpoint. I think that means that we'll need to implement the alternative solution, which delays applying the configuration until the worker becomes the leader. > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Priority: Major > Fix For: 3.6.0 > > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. 
> For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
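The alternative fix mentioned in the comment above, delaying the configuration update until the restarted worker becomes the group leader, could be sketched as a retry loop. The isLeader/applyConfig hooks and the attempt budget are illustrative assumptions, not the actual MirrorMaker API:

```java
import java.util.function.BooleanSupplier;

// Hypothetical retry loop: keep trying to apply the local connector
// configuration until this worker holds group leadership, instead of the
// current single attempt at startup. Hook names are illustrative only.
public final class LeaderRetrySketch {
    public static int applyWhenLeader(BooleanSupplier isLeader, Runnable applyConfig,
                                      int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isLeader.getAsBoolean()) {
                applyConfig.run();   // only the leader may write connector configs
                return attempt;      // number of attempts used
            }
            // a real worker would back off here rather than spin
        }
        return -1; // leadership never acquired within the attempt budget
    }
}
```

As the comment notes, this avoids REST API changes but can make configurations flap between versions as leadership moves between workers with different local configs.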
[jira] [Resolved] (KAFKA-14627) Modernize Connect plugin discovery
[ https://issues.apache.org/jira/browse/KAFKA-14627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris resolved KAFKA-14627. - Resolution: Fixed > Modernize Connect plugin discovery > -- > > Key: KAFKA-14627 > URL: https://issues.apache.org/jira/browse/KAFKA-14627 > Project: Kafka > Issue Type: New Feature > Components: KafkaConnect >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > Fix For: 3.6.0 > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-898%3A+Modernize+Connect+plugin+discovery -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15226) System tests for plugin.discovery worker configuration
[ https://issues.apache.org/jira/browse/KAFKA-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris resolved KAFKA-15226. - Fix Version/s: 3.6.0 Resolution: Fixed > System tests for plugin.discovery worker configuration > -- > > Key: KAFKA-15226 > URL: https://issues.apache.org/jira/browse/KAFKA-15226 > Project: Kafka > Issue Type: Test > Components: KafkaConnect >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > Fix For: 3.6.0 > > > Add system tests as described in KIP-898, targeting the startup behavior of > the connect worker, various states of plugin migration, and the migration > script. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756062#comment-17756062 ] Greg Harris commented on KAFKA-15372: - Here's a proof-of-concept for the fix: [https://github.com/apache/kafka/pull/14245] > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Priority: Major > Fix For: 3.6.0 > > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756048#comment-17756048 ] Greg Harris edited comment on KAFKA-15372 at 8/18/23 4:38 PM: -- [~durban] thank you for the sanity check, I see the flaw now. The NotLeaderException is created/thrown in the Herder, and the caller (typically the ConnectorsResource) is responsible for handling the error and forwarding to the leader. See HerderRequestHandler#completeOrForwardRequest. I thought the MirrorMaker class used this handler for applying the connector configuration on startup, but it doesn't. It never triggers the forwarding logic that normal REST requests have. More importantly, KIP-710 only added {_}internal rest resources{_}, which are for writing task configurations and fencing zombies. It didn't include the connector configuration endpoint, which is typically public-api. This was an oversight in the KIP-710 design/implementation, and should be fixed. Currently, KIP-710 only benefits the internal herder requests for writing task configurations and fencing zombies, it is not capable of forwarding connector configurations. Since KIP-710 already added the necessary configuration options to enable/disable the internal REST, and this arguably should have been covered by the original implementation, I think we can address this in a bug-fix. Alternatively, we could have MirrorMaker periodically apply its local configurations, retrying until it receives leadership, and then stopping. This wouldn't require changes to the REST API, and would cover some additional error scenarios, but would possibly cause configurations to flap between versions as leadership changes. To workaround this issue, users will need to fully stop and fully start MM2, to allow the first node to apply configurations successfully. was (Author: gharris1727): [~durban] thank you for the sanity check, I see the flaw now. 
The NotLeaderException is created/thrown in the Herder, and the caller (typically the ConnectorsResource) is responsible for handling the error and forwarding to the leader. See HerderRequestHandler#completeOrForwardRequest. I thought the MirrorMaker class used this handler for applying the connector configuration on startup, but it doesn't. It never triggers the forwarding logic that normal REST requests have. More importantly, KIP-710 only added {_}internal rest resources{_}, which are for writing task configurations and fencing zombies. It didn't include the connector configuration endpoint, which is typically public-api. This was an oversight in the KIP-710 design/implementation, and should be fixed. Currently, KIP-710 only benefits the internal herder requests for writing task configurations and fencing zombies, it is not capable of forwarding connector configurations. Since KIP-710 already added the necessary configuration options to enable/disable the internal REST, and this arguably should have been covered by the original implementation, I think we can address this in a bug-fix. To workaround this issue, users will need to fully stop and fully start MM2, to allow the first node to apply configurations successfully. > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Priority: Major > Fix For: 3.6.0 > > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. 
> For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15372: Issue Type: Bug (was: Improvement) > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Priority: Major > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15372: Fix Version/s: 3.6.0 > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Reporter: Daniel Urban >Priority: Major > Fix For: 3.6.0 > > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756048#comment-17756048 ] Greg Harris commented on KAFKA-15372: - [~durban] thank you for the sanity check, I see the flaw now. The NotLeaderException is created/thrown in the Herder, and the caller (typically the ConnectorsResource) is responsible for handling the error and forwarding to the leader. See HerderRequestHandler#completeOrForwardRequest. I thought the MirrorMaker class used this handler for applying the connector configuration on startup, but it doesn't. It never triggers the forwarding logic that normal REST requests have. More importantly, KIP-710 only added {_}internal rest resources{_}, which are for writing task configurations and fencing zombies. It didn't include the connector configuration endpoint, which is typically public-api. This was an oversight in the KIP-710 design/implementation, and should be fixed. Currently, KIP-710 only benefits the internal herder requests for writing task configurations and fencing zombies, it is not capable of forwarding connector configurations. Since KIP-710 already added the necessary configuration options to enable/disable the internal REST, and this arguably should have been covered by the original implementation, I think we can address this in a bug-fix. To workaround this issue, users will need to fully stop and fully start MM2, to allow the first node to apply configurations successfully. > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Reporter: Daniel Urban >Priority: Major > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. 
> In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. > # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15227) Use plugin.discovery=SERVICE_LOAD in all plugin test suites
[ https://issues.apache.org/jira/browse/KAFKA-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755658#comment-17755658 ] Greg Harris commented on KAFKA-15227: - We can delay this for some time, as the performance improvements are marginal in AK. It will also catch any non-migrated plugins added in PRs. > Use plugin.discovery=SERVICE_LOAD in all plugin test suites > --- > > Key: KAFKA-15227 > URL: https://issues.apache.org/jira/browse/KAFKA-15227 > Project: Kafka > Issue Type: Test > Components: KafkaConnect >Reporter: Greg Harris >Priority: Major > > To speed up these tests where we know all plugins are migrated, use > SERVICE_LOAD mode in all known test suites. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-15227) Use plugin.discovery=SERVICE_LOAD in all plugin test suites
[ https://issues.apache.org/jira/browse/KAFKA-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris reassigned KAFKA-15227: --- Assignee: (was: Greg Harris) > Use plugin.discovery=SERVICE_LOAD in all plugin test suites > --- > > Key: KAFKA-15227 > URL: https://issues.apache.org/jira/browse/KAFKA-15227 > Project: Kafka > Issue Type: Test > Components: KafkaConnect >Reporter: Greg Harris >Priority: Major > > To speed up these tests where we know all plugins are migrated, use > SERVICE_LOAD mode in all known test suites. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755647#comment-17755647 ] Greg Harris commented on KAFKA-15372: - [~durban] I should clarify, I believe the behavior you described is the expected (but undesirable) behavior for versions before 3.5.0, and for 3.5.0+ with the configuration set to the default `false`. When the internal REST API is enabled, the worker which is starting (that is not the leader) should forward configurations to the leader via the internal REST API. If you're not seeing the REST forwarding being attempted, or the forwarding failing, that would explain why configuration updates are being dropped. If the above does not apply, then I'm interested to hear more details of your reproduction case. Thanks! > MM2 rolling restart can drop configuration changes silently > --- > > Key: KAFKA-15372 > URL: https://issues.apache.org/jira/browse/KAFKA-15372 > Project: Kafka > Issue Type: Improvement > Components: mirrormaker >Reporter: Daniel Urban >Priority: Major > > When MM2 is restarted, it tries to update the Connector configuration in all > flows. This is a one-time trial, and fails if the Connect worker is not the > leader of the group. > In a distributed setup and with a rolling restart, it is possible that for a > specific flow, the Connect worker of the just restarted MM2 instance is not > the leader, meaning that Connector configurations can get dropped. > For example, assuming 2 MM2 instances, and one flow A->B: > # MM2 instance 1 is restarted, the worker inside MM2 instance 2 becomes the > leader of A->B Connect group. 
> # MM2 instance 1 tries to update the Connector configurations, but fails > (instance 2 has the leader, not instance 1) > # MM2 instance 2 is restarted, leadership moves to worker in MM2 instance 1 > # MM2 instance 2 tries to update the Connector configurations, but fails > At this point, the configuration changes before the restart are never > applied. Many times, this can also happen silently, without any indication.
[jira] [Commented] (KAFKA-15372) MM2 rolling restart can drop configuration changes silently
[ https://issues.apache.org/jira/browse/KAFKA-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755629#comment-17755629 ] Greg Harris commented on KAFKA-15372: - Hi [~durban] Thanks for the bug report. Is this reproducible with 3.5.0 and `dedicated.mode.enable.internal.rest` set to `true`? This configuration was added in [https://cwiki.apache.org/confluence/display/KAFKA/KIP-710%3A+Full+support+for+distributed+mode+in+dedicated+MirrorMaker+2.0+clusters] .
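For reference, the KIP-710 option mentioned above is set in the dedicated-mode MM2 properties file. A minimal sketch (the cluster aliases `A` and `B` are hypothetical placeholders, not from this report):

```properties
# KIP-710 (Kafka 3.5.0+): enable the internal REST server so that a
# non-leader worker can forward connector config updates to the leader.
# The default is false, which leaves the config-drop behavior described above.
dedicated.mode.enable.internal.rest = true

# Hypothetical two-cluster replication flow A -> B
clusters = A, B
A->B.enabled = true
```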
[jira] [Resolved] (KAFKA-15228) Add sync-manifests subcommand to connect-plugin-path tool
[ https://issues.apache.org/jira/browse/KAFKA-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris resolved KAFKA-15228. - Resolution: Fixed > Add sync-manifests subcommand to connect-plugin-path tool > - > > Key: KAFKA-15228 > URL: https://issues.apache.org/jira/browse/KAFKA-15228 > Project: Kafka > Issue Type: New Feature > Components: KafkaConnect, tools >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > Fix For: 3.6.0 > >
[jira] [Resolved] (KAFKA-15350) MetadataLoaderMetrics has ClassNotFoundException in system tests
[ https://issues.apache.org/jira/browse/KAFKA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris resolved KAFKA-15350. - Resolution: Invalid I believe this was an environmental problem, due to some stale artifacts. A clean build fixed the issues I was seeing. > MetadataLoaderMetrics has ClassNotFoundException in system tests > > > Key: KAFKA-15350 > URL: https://issues.apache.org/jira/browse/KAFKA-15350 > Project: Kafka > Issue Type: Bug > Components: kraft, system tests >Affects Versions: 3.6.0 >Reporter: Greg Harris >Priority: Blocker > > The system tests appear to be failing on trunk with: > {noformat} > [2023-08-15 22:11:34,235] ERROR Exiting Kafka due to fatal exception > (kafka.Kafka$) > java.lang.NoClassDefFoundError: > org/apache/kafka/image/loader/MetadataLoaderMetrics >at kafka.server.KafkaRaftServer.(KafkaRaftServer.scala:68) >at kafka.Kafka$.buildServer(Kafka.scala:83) >at kafka.Kafka$.main(Kafka.scala:91) >at kafka.Kafka.main(Kafka.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.image.loader.MetadataLoaderMetrics >at java.net.URLClassLoader.findClass(URLClassLoader.java:387) >at java.lang.ClassLoader.loadClass(ClassLoader.java:418) >at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) >at java.lang.ClassLoader.loadClass(ClassLoader.java:351) >... 4 more{noformat} > This happens with the `tests/kafkatest/tests/connect/`, > `tests/kafkatest/tests/core/kraft_upgrade_test.py` and possibly others.
[jira] [Updated] (KAFKA-15350) MetadataLoaderMetrics has ClassNotFoundException in system tests
[ https://issues.apache.org/jira/browse/KAFKA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15350: Issue Type: Bug (was: Improvement)
[jira] [Updated] (KAFKA-15336) Connect plugin Javadocs should mention serviceloader manifests
[ https://issues.apache.org/jira/browse/KAFKA-15336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15336: Issue Type: Improvement (was: Bug) > Connect plugin Javadocs should mention serviceloader manifests > -- > > Key: KAFKA-15336 > URL: https://issues.apache.org/jira/browse/KAFKA-15336 > Project: Kafka > Issue Type: Improvement > Components: KafkaConnect >Affects Versions: 3.6.0 >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Minor > Fix For: 3.6.0 > > > Similar to the ConfigProvider, the Javadocs for the Connect plugin classes > should mention that plugin implementations should have ServiceLoader > manifests.
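The manifests referred to here are standard java.util.ServiceLoader provider-configuration files: a resource named after the plugin interface, listing the implementation's fully-qualified class name. A sketch for a hypothetical connector (the class `com.example.MySourceConnector` is a placeholder):

```text
# File: META-INF/services/org.apache.kafka.connect.source.SourceConnector
com.example.MySourceConnector
```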
[jira] [Updated] (KAFKA-15336) Connect plugin Javadocs should mention serviceloader manifests
[ https://issues.apache.org/jira/browse/KAFKA-15336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-15336: Issue Type: Bug (was: Improvement)
[jira] [Created] (KAFKA-15350) MetadataLoaderMetrics has ClassNotFoundException in system tests
Greg Harris created KAFKA-15350: --- Summary: MetadataLoaderMetrics has ClassNotFoundException in system tests Key: KAFKA-15350 URL: https://issues.apache.org/jira/browse/KAFKA-15350 Project: Kafka Issue Type: Improvement Components: kraft, system tests Affects Versions: 3.6.0 Reporter: Greg Harris
[jira] [Created] (KAFKA-15349) ducker-ak should fail fast when gradlew systemTestLibs fails
Greg Harris created KAFKA-15349: --- Summary: ducker-ak should fail fast when gradlew systemTestLibs fails Key: KAFKA-15349 URL: https://issues.apache.org/jira/browse/KAFKA-15349 Project: Kafka Issue Type: Improvement Components: system tests Reporter: Greg Harris If you introduce a flaw into the gradle build which causes the systemTestLibs to fail, such as a circular dependency, then the ducker_test function continues to run tests which are invalid. Rather than proceeding to run the tests, the script should fail fast and make the user address the error before continuing.
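The fail-fast behavior requested above can be sketched as follows. ducker-ak itself is a shell script, so this Python version (with a hypothetical `run_or_abort` helper) only illustrates the idea of checking the build step's exit code before running tests:

```python
import subprocess
import sys

def run_or_abort(cmd):
    """Run cmd; if it exits non-zero, stop the whole run instead of continuing."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Fail fast: surface the build error rather than running invalid tests.
        sys.exit("command failed (%d): %s" % (result.returncode, " ".join(cmd)))

# Hypothetical usage, before launching any tests:
# run_or_abort(["./gradlew", "systemTestLibs"])
```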
[jira] [Commented] (KAFKA-15343) Fix MirrorConnectIntegrationTests causing ci build failures.
[ https://issues.apache.org/jira/browse/KAFKA-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754771#comment-17754771 ] Greg Harris commented on KAFKA-15343: - Hi [~prasanth] and thank you for reporting this issue! It is certainly not good that one test can cause the whole build to fail, preventing other tests from running. Can you speak to the frequency with which you've seen this failure? Naively I would expect that with >1 ephemeral ports available, such a failure would be quite rare. If this is true, I don't think it is appropriate to disable these tests. They are extremely important test coverage for the MirrorMaker2 feature, and disabling them may lead to undetected regressions. As far as resolving this issue, I think we should: 1. Find where we are leaking Kafka clients in the MM2 integration test suites, either within the framework or within the Mirror connectors. 2. Close Kafka clients in a timely fashion (some relevant work in https://issues.apache.org/jira/browse/KAFKA-14725 and https://issues.apache.org/jira/browse/KAFKA-15090 ) 3. Try to reproduce the Gradle daemon crash in a more controlled environment 4. Report the daemon crash to the Gradle upstream Since random port selection and port-reuse are standard procedures (not specific to Kafka), there could be downstream projects using Gradle that are affected. If there is something specific about the Kafka clients' connections that affects Gradle, then we should investigate further to help the Gradle project resolve the issue. > Fix MirrorConnectIntegrationTests causing ci build failures. > > > Key: KAFKA-15343 > URL: https://issues.apache.org/jira/browse/KAFKA-15343 > Project: Kafka > Issue Type: Bug > Components: build >Affects Versions: 3.6.0 >Reporter: Prasanth Kumar >Priority: Major > > There are several instances of tests interacting badly with gradle daemon(s) > running on ports that the kafka broker previously used. 
After going through > the debug logs we observed a few retrying kafka clients trying to connect to > a broker which had been shut down, while the gradle worker chose the same port on which > the broker was running. Later in the build, the gradle daemon attempted to > connect to the worker and could not, triggering a failure. Ideally gradle > would not exit when connected to by an invalid client - in testing with > netcat, it would often handle these without dying. However there appear to be > some cases where the daemon dies completely. Both the broker code and the > gradle workers bind to port 0, resulting in the OS assigning it an unused > port. This does avoid conflicts, but does not ensure that long-lived clients > do not attempt to connect to these ports afterwards. It's possible that > closing the client in between may be enough to work around this issue. Until > then we will disable the test to keep this CI blocker from holding up testing of code > changes. > *MirrorConnectorsIntegrationBaseTest and extending Tests* > {code:java} > [2023-07-04T11:48:16.128Z] 2023-07-04T11:47:46.804+ [DEBUG] > [TestEventLogger] > MirrorConnectorsWithCustomForwardingAdminIntegrationTest > > testReplicateSourceDefault() STANDARD_OUT > [2023-07-04T11:48:16.128Z] 2023-07-04T11:47:46.804+ [DEBUG] > [TestEventLogger] [2023-07-04 11:47:46,799] > INFO primary REST service: http://localhost:43809/connectors > (org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest:224) > [2023-07-04T11:48:16.128Z] 2023-07-04T11:47:46.804+ [DEBUG] > [TestEventLogger] [2023-07-04 11:47:46,799] > INFO backup REST service: http://localhost:43323/connectors > (org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest:225) > [2023-07-04T11:48:16.128Z] 2023-07-04T11:47:46.804+ [DEBUG] > [TestEventLogger] [2023-07-04 11:47:46,799] > INFO primary brokers: localhost:37557 > (org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest:226) > [2023-07-04T11:59:12.968Z] 
2023-07-04T11:59:12.900+ [DEBUG] > [org.gradle.internal.remote.internal.inet.TcpIncomingConnector] > Accepted connection from /127.0.0.1:47660 to /127.0.0.1:37557. > [2023-07-04T11:59:13.233Z] > org.gradle.internal.remote.internal.MessageIOException: Could not read > message from '/127.0.0.1:47660'. > [2023-07-04T11:59:12.970Z] 2023-07-04T11:59:12.579+ [DEBUG] > [org.gradle.internal.remote.internal.inet.TcpIncomingConnector] Listening on > [d6bf30cb-bca2-46d9-8aeb-b9fd0497f54d port:37557, > addresses:[localhost/127.0.0.1]]. > [2023-07-04T11:59:46.519Z] 2023-07-04T11:59:13.014+ [ERROR] > [system.err] org.gradle.internal.remote.internal.ConnectException: Could not > connect to server [d6bf30cb-bca2-46d9-8aeb-b9fd0497f54d port:37557, >
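The port-reuse mechanism described in this report is easy to demonstrate: binding to port 0 asks the OS for any currently-unused ephemeral port, and nothing reserves that port once its owner closes the socket. A small standalone sketch (not Kafka or Gradle code):

```python
import socket

# Ask the OS for an unused ephemeral port, the way both the test brokers
# and the gradle workers do when they bind to port 0.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))
port = s1.getsockname()[1]
s1.close()

# Once the first owner is gone, the same port can be handed out again,
# so a client still retrying the old address may reach an unrelated
# server -- here, a second, independent socket bound to the same port.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.bind(("127.0.0.1", port))
s2.close()
```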
[jira] [Created] (KAFKA-15336) Connect plugin Javadocs should mention serviceloader manifests
Greg Harris created KAFKA-15336: --- Summary: Connect plugin Javadocs should mention serviceloader manifests Key: KAFKA-15336 URL: https://issues.apache.org/jira/browse/KAFKA-15336 Project: Kafka Issue Type: Improvement Components: KafkaConnect Affects Versions: 3.6.0 Reporter: Greg Harris Assignee: Greg Harris