[jira] [Created] (KAFKA-16056) Worker poll timeout expiry can lead to Duplicate task assignments.
Sagar Rao created KAFKA-16056: - Summary: Worker poll timeout expiry can lead to Duplicate task assignments. Key: KAFKA-16056 URL: https://issues.apache.org/jira/browse/KAFKA-16056 Project: Kafka Issue Type: Bug Components: KafkaConnect Reporter: Sagar Rao Assignee: Sagar Rao When a worker's poll timeout expires, the worker proactively leaves the group, which triggers a rebalance. Under normal scenarios, leaving the group this way kicks off a scheduled rebalance delay. Note, however, that the worker which left the group temporarily does not give up its assignments, and whatever tasks were running on it keep running as is. When the scheduled rebalance delay elapses, the worker simply gets its assignments back, and since there are no revocations, it all works out fine. There is an edge case here, though. Assume that a scheduled rebalance delay is already active on the group and that, just before the follow-up rebalance triggered by that delay elapsing, one worker's poll timeout expires. At this point a rebalance is imminent, and the leader tracks the assignments of the transiently departed worker as lost [here|https://github.com/apache/kafka/blob/d582d5aff517879b150bc2739bad99df07e15e2b/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/IncrementalCooperativeAssignor.java#L255]. When [handleLostAssignments|https://github.com/apache/kafka/blob/d582d5aff517879b150bc2739bad99df07e15e2b/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/IncrementalCooperativeAssignor.java#L441] gets triggered, the scheduledRebalance delay has not been reset yet, so if [this|https://github.com/apache/kafka/blob/d582d5aff517879b150bc2739bad99df07e15e2b/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/IncrementalCooperativeAssignor.java#L473] condition passes, the leader assumes it needs to reassign all the lost assignments, which it does. But because the worker whose poll timeout expired never rescinds its assignments, we end up with duplicate assignments: one set on the original worker, which is still running the tasks and connectors, and another set on the remaining workers, which received the redistributed work. This can lead to task failures if a connector is written in a way that expects no duplicate tasks to be running across the cluster. This edge case is also encountered more frequently if the `rebalance.timeout.ms` config is set to a lower value. One approach could be to do something similar to https://issues.apache.org/jira/browse/KAFKA-9184, where, upon coordinator discovery failure, the worker gives up all its assignments and rejoins with an empty assignment. We could do something similar in this case as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
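A minimal sketch of the mitigation floated above, assuming the KAFKA-9184-style approach of rescinding local state before rejoining; the class and method names below are hypothetical stand-ins, not the actual Connect DistributedHerder/WorkerCoordinator API:

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch: when the poll timeout expires and the worker proactively
// leaves the group, it also stops and forgets its local connectors and tasks so
// it rejoins with an empty assignment, letting the leader redistribute the work
// without creating duplicates.
public class PollTimeoutAssignmentReset {

    private final List<String> runningConnectors = new ArrayList<>();
    private final List<String> runningTasks = new ArrayList<>();

    /** Called when the heartbeat thread detects that the poll timeout expired. */
    public synchronized void onPollTimeoutExpired() {
        stopAll(runningConnectors);
        stopAll(runningTasks);
        runningConnectors.clear();
        runningTasks.clear();
        // Rejoining with an empty assignment means the leader's lost-assignment
        // handling can hand the work to other workers without double-running it.
    }

    private void stopAll(Collection<String> resources) {
        for (String resource : resources) {
            System.out.println("stopping " + resource); // placeholder for real stop logic
        }
    }
}
{code}

The trade-off of this approach is a brief availability gap while the tasks wait for the next assignment, which is usually preferable to two copies of the same task running concurrently.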
Re: [PR] [KAFKA-16015] Fix custom timeouts overwritten by defaults [kafka]
sciclon2 commented on code in PR #15030: URL: https://github.com/apache/kafka/pull/15030#discussion_r1437137364 ## tools/src/main/java/org/apache/kafka/tools/LeaderElectionCommand.java: ## @@ -99,8 +99,12 @@ static void run(Duration timeoutMs, String... args) throws Exception { props.putAll(Utils.loadProps(commandOptions.getAdminClientConfig())); } props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, commandOptions.getBootstrapServer()); -props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis())); -props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis() / 2)); +if (!props.containsKey(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG)) { +props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis())); +} +if (!props.containsKey(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG)) { +props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis() / 2)); Review Comment: @divijvaidya, I will need to convert timeoutMs.toMillis() to an int, since the return value is a long; this comes from [here](https://github.com/apache/kafka/pull/15030/files#diff-7f2ce49cbfada7fbfff8453254f686d835710e2ee620b8905670e69ceaaa1809R66). I will then convert it to an int like this: `props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, Integer.toString((int) timeoutMs.toMillis()));` Agree? The change will apply to both configs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
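A self-contained sketch of the conversion being discussed. Duration#toMillis() returns a long while the admin client timeout configs are int-valued, and Math.toIntExact() fails loudly on overflow instead of silently truncating like a plain (int) cast; the config key strings are written out literally here so the snippet compiles without the Kafka AdminClientConfig class on the classpath:

{code:java}
import java.time.Duration;
import java.util.Properties;

public class TimeoutConversionSketch {
    public static void main(String[] args) {
        Duration timeoutMs = Duration.ofSeconds(30);
        Properties props = new Properties();

        // Only apply the command-line timeout if the user did not already set it
        // in the admin client config file (the point of the PR being reviewed).
        if (!props.containsKey("default.api.timeout.ms")) {
            props.setProperty("default.api.timeout.ms",
                Integer.toString(Math.toIntExact(timeoutMs.toMillis())));
        }
        if (!props.containsKey("request.timeout.ms")) {
            props.setProperty("request.timeout.ms",
                Integer.toString(Math.toIntExact(timeoutMs.toMillis() / 2)));
        }
        System.out.println(props);
    }
}
{code}

Whether to prefer Math.toIntExact over the (int) cast proposed in the comment is a judgment call; the exact-conversion variant turns an out-of-range timeout into an immediate ArithmeticException rather than a confusing negative or wrapped value.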
Re: [PR] KAFKA-15556: Remove NetworkClientDelegate methods isUnavailable, maybeThrowAuthFailure, and tryConnect [kafka]
Phuc-Hong-Tran commented on PR #15020: URL: https://github.com/apache/kafka/pull/15020#issuecomment-1870808021 @philipnee @kirktrue, Please take a look if you guys have some free time. Thanks in advance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (KAFKA-15948) Refactor AsyncKafkaConsumer shutdown
[ https://issues.apache.org/jira/browse/KAFKA-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Nee resolved KAFKA-15948. Assignee: Philip Nee (was: Phuc Hong Tran) Resolution: Fixed > Refactor AsyncKafkaConsumer shutdown > > > Key: KAFKA-15948 > URL: https://issues.apache.org/jira/browse/KAFKA-15948 > Project: Kafka > Issue Type: Improvement > Components: clients, consumer >Reporter: Philip Nee >Assignee: Philip Nee >Priority: Major > Labels: consumer-threading-refactor > Fix For: 3.8.0 > > > Upon closing we need a round trip from the network thread to the application > thread and then back to the network thread to complete the callback > invocation. Currently, we don't have any of that. I think we need to > refactor our closing mechanism. There are a few points to the refactor: > # The network thread should know if there's a custom user callback to > trigger or not. If there is, it should wait for the callback completion to > send a leave group. If not, it should proceed with the shutdown. > # The application thread sends a closing signal to the network thread and > continuously polls the background event handler until time runs out. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
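A rough illustration of the shutdown handshake the ticket describes, under the assumption that close() runs on the application thread while a separate network thread owns group membership; none of the names below are the real AsyncKafkaConsumer internals, and the blocking join() in the network-thread method is a simplification for readability:

{code:java}
import java.time.Duration;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ShutdownHandshakeSketch {
    private final BlockingQueue<Runnable> backgroundEvents = new LinkedBlockingQueue<>();
    private final CompletableFuture<Void> closeSignal = new CompletableFuture<>();
    private final CompletableFuture<Void> callbackDone = new CompletableFuture<>();

    // Application thread: signal the close, then keep draining background events
    // (where user callbacks run) until the callback completes or time runs out.
    public void close(Duration timeout) throws InterruptedException {
        closeSignal.complete(null);
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline && !callbackDone.isDone()) {
            Runnable event = backgroundEvents.poll(10, TimeUnit.MILLISECONDS);
            if (event != null) {
                event.run();
            }
        }
    }

    // Network thread: if a custom user callback exists, hand it to the application
    // thread and wait for its completion before sending the leave-group request.
    public void onCloseSignal(boolean hasUserCallback) {
        if (!closeSignal.isDone()) {
            return;
        }
        if (hasUserCallback) {
            backgroundEvents.offer(() -> {
                runUserCallback();
                callbackDone.complete(null);
            });
            callbackDone.join();
        }
        sendLeaveGroup();
    }

    private void runUserCallback() {
        System.out.println("user rebalance callback invoked");
    }

    private void sendLeaveGroup() {
        System.out.println("LeaveGroup sent");
    }
}
{code}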
[jira] [Assigned] (KAFKA-15948) Refactor AsyncKafkaConsumer shutdown
[ https://issues.apache.org/jira/browse/KAFKA-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phuc Hong Tran reassigned KAFKA-15948: -- Assignee: Phuc Hong Tran > Refactor AsyncKafkaConsumer shutdown > > > Key: KAFKA-15948 > URL: https://issues.apache.org/jira/browse/KAFKA-15948 > Project: Kafka > Issue Type: Improvement > Components: clients, consumer >Reporter: Philip Nee >Assignee: Phuc Hong Tran >Priority: Major > Labels: consumer-threading-refactor > Fix For: 3.8.0 > > > Upon closing we need a round trip from the network thread to the application > thread and then back to the network thread to complete the callback > invocation. Currently, we don't have any of that. I think we need to > refactor our closing mechanism. There are a few points to the refactor: > # The network thread should know if there's a custom user callback to > trigger or not. If there is, it should wait for the callback completion to > send a leave group. If not, it should proceed with the shutdown. > # The application thread sends a closing signal to the network thread and > continuously polls the background event handler until time runs out. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] KAFKA-15344: Commit leader epoch where possible [kafka]
github-actions[bot] commented on PR #14454: URL: https://github.com/apache/kafka/pull/14454#issuecomment-1870790976 This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch) If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: Range assignor changes [kafka]
github-actions[bot] commented on PR #14345: URL: https://github.com/apache/kafka/pull/14345#issuecomment-1870791020 This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch) If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
showuon commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1437350376 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -2704,28 +2710,30 @@ class ReplicaManagerTest { time = time, scheduler = time.scheduler, logManager = logManager, - quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""), + quotaManagers = quotaManager, metadataCache = MetadataCache.zkMetadataCache(config.brokerId, config.interBrokerProtocolVersion), logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size), alterPartitionManager = alterPartitionManager, threadNamePrefix = Option(this.getClass.getName)) -logManager.startup(Set.empty[String]) - -// Create a hosted topic, a hosted topic that will become stray, and a stray topic -val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet -createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet -createStrayLogs(10, logManager) +try { + logManager.startup(Set.empty[String]) -val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) + // Create a hosted topic, a hosted topic that will become stray, and a stray topic + val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet + createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet + createStrayLogs(10, logManager) - replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) + val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) -assertEquals(validLogs, logManager.allLogs.toSet) -assertEquals(validLogs.size, replicaManager.partitionCount.value) + replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) -replicaManager.shutdown() -logManager.shutdown() + assertEquals(validLogs, logManager.allLogs.toSet) + assertEquals(validLogs.size, replicaManager.partitionCount.value) +} finally { + replicaManager.shutdown() + logManager.shutdown() Review Comment: I've updated the PR. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
showuon commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1437350213 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -2639,59 +2639,65 @@ class ReplicaManagerTest { time = time, scheduler = time.scheduler, logManager = logManager, - quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""), Review Comment: I agree we should not change any test semantics here. I updated the PR to close the new created `quotaManager` explicitly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
satishd commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1437344252 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -167,7 +167,7 @@ class ReplicaManagerTest { .foreach(checkpointFile => assertTrue(Files.exists(checkpointFile), s"checkpoint file does not exist at $checkpointFile")) } finally { - rm.shutdown(checkpointHW = false) + CoreUtils.swallow(rm.shutdown(checkpointHW = false), this) Review Comment: Why is the exception swallowed here and all other places with the latest change? I did not mean to suggest this in my review. I clarified that in another [comment](https://github.com/apache/kafka/pull/15077#discussion_r1437341409). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800911#comment-17800911 ] Justine Olshan commented on KAFKA-16052: What do we think about lowering the number of sequences we test (say from 10 to 3 or 4) and the number of groups from 10 to 5 or so. That would lower the number of operations from 35,000 to 7,000. This is used in both the GroupCoordinator and the TransactionCoordinator test too. (The transaction coordinator doesn't have groups, but transactions). We could also look at some of the other tests in the suite. (To be honest, I never really understood how this suite was actually testing concurrency with all the mocks and redefined operations) We probably need to figure out why we store all of these mock invocations since we are just executing the methods anyway. But I think lowering the number of operations should give us some breathing room for the OOMs. What do you think [~divijvaidya] [~ijuma] > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. 
The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
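A quick sanity check of the arithmetic behind the proposal above. The per-member breakdown (7 operations for each of the 250 members, where 250 = nthreads(5) * 5 * 10 groups) comes from the earlier comment quoted further down in this digest; the reduced figures assume 4 random + 4 ordered sequences and 5 groups, which is one way to land on the ~7,000 number:

{code:java}
public class OperationCountSketch {
    public static void main(String[] args) {
        int membersNow      = 5 * 5 * 10;                    // nthreads(5) * 5 * 10 groups = 250
        int currentOps      = (10 + 10) * 7 * membersNow;    // 20 sequences * 7 ops * 250 = 35,000
        int membersProposed = 5 * 5 * 5;                     // groups reduced from 10 to 5 -> 125
        int proposedOps     = (4 + 4) * 7 * membersProposed; // 8 sequences * 7 ops * 125 = 7,000
        System.out.println("current operations:  " + currentOps);
        System.out.println("proposed operations: " + proposedOps);
    }
}
{code}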
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
satishd commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1437341409 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -2704,28 +2710,30 @@ class ReplicaManagerTest { time = time, scheduler = time.scheduler, logManager = logManager, - quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""), + quotaManagers = quotaManager, metadataCache = MetadataCache.zkMetadataCache(config.brokerId, config.interBrokerProtocolVersion), logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size), alterPartitionManager = alterPartitionManager, threadNamePrefix = Option(this.getClass.getName)) -logManager.startup(Set.empty[String]) - -// Create a hosted topic, a hosted topic that will become stray, and a stray topic -val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet -createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet -createStrayLogs(10, logManager) +try { + logManager.startup(Set.empty[String]) -val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) + // Create a hosted topic, a hosted topic that will become stray, and a stray topic + val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet + createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet + createStrayLogs(10, logManager) - replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) + val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) -assertEquals(validLogs, logManager.allLogs.toSet) -assertEquals(validLogs.size, replicaManager.partitionCount.value) + replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) -replicaManager.shutdown() -logManager.shutdown() + assertEquals(validLogs, logManager.allLogs.toSet) + assertEquals(validLogs.size, replicaManager.partitionCount.value) +} finally { + replicaManager.shutdown() + logManager.shutdown() Review Comment: Right @showuon , what I meant here was try closing the second one even if the earlier invocation has thrown an error to avoid leaks caused by not closing the second instance. We should not change any of the other places to swallow the exceptions where there is only one single invocation of shutting down of replicamanager or logmanager. One way to do that is what @divijvaidya suggested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
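To make the cleanup pattern under discussion concrete, here is a minimal, framework-free sketch (plain Java with stand-in names rather than the Scala test code) of shutting down two components so that a failure in the first shutdown cannot leak the second component's resources; the try/finally still runs the second shutdown and then lets the first error propagate, rather than swallowing it:

{code:java}
public class ShutdownBothSketch {

    interface Component {
        void shutdown() throws Exception;
    }

    // Stand-ins for the test's ReplicaManager and LogManager.
    static void shutdownBoth(Component replicaManager, Component logManager) throws Exception {
        try {
            replicaManager.shutdown();
        } finally {
            logManager.shutdown(); // runs even if the first shutdown threw
        }
    }

    public static void main(String[] args) throws Exception {
        Component rm = () -> { throw new IllegalStateException("boom"); };
        Component lm = () -> System.out.println("logManager closed");
        try {
            shutdownBoth(rm, lm); // prints "logManager closed", then rethrows "boom"
        } catch (IllegalStateException e) {
            System.out.println("first shutdown failed: " + e.getMessage());
        }
    }
}
{code}

One caveat of the plain try/finally form: if the second shutdown also throws, its exception replaces the first one, so adding the first error as a suppressed exception (or using a swallow-style helper for exactly one of the calls) is a reasonable refinement when both failures matter.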
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800909#comment-17800909 ] Justine Olshan edited comment on KAFKA-16052 at 12/28/23 2:35 AM: -- So I realized that every test runs thousands of operations that call not only replicaManager.maybeStartTransactionVerificationForPartition but also other replicaManager methods like replicaManager appendForGroup thousands of times. I'm trying to figure out why this is causing regressions now. For the testConcurrentRandomSequence test, we call these methods approximately 35,000 times. We do 10 random sequences and 10 ordered sequences. Each sequence consists of 7 operations for the 250 group members (nthreads(5) * 5 * 10). So 20 * 7 * 250 is 35,000. Not sure if we need this many operations! was (Author: jolshan): So I realized that every test runs thousands of operations that call not only replicaManager.maybeStartTransactionVerificationForPartition but also other replicaManager methods like replicaManager appendForGroup thousands of times. I'm trying to figure out why this is causing regressions now. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. 
The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800909#comment-17800909 ] Justine Olshan commented on KAFKA-16052: So I realized that every test runs thousands of operations that call not only replicaManager.maybeStartTransactionVerificationForPartition but also other replicaManager methods like replicaManager appendForGroup thousands of times. I'm trying to figure out why this is causing regressions now. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800903#comment-17800903 ] Justine Olshan edited comment on KAFKA-16052 at 12/28/23 2:05 AM: -- It looks like the method that is actually being called is – replicaManager.maybeStartTransactionVerificationForPartition. This is overridden in TestReplicaManager to just do the callback. I'm not sure why it is storing all of these though. I will look into that next. was (Author: jolshan): It looks like the method that is actually being called is – replicaManager.maybeStartTransactionVerificationForPartition there. This is overridden to just do the callback. I'm not sure why it is storing all of these though. I will look into that next. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800903#comment-17800903 ] Justine Olshan commented on KAFKA-16052: It looks like the method that is actually being called is – replicaManager.maybeStartTransactionVerificationForPartition there. This is overridden to just do the callback. I'm not sure why it is storing all of these though. I will look into that next. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
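For readers following along, "overridden to just do the callback" means the test's replica manager short-circuits the verification path. A simplified stand-in (hypothetical names and a trimmed-down signature, not the real TestReplicaManager) looks like the sketch below, which is why the per-call work is cheap and the memory growth has to come from what the mocking framework records rather than from the verification itself:

{code:java}
import java.util.function.Consumer;

public class CallbackOnlyVerificationSketch {

    static class TestReplicaManagerStandIn {
        // No real transaction verification: immediately complete the callback.
        void maybeStartTransactionVerificationForPartition(String topicPartition,
                                                           Consumer<Throwable> callback) {
            callback.accept(null);
        }
    }

    public static void main(String[] args) {
        new TestReplicaManagerStandIn().maybeStartTransactionVerificationForPartition(
            "topic-0", error -> System.out.println("verification callback, error=" + error));
    }
}
{code}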
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800902#comment-17800902 ] Justine Olshan commented on KAFKA-16052: Thanks Divij. This might be a few things then since the handleTxnCommitOffsets is not in the TransactionCoordinator test. I will look into this though. I believe the OOMs started before I changed the transactional offset commit flow, but perhaps it is still related to something else I changed. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-15997) Ensure fairness in the uniform assignor
[ https://issues.apache.org/jira/browse/KAFKA-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800900#comment-17800900 ] Ritika Reddy edited comment on KAFKA-15997 at 12/28/23 2:35 AM: -- Hey, from what I understand you are trying to test the scenario where 8 members are subscribed to the same topic with 16 partitions. We want to test whether each member gets a balanced allocation, which would be two partitions each in this case. I have a few follow-up questions: * I see the log +16/32 partitions assigned+ : Are there 16 partitions or 32 partitions? * Is this for the first assignment, or is this testing a rebalance scenario after some subscription changes, as the test name suggests? If it's the latter, what changes were made to trigger a rebalance? * Is this topic the only subscription, or are there other topics that the members are subscribed to? was (Author: JIRAUSER300287): Hey, from what I understand you are trying to test the scenario where 8 members are subscribed to the same topic with 16 partitions. We want to test whether each member gets a balanced allocation which would be two partitions each in this case. I have a few follow up questions: * I see the log +16/32 partitions assigned+ : Are there 16 partitions or 32 partitions? * Is this for the first assignment or is this testing a rebalance scenario after some subscription changes like the test name suggests? * Is this topic the only subscription or are there other topics that the members are subscribed to? > Ensure fairness in the uniform assignor > --- > > Key: KAFKA-15997 > URL: https://issues.apache.org/jira/browse/KAFKA-15997 > Project: Kafka > Issue Type: Sub-task >Reporter: Emanuele Sabellico >Assignee: Ritika Reddy >Priority: Minor > > > > Fairness has to be ensured in uniform assignor as it was in > cooperative-sticky one. > There's this test 0113 subtest u_multiple_subscription_changes in librdkafka > where 8 consumers are subscribing to the same topic, and it's verifying that > all of them are getting 2 partitions assigned. But with new protocol it seems > two consumers get assigned 3 partitions and 1 has zero partitions. The test > doesn't configure any client.rack.
> {code:java} > [0113_cooperative_rebalance /478.183s] Consumer assignments > (subscription_variation 0) (stabilized) (no rebalance cb): > [0113_cooperative_rebalance /478.183s] Consumer C_0#consumer-3 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [5] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [8] (4000msgs) > [0113_cooperative_rebalance /478.183s] Consumer C_1#consumer-4 assignment > (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [0] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [3] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [13] (1000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_2#consumer-5 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [6] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [10] (2000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_3#consumer-6 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [7] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [9] (2000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_4#consumer-7 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [11] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [14] (3000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_5#consumer-8 assignment > (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [1] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [2] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [4] (1000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_6#consumer-9 assignment > (0): > [0113_cooperative_rebalance /478.184s] Consumer C_7#consumer-10 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [12] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [15] (1000msgs) > [0113_cooperative_rebalance /478.184s] 16/32 partitions assigned > [0113_cooperative_rebalance /478.184s] Consumer C_0#consumer-3 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_1#consumer-4 has 3 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_2#consumer-5 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_3#consumer-6 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_4#consumer-7 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2
[jira] [Commented] (KAFKA-15997) Ensure fairness in the uniform assignor
[ https://issues.apache.org/jira/browse/KAFKA-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800901#comment-17800901 ] Ritika Reddy commented on KAFKA-15997: -- In the case of the first assignment, I have written a test for this scenario and got a balanced assignment.

public void testFirstAssignmentEightMembersOneTopicNoMemberRacks() {
    Map<Uuid, TopicMetadata> topicMetadata = new HashMap<>();
    topicMetadata.put(topic1Uuid, new TopicMetadata(
        topic1Uuid,
        topic1Name,
        16,
        mkMapOfPartitionRacks(3)
    ));

    Map<String, AssignmentMemberSpec> members = new TreeMap<>();
    for (int i = 0; i < 8; i++) {
        members.put("member" + i, new AssignmentMemberSpec(
            Optional.empty(),
            Optional.empty(),
            Arrays.asList(topic1Uuid),
            Collections.emptyMap()
        ));
    }

    AssignmentSpec assignmentSpec = new AssignmentSpec(members);
    SubscribedTopicMetadata subscribedTopicMetadata = new SubscribedTopicMetadata(topicMetadata);
    GroupAssignment computedAssignment = assignor.assign(assignmentSpec, subscribedTopicMetadata);
    System.out.println(computedAssignment);

    computedAssignment.members().forEach((member, assignment) ->
        assertEquals(2, assignment.targetPartitions().get(topic1Uuid).size()));
    checkValidityAndBalance(members, computedAssignment);
}

> Ensure fairness in the uniform assignor > --- > > Key: KAFKA-15997 > URL: https://issues.apache.org/jira/browse/KAFKA-15997 > Project: Kafka > Issue Type: Sub-task >Reporter: Emanuele Sabellico >Assignee: Ritika Reddy >Priority: Minor > > > > Fairness has to be ensured in uniform assignor as it was in > cooperative-sticky one. > There's this test 0113 subtest u_multiple_subscription_changes in librdkafka > where 8 consumers are subscribing to the same topic, and it's verifying that > all of them are getting 2 partitions assigned. But with new protocol it seems > two consumers get assigned 3 partitions and 1 has zero partitions. The test > doesn't configure any client.rack.
> {code:java} > [0113_cooperative_rebalance /478.183s] Consumer assignments > (subscription_variation 0) (stabilized) (no rebalance cb): > [0113_cooperative_rebalance /478.183s] Consumer C_0#consumer-3 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [5] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [8] (4000msgs) > [0113_cooperative_rebalance /478.183s] Consumer C_1#consumer-4 assignment > (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [0] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [3] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [13] (1000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_2#consumer-5 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [6] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [10] (2000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_3#consumer-6 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [7] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [9] (2000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_4#consumer-7 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [11] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [14] (3000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_5#consumer-8 assignment > (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [1] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [2] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [4] (1000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_6#consumer-9 assignment > (0): > [0113_cooperative_rebalance /478.184s] Consumer C_7#consumer-10 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [12] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [15] (1000msgs) > [0113_cooperative_rebalance /478.184s] 16/32 partitions assigned > [0113_cooperative_rebalance /478.184s] Consumer C_0#consumer-3 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_1#consumer-4 has 3 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_2#consumer-5 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_3#consumer-6 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_4#consumer-7 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_5#consumer-8 has 3 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_6#consumer-9 has 0 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.1
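For reference, the fairness property these messages are debating can be phrased as a simple invariant: with one subscribed topic, 16 partitions, and 8 members, every member should end up with either floor(16/8) or ceil(16/8) partitions, i.e. exactly 2. The helper below is a standalone illustration of that "max minus min is at most one" check, not the checkValidityAndBalance helper from the Kafka test code:

{code:java}
import java.util.List;
import java.util.Map;

public class BalanceCheckSketch {

    // True if no member has more than one partition above the least-loaded member.
    static boolean isBalanced(Map<String, List<Integer>> assignment) {
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (List<Integer> partitions : assignment.values()) {
            min = Math.min(min, partitions.size());
            max = Math.max(max, partitions.size());
        }
        return max - min <= 1;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> assignment = Map.of(
            "member0", List.of(0, 8),  "member1", List.of(1, 9),
            "member2", List.of(2, 10), "member3", List.of(3, 11),
            "member4", List.of(4, 12), "member5", List.of(5, 13),
            "member6", List.of(6, 14), "member7", List.of(7, 15));
        System.out.println("balanced: " + isBalanced(assignment)); // true
    }
}
{code}

The assignment reported in the librdkafka test log (two members with 3 partitions, one with 0) violates this invariant, which is the behavior the ticket asks to fix.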
[jira] [Comment Edited] (KAFKA-15997) Ensure fairness in the uniform assignor
[ https://issues.apache.org/jira/browse/KAFKA-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800900#comment-17800900 ] Ritika Reddy edited comment on KAFKA-15997 at 12/28/23 1:39 AM: Hey, from what I understand you are trying to test the scenario where 8 members are subscribed to the same topic with 16 partitions. We want to test whether each member gets a balanced allocation, which would be two partitions each in this case. I have a few follow-up questions: * I see the log +16/32 partitions assigned+ : Are there 16 partitions or 32 partitions? * Is this for the first assignment, or is this testing a rebalance scenario after some subscription changes, as the test name suggests? * Is this topic the only subscription, or are there other topics that the members are subscribed to? was (Author: JIRAUSER300287): Hey, from what I understand you are trying to test the scenario where 8 members are subscribed to the same topic with 16 partitions. We want to test whether each member gets a balanced allocation which would be two partitions each in this case. I have a few follow up questions: * 16/32 partitions assigned > Ensure fairness in the uniform assignor > --- > > Key: KAFKA-15997 > URL: https://issues.apache.org/jira/browse/KAFKA-15997 > Project: Kafka > Issue Type: Sub-task >Reporter: Emanuele Sabellico >Assignee: Ritika Reddy >Priority: Minor > > > > Fairness has to be ensured in uniform assignor as it was in > cooperative-sticky one. > There's this test 0113 subtest u_multiple_subscription_changes in librdkafka > where 8 consumers are subscribing to the same topic, and it's verifying that > all of them are getting 2 partitions assigned. But with new protocol it seems > two consumers get assigned 3 partitions and 1 has zero partitions. The test > doesn't configure any client.rack.
> {code:java} > [0113_cooperative_rebalance /478.183s] Consumer assignments > (subscription_variation 0) (stabilized) (no rebalance cb): > [0113_cooperative_rebalance /478.183s] Consumer C_0#consumer-3 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [5] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [8] (4000msgs) > [0113_cooperative_rebalance /478.183s] Consumer C_1#consumer-4 assignment > (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [0] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [3] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [13] (1000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_2#consumer-5 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [6] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [10] (2000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_3#consumer-6 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [7] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [9] (2000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_4#consumer-7 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [11] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [14] (3000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_5#consumer-8 assignment > (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [1] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [2] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [4] (1000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_6#consumer-9 assignment > (0): > [0113_cooperative_rebalance /478.184s] Consumer C_7#consumer-10 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [12] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [15] (1000msgs) > [0113_cooperative_rebalance /478.184s] 16/32 partitions assigned > [0113_cooperative_rebalance /478.184s] Consumer C_0#consumer-3 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_1#consumer-4 has 3 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_2#consumer-5 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_3#consumer-6 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_4#consumer-7 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_5#consumer-8 has 3 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_6#consumer-9 has 0 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s]
[jira] [Commented] (KAFKA-15997) Ensure fairness in the uniform assignor
[ https://issues.apache.org/jira/browse/KAFKA-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800900#comment-17800900 ] Ritika Reddy commented on KAFKA-15997: -- Hey, from what I understand you are trying to test the scenario where 8 members are subscribed to the same topic with 16 partitions. We want to test whether each member gets a balanced allocation which would be two partitions each in this case. I have a few follow up questions: * 16/32 partitions assigned > Ensure fairness in the uniform assignor > --- > > Key: KAFKA-15997 > URL: https://issues.apache.org/jira/browse/KAFKA-15997 > Project: Kafka > Issue Type: Sub-task >Reporter: Emanuele Sabellico >Assignee: Ritika Reddy >Priority: Minor > > > > Fairness has to be ensured in uniform assignor as it was in > cooperative-sticky one. > There's this test 0113 subtest u_multiple_subscription_changes in librdkafka > where 8 consumers are subscribing to the same topic, and it's verifying that > all of them are getting 2 partitions assigned. But with new protocol it seems > two consumers get assigned 3 partitions and 1 has zero partitions. The test > doesn't configure any client.rack. > {code:java} > [0113_cooperative_rebalance /478.183s] Consumer assignments > (subscription_variation 0) (stabilized) (no rebalance cb): > [0113_cooperative_rebalance /478.183s] Consumer C_0#consumer-3 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [5] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [8] (4000msgs) > [0113_cooperative_rebalance /478.183s] Consumer C_1#consumer-4 assignment > (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [0] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [3] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [13] (1000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_2#consumer-5 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [6] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [10] (2000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_3#consumer-6 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [7] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [9] (2000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_4#consumer-7 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [11] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [14] (3000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_5#consumer-8 assignment > (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [1] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [2] (2000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [4] (1000msgs) > [0113_cooperative_rebalance /478.184s] Consumer C_6#consumer-9 assignment > (0): > [0113_cooperative_rebalance /478.184s] Consumer C_7#consumer-10 assignment > (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [12] (1000msgs), > rdkafkatest_rnd24419cc75e59d8de_0113u_1 [15] (1000msgs) > [0113_cooperative_rebalance /478.184s] 16/32 partitions assigned > [0113_cooperative_rebalance /478.184s] Consumer C_0#consumer-3 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_1#consumer-4 has 3 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_2#consumer-5 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_3#consumer-6 has 2 > assigned partitions (1 subscribed topic(s)), 
expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_4#consumer-7 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_5#consumer-8 has 3 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_6#consumer-9 has 0 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [0113_cooperative_rebalance /478.184s] Consumer C_7#consumer-10 has 2 > assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions > [ /479.057s] 1 test(s) running: > 0113_cooperative_rebalance > [ /480.057s] 1 test(s) running: > 0113_cooperative_rebalance > [ /481.057s] 1 test(s) running: > 0113_cooperative_rebalance > [0113_cooperative_rebalance /482.498s] TEST FAILURE > ### Test "0113_cooperative_rebalance (u_multiple_subscription_changes:2390: > use_rebalance_cb: 0, subscription_variation: 0)" failed at > test.c:1243:check_test_timeouts() at Thu Dec 7 15:52:15 2023: ### > Test 0113_cooperative_rebalance (u_multiple_subscription_chang
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800893#comment-17800893 ] Divij Vaidya edited comment on KAFKA-16052 at 12/27/23 11:23 PM: - Also, when we look at what is inside the InterceptedInvocation objects, it hints at " groupCoordinator.handleTxnCommitOffsets(member.group.groupId, "dummy-txn-id", producerId, producerEpoch, JoinGroupRequest.UNKNOWN_MEMBER_ID, Option.empty, JoinGroupRequest.UNKNOWN_GENERATION_ID, offsets, callbackWithTxnCompletion)" line. Note that there are ~700L invocations in mockito for this. (You can also play with the heap dump I linked above to find this information) !Screenshot 2023-12-28 at 00.18.56.png! was (Author: divijvaidya): Also, when we look at what is inside the InterceptedInvocation objects, it hints at " groupCoordinator.handleTxnCommitOffsets(member.group.groupId, "dummy-txn-id", producerId, producerEpoch, JoinGroupRequest.UNKNOWN_MEMBER_ID, Option.empty, JoinGroupRequest.UNKNOWN_GENERATION_ID, offsets, callbackWithTxnCompletion)" line. Note that there are ~700L invocations in mockito for this. !Screenshot 2023-12-28 at 00.18.56.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. 
The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
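A side note on where those InterceptedInvocation objects come from: Mockito records every call made on a mock so it can be verified later, and with defaultAnswer(CALLS_REAL_METHODS) those records keep accumulating even though the real methods run. The following is a minimal, self-contained sketch of that bookkeeping using only public Mockito APIs; Collaborator is a made-up class, not the Kafka group coordinator:

{code:java}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.mockingDetails;
import static org.mockito.Mockito.withSettings;

import org.mockito.Answers;
import org.mockito.Mockito;

public class MockInvocationGrowthSketch {

    // Hypothetical stand-in for the class under discussion.
    static class Collaborator {
        int doWork() { return 42; }
    }

    public static void main(String[] args) {
        Collaborator spyLike = mock(Collaborator.class,
                withSettings().defaultAnswer(Answers.CALLS_REAL_METHODS));

        // Every call on the mock is recorded so it could later be verified...
        for (int i = 0; i < 100_000; i++) {
            spyLike.doWork();
        }
        // ...which is where the retained invocation objects come from.
        System.out.println("recorded invocations: "
                + mockingDetails(spyLike).getInvocations().size());

        // Releasing the retained inline-mock state, typically done in a test tearDown.
        Mockito.framework().clearInlineMocks();
    }
}
{code}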
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800893#comment-17800893 ] Divij Vaidya commented on KAFKA-16052: -- Also, when we look at what is inside the InterceptedInvocation objects, it hints at " groupCoordinator.handleTxnCommitOffsets(member.group.groupId, "dummy-txn-id", producerId, producerEpoch, JoinGroupRequest.UNKNOWN_MEMBER_ID, Option.empty, JoinGroupRequest.UNKNOWN_GENERATION_ID, offsets, callbackWithTxnCompletion)" line. Note that there are ~700L invocations in mockito for this. !Screenshot 2023-12-28 at 00.18.56.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2023-12-28 at 00.18.56.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800892#comment-17800892 ] Divij Vaidya edited comment on KAFKA-16052 at 12/27/23 11:19 PM: - I don't think that clearing the mocks is helping here. I ran "./gradlew :core:test --tests GroupCoordinatorConcurrencyTest --rerun" and took a heap dump. The test spikes to take 778MB heap at peak. Here's the heap dump [https://www.dropbox.com/scl/fi/dttxe7e8mzifaaw57er1d/GradleWorkerMain_79331_28_12_2023_00_10_47.hprof.zip?rlkey=cg6w2e26vzprbrz92gcdhwfkd&dl=0] For reference, this is how I was clearing the mocks {code:java} @@ -89,11 +91,12 @@ class GroupCoordinatorConcurrencyTest extends AbstractCoordinatorConcurrencyTest @AfterEach override def tearDown(): Unit = { try { - if (groupCoordinator != null) - groupCoordinator.shutdown() + CoreUtils.swallow(groupCoordinator.shutdown(), this) + CoreUtils.swallow(metrics.close(), this) } finally { super.tearDown() } + Mockito.framework().clearInlineMocks() } {code} !Screenshot 2023-12-28 at 00.13.06.png! was (Author: divijvaidya): I don't think that clearing the mocks is helping here. I ran "./gradlew :core:test --tests GroupCoordinatorConcurrencyTest --rerun" and took a heap dump. The test spikes to take 778MB heap at peak. Here's the heap dump [https://www.dropbox.com/scl/fi/dttxe7e8mzifaaw57er1d/GradleWorkerMain_79331_28_12_2023_00_10_47.hprof.zip?rlkey=cg6w2e26vzprbrz92gcdhwfkd&dl=0] For reference, this is how I was clearing the mocks {code:java} @@ -89,11 +91,12 @@ class GroupCoordinatorConcurrencyTest extends AbstractCoordinatorConcurrencyTest @AfterEach override def tearDown(): Unit = { try { - if (groupCoordinator != null) - groupCoordinator.shutdown() + CoreUtils.swallow(groupCoordinator.shutdown(), this) + CoreUtils.swallow(metrics.close(), this) } finally { super.tearDown() } + Mockito.framework().clearInlineMocks() } {code} !Screenshot 2023-12-28 at 00.13.06.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800892#comment-17800892 ] Divij Vaidya commented on KAFKA-16052: -- I don't think that clearing the mocks is helping here. I ran "./gradlew :core:test --tests GroupCoordinatorConcurrencyTest --rerun" and took a heap dump. The test spikes to take 778MB heap at peak. Here's the heap dump [https://www.dropbox.com/scl/fi/dttxe7e8mzifaaw57er1d/GradleWorkerMain_79331_28_12_2023_00_10_47.hprof.zip?rlkey=cg6w2e26vzprbrz92gcdhwfkd&dl=0] For reference, this is how I was clearing the mocks {code:java} @@ -89,11 +91,12 @@ class GroupCoordinatorConcurrencyTest extends AbstractCoordinatorConcurrencyTest @AfterEach override def tearDown(): Unit = { try { - if (groupCoordinator != null) - groupCoordinator.shutdown() + CoreUtils.swallow(groupCoordinator.shutdown(), this) + CoreUtils.swallow(metrics.close(), this) } finally { super.tearDown() } + Mockito.framework().clearInlineMocks() } {code} !Screenshot 2023-12-28 at 00.13.06.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. 
The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2023-12-28 at 00.13.06.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16055) Thread unsafe use of HashMap stored in QueryableStoreProvider#storeProviders
Kohei Nozaki created KAFKA-16055: Summary: Thread unsafe use of HashMap stored in QueryableStoreProvider#storeProviders Key: KAFKA-16055 URL: https://issues.apache.org/jira/browse/KAFKA-16055 Project: Kafka Issue Type: Bug Affects Versions: 3.6.1 Reporter: Kohei Nozaki This was originally raised in [a kafka-users post|https://lists.apache.org/thread/gpct1275bfqovlckptn3lvf683qpoxol]. There is a HashMap stored in QueryableStoreProvider#storeProviders ([code link|https://github.com/apache/kafka/blob/3.6.1/streams/src/main/java/org/apache/kafka/streams/state/internals/QueryableStoreProvider.java#L39]) which can be mutated by a KafkaStreams#removeStreamThread() call. This can be problematic when KafkaStreams#store is called from a separate thread. We need to make this part of the code thread-safe, either by replacing the map with a ConcurrentHashMap and/or by using an existing locking mechanism. -- This message was sent by Atlassian Jira (v8.20.10#820010)
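A minimal sketch of the ConcurrentHashMap direction suggested in the report; the type and method names below are placeholders for illustration, not the actual Kafka Streams internals, and coarser locking may still be needed if several fields have to stay mutually consistent:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the real per-thread store provider type.
class StreamThreadStoreProviderStub { }

public class StoreProvidersSketch {

    // A ConcurrentHashMap gives thread-safe put/remove/lookup without an external lock.
    private final Map<String, StreamThreadStoreProviderStub> storeProviders =
            new ConcurrentHashMap<>();

    // Called from the thread invoking KafkaStreams#removeStreamThread().
    void removeStoreProvider(String threadName) {
        storeProviders.remove(threadName);
    }

    // Called concurrently from threads using KafkaStreams#store().
    StreamThreadStoreProviderStub storeProvider(String threadName) {
        return storeProviders.get(threadName);
    }
}
{code}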
Re: [PR] Duplicate method; The QuotaUtils one is used. [kafka]
jolshan commented on PR #15066: URL: https://github.com/apache/kafka/pull/15066#issuecomment-1870661247 @afshing -- can you link the commit that moved the code in your description? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] KAFKA-16045 Fix flaky testMigrateTopicDeletions [kafka]
mumrah opened a new pull request, #15082: URL: https://github.com/apache/kafka/pull/15082 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800880#comment-17800880 ] Justine Olshan commented on KAFKA-16052: There are only four tests in the group coordinator suite, so it is surprising to hit OOM on those. But trying the clearInlineMocks should be a low hanging fruit that could help a bit while we figure out why the mocks are causing so many issues in the first place. I have a draft PR with the change and we can look at how the builds do: https://github.com/apache/kafka/pull/15081 > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] WIP -- clear mocks in AbstractCoordinatorConcurrencyTest [kafka]
jolshan opened a new pull request, #15081: URL: https://github.com/apache/kafka/pull/15081 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800877#comment-17800877 ] Justine Olshan commented on KAFKA-16052: Ah – good point [~ijuma]. I will look into that as well. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800875#comment-17800875 ] Ismael Juma commented on KAFKA-16052: - It can be done via Mockito.framework().clearInlineMocks(). > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
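For reference, wiring that call into a tearDown is a one-liner. The sketch below shows the pattern in Java with a made-up base-class name (the actual Kafka test base class is Scala, but the Mockito call is the same):

{code:java}
import org.junit.jupiter.api.AfterEach;
import org.mockito.Mockito;

// Hypothetical test base class; only the tearDown pattern is the point here.
abstract class MockHeavyTestBase {

    @AfterEach
    void clearMocks() {
        // Releases state retained by the inline mock maker (recorded invocations,
        // stubbings, ...) so mocks from one test class cannot keep the heap pinned.
        Mockito.framework().clearInlineMocks();
    }
}
{code}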
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800874#comment-17800874 ] Ismael Juma commented on KAFKA-16052: - Have you tried clearing the mocks during tearDown? > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800865#comment-17800865 ] Justine Olshan commented on KAFKA-16052: Taking a look at removing the mock as well. There are quite a few parts of the initialization that we don't need to do for the tests. For example: val replicaFetcherManager = createReplicaFetcherManager(metrics, time, threadNamePrefix, quotaManagers.follower) private[server] val replicaAlterLogDirsManager = createReplicaAlterLogDirsManager(quotaManagers.alterLogDirs, brokerTopicStats) We can do these if necessary, but it also starts other threads that we probably don't want running for the test. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
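One way to avoid both the mock and the unwanted initialization, hinted at by the factory calls quoted above, is to override those factory methods in a test-only subclass so the heavyweight pieces and their threads are never created. A rough Java illustration of that pattern with invented names; whether it fits ReplicaManager's actual constructor is not established here, and the real methods in question are createReplicaFetcherManager and createReplicaAlterLogDirsManager in Scala:

{code:java}
// All names here are invented; this is not Kafka's ReplicaManager.
class FetcherManagerStub {
    void startThreads() { /* the production version would spawn fetcher threads */ }
    void shutdown() { }
}

class ManagerUnderTest {
    protected FetcherManagerStub createFetcherManager() {
        FetcherManagerStub manager = new FetcherManagerStub();
        manager.startThreads();
        return manager;
    }
}

// Test-only subclass: no Mockito involved and no background threads started.
class NoThreadsManagerForTest extends ManagerUnderTest {
    @Override
    protected FetcherManagerStub createFetcherManager() {
        return new FetcherManagerStub();   // never calls startThreads()
    }
}
{code}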
[jira] [Assigned] (KAFKA-16045) ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky
[ https://issues.apache.org/jira/browse/KAFKA-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Arthur reassigned KAFKA-16045: Assignee: David Arthur > ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky > - > > Key: KAFKA-16045 > URL: https://issues.apache.org/jira/browse/KAFKA-16045 > Project: Kafka > Issue Type: Test >Reporter: Justine Olshan >Assignee: David Arthur >Priority: Major > Labels: flaky-test > > I'm seeing ZkMigrationIntegrationTest.testMigrateTopicDeletion fail for many > builds. I believe it is also causing a thread leak because on most runs where > it fails I also see ReplicaManager tests also fail with extra threads. > The test always fails > `org.opentest4j.AssertionFailedError: Timed out waiting for topics to be > deleted` > gradle enterprise link: > [https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTim[…]lues=trunk&tests.container=kafka.zk.ZkMigrationIntegrationTest|https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.container=kafka.zk.ZkMigrationIntegrationTest] > recent pr: > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15023/18/tests/] > trunk builds: > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2502/tests], > > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2501/tests] > (edited) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800855#comment-17800855 ] Justine Olshan edited comment on KAFKA-16052 at 12/27/23 7:20 PM: -- Doing a quick scan at the InterceptedInvocations, I see many with the method numDelayedDelete which is a method on DelayedOperations. I will look into that avenue for a bit. was (Author: jolshan): Doing a quick scan at the InterceptedInvocations, I see many with the method numDelayedDelete which is a method on DelayedOperations. I will look into that avenue for a bit. EDIT: It doesn't seem like we mock DelayedOperations in these tests, so not sure if this is something else. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800855#comment-17800855 ] Justine Olshan edited comment on KAFKA-16052 at 12/27/23 7:19 PM: -- Doing a quick scan at the InterceptedInvocations, I see many with the method numDelayedDelete which is a method on DelayedOperations. I will look into that avenue for a bit. EDIT: It doesn't seem like we mock DelayedOperations in these tests, so not sure if this is something else. was (Author: jolshan): Doing a quick scan at the InterceptedInvocations, I see many with the method numDelayedDelete which is a method on DelayedOperations. I will look into that avenue for a bit. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800855#comment-17800855 ] Justine Olshan commented on KAFKA-16052: Doing a quick scan at the InterceptedInvocations, I see many with the method numDelayedDelete which is a method on DelayedOperations. I will look into that avenue for a bit. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800853#comment-17800853 ] Divij Vaidya commented on KAFKA-16052: -- Yes Justine, that is my current line of thought as well. The mock of TestReplicaManager is having many invocations. I am now trying to directly use TestReplicaManager instead of the mock in the test. I think that we are mocking it just to avoid calling the constructor of replicamanager (so that we can use our default purgatory). If you can get that working, it would be awesome! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800852#comment-17800852 ] Justine Olshan commented on KAFKA-16052: I can also take a look at the heap dump if that is helpful. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800851#comment-17800851 ] Justine Olshan commented on KAFKA-16052: Thanks Divij for the digging. These tests share the common AbstractCoordinatorCouncurrencyTest. I wonder if the mock in question is there. We do mock replica manager, but in an interesting way replicaManager = mock(classOf[TestReplicaManager], withSettings().defaultAnswer(CALLS_REAL_METHODS)) > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
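One Mockito setting worth noting in this context: withSettings().stubOnly() creates a mock that does not record invocations at all, which sidesteps the kind of accumulation seen in the heap dumps. Whether the Kafka test can give up verify() on this mock is a separate question; the sketch below only demonstrates the API on a made-up stand-in class:

{code:java}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.withSettings;

import org.mockito.Answers;

public class StubOnlySketch {

    // Hypothetical stand-in for the Scala TestReplicaManager used in the test.
    static class TestReplicaManagerLike {
        int someDelegatingMethod() { return 0; }
    }

    public static void main(String[] args) {
        // Same CALLS_REAL_METHODS idea as in the quoted test, but with stubOnly():
        // the mock does not record invocations, so nothing piles up on the heap.
        // The trade-off is that verify(...) can no longer be used on this mock.
        TestReplicaManagerLike rm = mock(TestReplicaManagerLike.class,
                withSettings().stubOnly().defaultAnswer(Answers.CALLS_REAL_METHODS));
        rm.someDelegatingMethod();
    }
}
{code}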
Re: [PR] KAFKA-16047: Leverage the fenceProducers timeout in the InitProducerId [kafka]
gharris1727 commented on code in PR #15078: URL: https://github.com/apache/kafka/pull/15078#discussion_r1437200184 ## clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java: ## @@ -82,9 +86,10 @@ InitProducerIdRequest.Builder buildSingleRequest(int brokerId, CoordinatorKey ke .setProducerEpoch(ProducerIdAndEpoch.NONE.epoch) .setProducerId(ProducerIdAndEpoch.NONE.producerId) .setTransactionalId(key.idValue) -// Set transaction timeout to 1 since it's only being initialized to fence out older producers with the same transactional ID, -// and shouldn't be used for any actual record writes -.setTransactionTimeoutMs(1); +// Set transaction timeout to the equivalent as the fenceProducers request since it's only being initialized to fence out older producers with the same transactional ID, +// and shouldn't be used for any actual record writes. This has been changed to match the fenceProducers request timeout from one as some brokers may be slower than expected +// and we need a safe timeout that allows the transaction init to finish. +.setTransactionTimeoutMs(this.options.timeoutMs()); Review Comment: The timeoutMs is nullable, and is always null when using the default FenceProducersOptions. When it is null we should use the KafkaAdminClient#defaultApiTimeoutMs, similar to deleteRecords or calcDeadlineMs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
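A minimal sketch of the null-check fallback described in the review comment above, assuming a nullable per-call timeout (as FenceProducersOptions.timeoutMs() is described) and a client-level default; the surrounding method is hypothetical, not the actual FenceProducersHandler change:
{code:java}
public class TimeoutFallbackSketch {

    // Pick the transaction timeout for the InitProducerId request: use the per-call
    // option when the caller set one, otherwise fall back to the admin client's
    // default API timeout (default.api.timeout.ms).
    static int effectiveTransactionTimeoutMs(Integer optionTimeoutMs, int defaultApiTimeoutMs) {
        return optionTimeoutMs != null ? optionTimeoutMs : defaultApiTimeoutMs;
    }

    public static void main(String[] args) {
        // Default options: timeoutMs() is null, so the client default wins.
        System.out.println(effectiveTransactionTimeoutMs(null, 60_000));   // 60000
        // Caller-supplied timeout takes precedence.
        System.out.println(effectiveTransactionTimeoutMs(15_000, 60_000)); // 15000
    }
}
{code}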
[jira] [Commented] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800845#comment-17800845 ] Ron Dagostino commented on KAFKA-15495: --- Thanks, [~jsancio]. I've updated the title and description to make it clear this is a general problem as opposed to being KRaft-specific, and I've indicated it affects all released versions back to 1.0.0. I've also linked it to the ELR ticket at https://issues.apache.org/jira/browse/KAFKA-15332. > Partition truncated when the only ISR member restarts with an empty disk > > > Key: KAFKA-15495 > URL: https://issues.apache.org/jira/browse/KAFKA-15495 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 2.0.0, 2.0.1, 2.1.0, > 2.2.0, 2.1.1, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1, 2.5.0, 2.4.1, 2.6.0, 2.5.1, > 2.7.0, 2.6.1, 2.8.0, 2.7.1, 2.6.2, 3.1.0, 2.6.3, 2.7.2, 2.8.1, 3.0.0, 3.0.1, > 2.8.2, 3.2.0, 3.1.1, 3.3.0, 3.0.2, 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, > 3.3.2, 3.5.0, 3.4.1, 3.6.0, 3.5.1, 3.5.2, 3.6.1 >Reporter: Ron Dagostino >Priority: Critical > > Assume a topic-partition has just a single leader replica in the ISR. Assume > next that this replica goes offline. This replica's log will define the > contents of that partition when the replica restarts, which is correct > behavior. However, assume now that the replica has a disk failure, and we > then replace the failed disk with a new, empty disk that we also format with > the storage tool so it has the correct cluster ID. If we then restart the > broker, the topic-partition will have no data in it, and any other replicas > that might exist will truncate their logs to match, which results in data > loss. See below for a step-by-step demo of how to reproduce this using KRaft > (the issue impacts ZK-based implementations as well, but we supply only a > KRaft-based reproduce case here): > Note that implementing Eligible leader Replicas > (https://issues.apache.org/jira/browse/KAFKA-15332) will resolve this issue. > STEPS TO REPRODUCE: > Create a single broker cluster with single controller. 
The standard files > under config/kraft work well: > bin/kafka-storage.sh random-uuid > J8qXRwI-Qyi2G0guFTiuYw > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/controller.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker.properties > bin/kafka-server-start.sh config/kraft/controller.properties > bin/kafka-server-start.sh config/kraft/broker.properties > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 > --partitions 1 --replication-factor 1 > #create __consumer-offsets topics > bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 > --from-beginning > ^C > #confirm that __consumer_offsets topic partitions are all created and on > broker with node id 2 > bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > Now create 2 more brokers, with node IDs 11 and 12 > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed > 's/localhost:9092/localhost:9011/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > > config/kraft/broker11.properties > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed > 's/localhost:9092/localhost:9012/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > > config/kraft/broker12.properties > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker11.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker12.properties > bin/kafka-server-start.sh config/kraft/broker11.properties > bin/kafka-server-start.sh config/kraft/broker12.properties > #create a topic with a single partition replicated on two brokers > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 > --partitions 1 --replication-factor 2 > #reassign partitions onto brokers with Node IDs 11 and 12 > echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], > "version":1}' > /tmp/reassign.json > bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 > --reassignment-json-file /tmp/reassign.json --execute > bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 > --reassignment-json-file /tmp/reassign.json --verify > #make preferred leader 11 the actual leader if it not > bin/kafka-leader-election.sh --bootstrap-server localhos
[jira] [Commented] (KAFKA-14132) Remaining PowerMock to Mockito tests
[ https://issues.apache.org/jira/browse/KAFKA-14132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800833#comment-17800833 ] Christo Lolov commented on KAFKA-14132: --- Heya [~enether], thanks for checking in on this one! I need to do another pass to see whether everything with PowerMock has indeed been moved - I still have a couple of classes with EasyMock which need to be moved and given the timeline that won't happen for 3.7. What is left here, which I haven't documented, is that the PowerMock dependency needs to be removed from Kafka and the build script needs to be changed! > Remaining PowerMock to Mockito tests > > > Key: KAFKA-14132 > URL: https://issues.apache.org/jira/browse/KAFKA-14132 > Project: Kafka > Issue Type: Sub-task >Reporter: Christo Lolov >Assignee: Christo Lolov >Priority: Major > Fix For: 3.8.0 > > > {color:#de350b}Some of the tests below use EasyMock as well. For those > migrate both PowerMock and EasyMock to Mockito.{color} > Unless stated in brackets the tests are in the connect module. > A list of tests which still require to be moved from PowerMock to Mockito as > of 2nd of August 2022 which do not have a Jira issue and do not have pull > requests I am aware of which are opened: > {color:#ff8b00}InReview{color} > {color:#00875a}Merged{color} > # {color:#00875a}ErrorHandlingTaskTest{color} (owner: [~shekharrajak]) > # {color:#00875a}SourceTaskOffsetCommiterTest{color} (owner: Christo) > # {color:#00875a}WorkerMetricsGroupTest{color} (owner: Divij) > # {color:#00875a}WorkerTaskTest{color} (owner: [~yash.mayya]) > # {color:#00875a}ErrorReporterTest{color} (owner: [~yash.mayya]) > # {color:#00875a}RetryWithToleranceOperatorTest{color} (owner: [~yash.mayya]) > # {color:#00875a}WorkerErrantRecordReporterTest{color} (owner: [~yash.mayya]) > # {color:#00875a}ConnectorsResourceTest{color} (owner: [~mdedetrich-aiven]) > # {color:#00875a}StandaloneHerderTest{color} (owner: [~mdedetrich-aiven]) > # KafkaConfigBackingStoreTest (owner: [~bachmanity1]) > # {color:#00875a}KafkaOffsetBackingStoreTest{color} (owner: Christo) > ([https://github.com/apache/kafka/pull/12418]) > # {color:#00875a}KafkaBasedLogTest{color} (owner: @bachmanity ]) > # {color:#00875a}RetryUtilTest{color} (owner: [~yash.mayya]) > # {color:#00875a}RepartitionTopicTest{color} (streams) (owner: Christo) > # {color:#00875a}StateManagerUtilTest{color} (streams) (owner: Christo) > *The coverage report for the above tests after the change should be >= to > what the coverage is now.* -- This message was sent by Atlassian Jira (v8.20.10#820010)
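For context on what the remaining EasyMock/PowerMock migrations involve, a rough before/after sketch is shown below; the OffsetStore interface and values are hypothetical and not taken from any of the listed test classes.
{code:java}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

public class MigrationSketch {

    // Hypothetical collaborator, standing in for whatever the real test mocks.
    interface OffsetStore {
        long latestOffset(String topic);
    }

    public static void main(String[] args) {
        // EasyMock style (the kind of code being removed):
        //   OffsetStore store = EasyMock.createMock(OffsetStore.class);
        //   EasyMock.expect(store.latestOffset("foo")).andReturn(42L);
        //   EasyMock.replay(store);
        //   ... exercise the code under test ...
        //   EasyMock.verify(store);

        // Mockito style (the migration target): no replay/verify-all phase,
        // stubbing and verification are separate and explicit.
        OffsetStore store = mock(OffsetStore.class);
        when(store.latestOffset("foo")).thenReturn(42L);

        long offset = store.latestOffset("foo"); // the code under test would do this
        System.out.println(offset);              // prints 42

        verify(store).latestOffset("foo");
    }
}
{code}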
[jira] [Commented] (KAFKA-15147) Measure pending and outstanding Remote Segment operations
[ https://issues.apache.org/jira/browse/KAFKA-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800832#comment-17800832 ] Christo Lolov commented on KAFKA-15147: --- Heya [~enether]! I believe all the work that we wanted to do in order to close KIP-963 is done. The only remaining bit is to update the JMX documentation on the Apache Kafka website, but as far as I am aware that is in a different repository and I was planning on doing it after New Years. Is this suitable for the 3.7 release or should I try to find time in the next couple of days to get the documentation sorted? > Measure pending and outstanding Remote Segment operations > - > > Key: KAFKA-15147 > URL: https://issues.apache.org/jira/browse/KAFKA-15147 > Project: Kafka > Issue Type: Improvement > Components: core >Reporter: Jorge Esteban Quilcate Otoya >Assignee: Christo Lolov >Priority: Major > Labels: tiered-storage > Fix For: 3.7.0 > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Upload+and+delete+lag+metrics+in+Tiered+Storage > > KAFKA-15833: RemoteCopyLagBytes > KAFKA-16002: RemoteCopyLagSegments, RemoteDeleteLagBytes, > RemoteDeleteLagSegments > KAFKA-16013: ExpiresPerSec > KAFKA-16014: RemoteLogSizeComputationTime, RemoteLogSizeBytes, > RemoteLogMetadataCount > KAFKA-15158: RemoteDeleteRequestsPerSec, RemoteDeleteErrorsPerSec, > BuildRemoteLogAuxStateRequestsPerSec, BuildRemoteLogAuxStateErrorsPerSec > > Remote Log Segment operations (copy/delete) are executed by the Remote > Storage Manager, and recorded by Remote Log Metadata Manager (e.g. default > TopicBasedRLMM writes to the internal Kafka topic state changes on remote log > segments). > As executions run, fail, and retry; it will be important to know how many > operations are pending and outstanding over time to alert operators. > Pending operations are not enough to alert, as values can oscillate closer to > zero. An additional condition needs to apply (running time > threshold) to > consider an operation outstanding. > Proposal: > RemoteLogManager could be extended with 2 concurrent maps > (pendingSegmentCopies, pendingSegmentDeletes) `Map[Uuid, Long]` to measure > segmentId time when operation started, and based on this expose 2 metrics per > operation: > * pendingSegmentCopies: gauge of pendingSegmentCopies map > * outstandingSegmentCopies: loop over pending ops, and if now - startedTime > > timeout, then outstanding++ (maybe on debug level?) > Is this a valuable metric to add to Tiered Storage? or better to solve on a > custom RLMM implementation? > Also, does it require a KIP? > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010)
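A minimal sketch of the bookkeeping proposed in the issue description above, using java.util.UUID in place of Kafka's Uuid and plain methods in place of the Kafka metrics API; this illustrates the idea only and is not the code that shipped with the KIP-963 sub-tasks.
{code:java}
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class RemoteCopyLagSketch {

    // segmentId -> wall-clock time (ms) at which the copy operation started
    private final ConcurrentMap<UUID, Long> pendingSegmentCopies = new ConcurrentHashMap<>();

    void copyStarted(UUID segmentId, long nowMs) {
        pendingSegmentCopies.put(segmentId, nowMs);
    }

    void copyFinished(UUID segmentId) {
        pendingSegmentCopies.remove(segmentId);
    }

    // Gauge #1: how many copies are currently pending.
    int pendingCopyCount() {
        return pendingSegmentCopies.size();
    }

    // Gauge #2: how many of those have been running longer than the threshold,
    // i.e. are "outstanding" and worth alerting on.
    long outstandingCopyCount(long nowMs, long timeoutMs) {
        return pendingSegmentCopies.values().stream()
                .filter(startedMs -> nowMs - startedMs > timeoutMs)
                .count();
    }

    public static void main(String[] args) {
        RemoteCopyLagSketch lag = new RemoteCopyLagSketch();
        UUID segment = UUID.randomUUID();
        lag.copyStarted(segment, 0L);
        System.out.println(lag.pendingCopyCount());                 // 1
        System.out.println(lag.outstandingCopyCount(5_000, 1_000)); // 1 (running longer than 1s)
    }
}
{code}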
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
jolshan commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1437157116 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -2639,59 +2639,65 @@ class ReplicaManagerTest { time = time, scheduler = time.scheduler, logManager = logManager, - quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""), Review Comment: Was there a reason we created a separate one here? I noticed that we change the config above if zkMigration is enabled, but I'm not sure if that makes a difference. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
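The broader pattern this PR applies - construct thread-spawning helpers once per test and close them in teardown so their threads do not leak - looks roughly like the JUnit 5 sketch below; ThreadSpawningHelper is a hypothetical stand-in, not the actual QuotaManagers class.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

public class ResourceLifecycleSketchTest {

    // Hypothetical stand-in for a helper (such as a quota manager) that owns
    // threads which keep running until it is explicitly closed.
    static class ThreadSpawningHelper implements AutoCloseable {
        private final ExecutorService executor = Executors.newSingleThreadExecutor();

        @Override
        public void close() {
            executor.shutdownNow(); // without this, the worker thread leaks across tests
        }
    }

    private ThreadSpawningHelper helper;

    @BeforeEach
    void setUp() {
        helper = new ThreadSpawningHelper();
    }

    @AfterEach
    void tearDown() {
        helper.close(); // the step that thread-leaking tests tend to miss
    }

    @Test
    void somethingUsingTheHelper() {
        // ... exercise the code under test with the shared helper ...
    }
}
{code}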
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800828#comment-17800828 ] Divij Vaidya commented on KAFKA-16052: -- Digging into the new heap dump after disabling GroupCoordinatorConcurrencyTest, I found another culprit. The sister test of GroupCoordinatorConcurrencyTest i.e. TransactionCoordinatorConcurrencyTest. !Screenshot 2023-12-27 at 17.44.09.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
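If the mock-heavy concurrency tests are indeed retaining heap, one general mitigation (an assumption here, not something decided on this ticket, and only relevant when Mockito's inline mock maker is in use) is to clear Mockito's internally retained state between tests:
{code:java}
import org.junit.jupiter.api.AfterEach;
import org.mockito.Mockito;

public class MockCleanupSketch {

    @AfterEach
    void releaseMockitoState() {
        // Drops the inline mock maker's references to mocks created so far; this is
        // Mockito's documented way to limit memory build-up in suites that create
        // very large numbers of mocks.
        Mockito.framework().clearInlineMocks();
    }
}
{code}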
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2023-12-27 at 17.44.09.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [KAFKA-16015] Fix custom timeouts overwritten by defaults [kafka]
sciclon2 commented on code in PR #15030: URL: https://github.com/apache/kafka/pull/15030#discussion_r1437137364 ## tools/src/main/java/org/apache/kafka/tools/LeaderElectionCommand.java: ## @@ -99,8 +99,12 @@ static void run(Duration timeoutMs, String... args) throws Exception { props.putAll(Utils.loadProps(commandOptions.getAdminClientConfig())); } props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, commandOptions.getBootstrapServer()); -props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis())); -props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis() / 2)); +if (!props.containsKey(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG)) { +props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis())); +} +if (!props.containsKey(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG)) { +props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis() / 2)); Review Comment: I will need to convert the timeoutMs.toMillis() to INT as the return value is long, this comes from [here](https://github.com/apache/kafka/pull/15030/files#diff-7f2ce49cbfada7fbfff8453254f686d835710e2ee620b8905670e69ceaaa1809R66) I will convert it in Int then like this `props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, Integer.toString((int)timeoutMs.toMillis()));` agree? The change will be for both configs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
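On the long-to-int conversion discussed above: a raw (int) cast silently wraps if the value ever exceeds Integer.MAX_VALUE, whereas Math.toIntExact fails loudly. The sketch below is only a suggestion for comparison, not what the PR settled on.
{code:java}
import java.time.Duration;

public class TimeoutCastSketch {
    public static void main(String[] args) {
        Duration timeoutMs = Duration.ofSeconds(30);

        // Raw cast, as proposed above: fine for realistic timeouts, but it silently
        // wraps around if the duration ever exceeds Integer.MAX_VALUE milliseconds.
        int viaCast = (int) timeoutMs.toMillis();

        // Math.toIntExact throws ArithmeticException on overflow instead of wrapping.
        int viaExact = Math.toIntExact(timeoutMs.toMillis());

        System.out.println(viaCast + " " + viaExact); // 30000 30000
    }
}
{code}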
[jira] [Assigned] (KAFKA-16043) Add Quota configuration for topics
[ https://issues.apache.org/jira/browse/KAFKA-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Afshin Moazami reassigned KAFKA-16043: -- Assignee: Afshin Moazami > Add Quota configuration for topics > -- > > Key: KAFKA-16043 > URL: https://issues.apache.org/jira/browse/KAFKA-16043 > Project: Kafka > Issue Type: New Feature >Reporter: Afshin Moazami >Assignee: Afshin Moazami >Priority: Major > > To be able to have topic-partition quotas, we need to introduce two topic > configurations for the producer-byte-rate and consumer-byte-rate. > The assumption is that all partitions of the same topic get the same quota, > so we define one config per topic. > These configurations should work with both ZooKeeper and KRaft setups. > Also, we should define a default quota value (to be discussed) and > potentially use the same format as the user/client default configuration, using > `` as the value. -- This message was sent by Atlassian Jira (v8.20.10#820010)
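For a sense of how such topic-level quota settings would be applied once they exist, the standard incrementalAlterConfigs path could be used. In the sketch below the quota key names are the ones proposed in this ticket and do not exist in released Kafka; everything else is the regular Admin client API.
{code:java}
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TopicQuotaConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "foo");
            // "producer-byte-rate" / "consumer-byte-rate" are the topic-level quota
            // keys proposed in this ticket; they are not yet a released Kafka feature.
            List<AlterConfigOp> ops = List.of(
                    new AlterConfigOp(new ConfigEntry("producer-byte-rate", "1048576"),
                            AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("consumer-byte-rate", "2097152"),
                            AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
{code}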
[jira] [Assigned] (KAFKA-16044) Throttling using Topic Partition Quota
[ https://issues.apache.org/jira/browse/KAFKA-16044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Afshin Moazami reassigned KAFKA-16044: -- Assignee: Afshin Moazami > Throttling using Topic Partition Quota > --- > > Key: KAFKA-16044 > URL: https://issues.apache.org/jira/browse/KAFKA-16044 > Project: Kafka > Issue Type: New Feature >Reporter: Afshin Moazami >Assignee: Afshin Moazami >Priority: Major > > With KAFKA-16042 introducing the topic-partition byte rate and metrics, and > KAFKA-16043 introducing the quota limit configuration at the topic level, we > can enforce quotas at the topic-partition level for configured topics. > More details in > [KIP-1010|https://cwiki.apache.org/confluence/display/KAFKA/KIP-1010%3A+Topic+Partition+Quota] -- This message was sent by Atlassian Jira (v8.20.10#820010)
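As a rough illustration of what enforcement means at the partition level, the sketch below shows generic rate-versus-quota arithmetic of the kind a quota manager performs; it is not Kafka's actual quota manager code and the window handling is simplified.
{code:java}
public class ThrottleSketch {

    // Generic quota check: if the observed byte rate for a topic-partition exceeds its
    // quota, return a delay that brings the average rate back under the quota over the
    // measurement window; otherwise do not throttle.
    static long throttleTimeMs(double observedBytesPerSec, double quotaBytesPerSec, long windowMs) {
        if (observedBytesPerSec <= quotaBytesPerSec) {
            return 0L;
        }
        return (long) ((observedBytesPerSec - quotaBytesPerSec) / quotaBytesPerSec * windowMs);
    }

    public static void main(String[] args) {
        // 2 MB/s observed against a 1 MB/s quota over a 1-second window -> 1000 ms throttle.
        System.out.println(throttleTimeMs(2_000_000, 1_000_000, 1_000));
    }
}
{code}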
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Description: Assume a topic-partition has just a single leader replica in the ISR. Assume next that this replica goes offline. This replica's log will define the contents of that partition when the replica restarts, which is correct behavior. However, assume now that the replica has a disk failure, and we then replace the failed disk with a new, empty disk that we also format with the storage tool so it has the correct cluster ID. If we then restart the broker, the topic-partition will have no data in it, and any other replicas that might exist will truncate their logs to match, which results in data loss. See below for a step-by-step demo of how to reproduce this using KRaft (the issue impacts ZK-based implementations as well, but we supply only a KRaft-based reproduce case here): Note that implementing Eligible leader Replicas (https://issues.apache.org/jira/browse/KAFKA-15332) will resolve this issue. STEPS TO REPRODUCE: Create a single broker cluster with single controller. The standard files under config/kraft work well: bin/kafka-storage.sh random-uuid J8qXRwI-Qyi2G0guFTiuYw #ensure we start clean /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/controller.properties bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker.properties bin/kafka-server-start.sh config/kraft/controller.properties bin/kafka-server-start.sh config/kraft/broker.properties bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 --partitions 1 --replication-factor 1 #create __consumer-offsets topics bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 --from-beginning ^C #confirm that __consumer_offsets topic partitions are all created and on broker with node id 2 bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe Now create 2 more brokers, with node IDs 11 and 12 cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 's/localhost:9092/localhost:9011/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > config/kraft/broker11.properties cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 's/localhost:9092/localhost:9012/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > config/kraft/broker12.properties #ensure we start clean /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker11.properties bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker12.properties bin/kafka-server-start.sh config/kraft/broker11.properties bin/kafka-server-start.sh config/kraft/broker12.properties #create a topic with a single partition replicated on two brokers bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 --partitions 1 --replication-factor 2 #reassign partitions onto brokers with Node IDs 11 and 12 echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], "version":1}' > /tmp/reassign.json bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file /tmp/reassign.json --execute bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file 
/tmp/reassign.json --verify #make preferred leader 11 the actual leader if it not bin/kafka-leader-election.sh --bootstrap-server localhost:9092 --all-topic-partitions --election-type preferred #Confirm both brokers are in ISR and 11 is the leader bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic foo2 Topic: foo2 TopicId: pbbQZ23UQ5mQqmZpoSRCLQ PartitionCount: 1 ReplicationFactor: 2Configs: segment.bytes=1073741824 Topic: foo2 Partition: 0Leader: 11 Replicas: 11,12 Isr: 12,11 #Emit some messages to the topic bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic foo2 1 2 3 4 5 ^C #confirm we see the messages bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo2 --from-beginning 1 2 3 4 5 ^C #Again confirm both brokers are in ISR, leader is 11 bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic foo2 Topic: foo2 TopicId: pbbQZ23UQ5mQqmZpoSRCLQ PartitionCount: 1 ReplicationFactor: 2Configs: segment.bytes=1073741824 Topic: foo2 Partition: 0Leader: 11 Replicas: 11,12 Isr: 12,11 #kill non-leader broker bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic foo2 Topic: foo2 TopicId: pbbQZ23UQ5mQqmZpoSRCLQ PartitionCount: 1 ReplicationFactor: 2
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Affects Version/s: 2.2.1 2.3.0 2.1.1 2.2.0 2.1.0 2.0.1 2.0.0 1.1.1 1.1.0 1.0.2 1.0.1 1.0.0 > Partition truncated when the only ISR member restarts with an empty disk > > > Key: KAFKA-15495 > URL: https://issues.apache.org/jira/browse/KAFKA-15495 > Project: Kafka > Issue Type: Bug >Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 2.0.0, 2.0.1, 2.1.0, > 2.2.0, 2.1.1, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1, 2.5.0, 2.4.1, 2.6.0, 2.5.1, > 2.7.0, 2.6.1, 2.8.0, 2.7.1, 2.6.2, 3.1.0, 2.6.3, 2.7.2, 2.8.1, 3.0.0, 3.0.1, > 2.8.2, 3.2.0, 3.1.1, 3.3.0, 3.0.2, 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, > 3.3.2, 3.5.0, 3.4.1, 3.6.0, 3.5.1, 3.5.2, 3.6.1 >Reporter: Ron Dagostino >Priority: Critical > > Assume a topic-partition in KRaft has just a single leader replica in the > ISR. Assume next that this replica goes offline. This replica's log will > define the contents of that partition when the replica restarts, which is > correct behavior. However, assume now that the replica has a disk failure, > and we then replace the failed disk with a new, empty disk that we also > format with the storage tool so it has the correct cluster ID. If we then > restart the broker, the topic-partition will have no data in it, and any > other replicas that might exist will truncate their logs to match, which > results in data loss. See below for a step-by-step demo of how to reproduce > this. > [KIP-858: Handle JBOD broker disk failure in > KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] > introduces the concept of a Disk UUID that we can use to solve this problem. > Specifically, when the leader restarts with an empty (but > correctly-formatted) disk, the actual UUID associated with the disk will be > different. The controller will notice upon broker re-registration that its > disk UUID differs from what was previously registered. Right now we have no > way of detecting this situation, but the disk UUID gives us that capability. > STEPS TO REPRODUCE: > Create a single broker cluster with single controller. 
The standard files > under config/kraft work well: > bin/kafka-storage.sh random-uuid > J8qXRwI-Qyi2G0guFTiuYw > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/controller.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker.properties > bin/kafka-server-start.sh config/kraft/controller.properties > bin/kafka-server-start.sh config/kraft/broker.properties > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 > --partitions 1 --replication-factor 1 > #create __consumer-offsets topics > bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 > --from-beginning > ^C > #confirm that __consumer_offsets topic partitions are all created and on > broker with node id 2 > bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > Now create 2 more brokers, with node IDs 11 and 12 > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed > 's/localhost:9092/localhost:9011/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > > config/kraft/broker11.properties > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed > 's/localhost:9092/localhost:9012/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > > config/kraft/broker12.properties > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker11.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker12.properties > bin/kafka-server-start.sh config/kraft/broker11.properties > bin/kafka-server-start.sh config/kraft/broker12.properties > #create a topic with a single partition replicated on two brokers > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 > --partitions 1 --replication-factor 2 > #reassign partitions onto brokers with Node IDs 11 and 12 > echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], > "version":1}' > /tmp/reassign.json > bi
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Affects Version/s: 2.7.2 2.6.3 3.1.0 2.6.2 2.7.1 2.8.0 2.6.1 2.7.0 2.5.1 2.6.0 2.4.1 2.5.0 2.3.1 2.4.0 2.2.2 > Partition truncated when the only ISR member restarts with an empty disk > > > Key: KAFKA-15495 > URL: https://issues.apache.org/jira/browse/KAFKA-15495 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.2.2, 2.4.0, 2.3.1, 2.5.0, 2.4.1, 2.6.0, 2.5.1, 2.7.0, > 2.6.1, 2.8.0, 2.7.1, 2.6.2, 3.1.0, 2.6.3, 2.7.2, 2.8.1, 3.0.0, 3.0.1, 2.8.2, > 3.2.0, 3.1.1, 3.3.0, 3.0.2, 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, 3.3.2, > 3.5.0, 3.4.1, 3.6.0, 3.5.1, 3.5.2, 3.6.1 >Reporter: Ron Dagostino >Priority: Critical > > Assume a topic-partition in KRaft has just a single leader replica in the > ISR. Assume next that this replica goes offline. This replica's log will > define the contents of that partition when the replica restarts, which is > correct behavior. However, assume now that the replica has a disk failure, > and we then replace the failed disk with a new, empty disk that we also > format with the storage tool so it has the correct cluster ID. If we then > restart the broker, the topic-partition will have no data in it, and any > other replicas that might exist will truncate their logs to match, which > results in data loss. See below for a step-by-step demo of how to reproduce > this. > [KIP-858: Handle JBOD broker disk failure in > KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] > introduces the concept of a Disk UUID that we can use to solve this problem. > Specifically, when the leader restarts with an empty (but > correctly-formatted) disk, the actual UUID associated with the disk will be > different. The controller will notice upon broker re-registration that its > disk UUID differs from what was previously registered. Right now we have no > way of detecting this situation, but the disk UUID gives us that capability. > STEPS TO REPRODUCE: > Create a single broker cluster with single controller. 
The standard files > under config/kraft work well: > bin/kafka-storage.sh random-uuid > J8qXRwI-Qyi2G0guFTiuYw > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/controller.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker.properties > bin/kafka-server-start.sh config/kraft/controller.properties > bin/kafka-server-start.sh config/kraft/broker.properties > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 > --partitions 1 --replication-factor 1 > #create __consumer-offsets topics > bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 > --from-beginning > ^C > #confirm that __consumer_offsets topic partitions are all created and on > broker with node id 2 > bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > Now create 2 more brokers, with node IDs 11 and 12 > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed > 's/localhost:9092/localhost:9011/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > > config/kraft/broker11.properties > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed > 's/localhost:9092/localhost:9012/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > > config/kraft/broker12.properties > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker11.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker12.properties > bin/kafka-server-start.sh config/kraft/broker11.properties > bin/kafka-server-start.sh config/kraft/broker12.properties > #create a topic with a single partition replicated on two brokers > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 > --partitions 1 --replication-factor 2 > #reassign partitions onto brokers with Node IDs 11 and 12 > echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], > "version":1}' > /tmp/reassign.json > bi
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Affects Version/s: 3.2.1 3.1.2 3.0.2 3.3.0 3.1.1 3.2.0 2.8.2 3.0.1 3.0.0 2.8.1 > Partition truncated when the only ISR member restarts with an empty disk > > > Key: KAFKA-15495 > URL: https://issues.apache.org/jira/browse/KAFKA-15495 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.8.1, 3.0.0, 3.0.1, 2.8.2, 3.2.0, 3.1.1, 3.3.0, 3.0.2, > 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, 3.3.2, 3.5.0, 3.4.1, 3.6.0, 3.5.1, > 3.5.2, 3.6.1 >Reporter: Ron Dagostino >Priority: Critical > > Assume a topic-partition in KRaft has just a single leader replica in the > ISR. Assume next that this replica goes offline. This replica's log will > define the contents of that partition when the replica restarts, which is > correct behavior. However, assume now that the replica has a disk failure, > and we then replace the failed disk with a new, empty disk that we also > format with the storage tool so it has the correct cluster ID. If we then > restart the broker, the topic-partition will have no data in it, and any > other replicas that might exist will truncate their logs to match, which > results in data loss. See below for a step-by-step demo of how to reproduce > this. > [KIP-858: Handle JBOD broker disk failure in > KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] > introduces the concept of a Disk UUID that we can use to solve this problem. > Specifically, when the leader restarts with an empty (but > correctly-formatted) disk, the actual UUID associated with the disk will be > different. The controller will notice upon broker re-registration that its > disk UUID differs from what was previously registered. Right now we have no > way of detecting this situation, but the disk UUID gives us that capability. > STEPS TO REPRODUCE: > Create a single broker cluster with single controller. 
The standard files > under config/kraft work well: > bin/kafka-storage.sh random-uuid > J8qXRwI-Qyi2G0guFTiuYw > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/controller.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker.properties > bin/kafka-server-start.sh config/kraft/controller.properties > bin/kafka-server-start.sh config/kraft/broker.properties > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 > --partitions 1 --replication-factor 1 > #create __consumer-offsets topics > bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 > --from-beginning > ^C > #confirm that __consumer_offsets topic partitions are all created and on > broker with node id 2 > bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > Now create 2 more brokers, with node IDs 11 and 12 > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed > 's/localhost:9092/localhost:9011/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > > config/kraft/broker11.properties > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed > 's/localhost:9092/localhost:9012/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > > config/kraft/broker12.properties > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker11.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker12.properties > bin/kafka-server-start.sh config/kraft/broker11.properties > bin/kafka-server-start.sh config/kraft/broker12.properties > #create a topic with a single partition replicated on two brokers > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 > --partitions 1 --replication-factor 2 > #reassign partitions onto brokers with Node IDs 11 and 12 > echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], > "version":1}' > /tmp/reassign.json > bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 > --reassignment-json-file /tmp/reassign.json --execute > bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 > --reassignment-json-file /tmp/reassign.json --verify > #mak
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Affects Version/s: 3.6.1 3.5.2 3.5.0 3.3.1 3.2.3 3.2.2 3.4.0 > Partition truncated when the only ISR member restarts with an empty disk > > > Key: KAFKA-15495 > URL: https://issues.apache.org/jira/browse/KAFKA-15495 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.4.0, 3.2.2, 3.2.3, 3.3.1, 3.3.2, 3.5.0, 3.4.1, 3.6.0, > 3.5.1, 3.5.2, 3.6.1 >Reporter: Ron Dagostino >Priority: Critical > > Assume a topic-partition in KRaft has just a single leader replica in the > ISR. Assume next that this replica goes offline. This replica's log will > define the contents of that partition when the replica restarts, which is > correct behavior. However, assume now that the replica has a disk failure, > and we then replace the failed disk with a new, empty disk that we also > format with the storage tool so it has the correct cluster ID. If we then > restart the broker, the topic-partition will have no data in it, and any > other replicas that might exist will truncate their logs to match, which > results in data loss. See below for a step-by-step demo of how to reproduce > this. > [KIP-858: Handle JBOD broker disk failure in > KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] > introduces the concept of a Disk UUID that we can use to solve this problem. > Specifically, when the leader restarts with an empty (but > correctly-formatted) disk, the actual UUID associated with the disk will be > different. The controller will notice upon broker re-registration that its > disk UUID differs from what was previously registered. Right now we have no > way of detecting this situation, but the disk UUID gives us that capability. > STEPS TO REPRODUCE: > Create a single broker cluster with single controller. 
The standard files > under config/kraft work well: > bin/kafka-storage.sh random-uuid > J8qXRwI-Qyi2G0guFTiuYw > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/controller.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker.properties > bin/kafka-server-start.sh config/kraft/controller.properties > bin/kafka-server-start.sh config/kraft/broker.properties > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 > --partitions 1 --replication-factor 1 > #create __consumer-offsets topics > bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 > --from-beginning > ^C > #confirm that __consumer_offsets topic partitions are all created and on > broker with node id 2 > bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > Now create 2 more brokers, with node IDs 11 and 12 > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed > 's/localhost:9092/localhost:9011/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > > config/kraft/broker11.properties > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed > 's/localhost:9092/localhost:9012/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > > config/kraft/broker12.properties > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker11.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker12.properties > bin/kafka-server-start.sh config/kraft/broker11.properties > bin/kafka-server-start.sh config/kraft/broker12.properties > #create a topic with a single partition replicated on two brokers > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 > --partitions 1 --replication-factor 2 > #reassign partitions onto brokers with Node IDs 11 and 12 > echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], > "version":1}' > /tmp/reassign.json > bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 > --reassignment-json-file /tmp/reassign.json --execute > bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 > --reassignment-json-file /tmp/reassign.json --verify > #make preferred leader 11 the actual leader if it not > bin/kafka-leader-election.sh --bootstrap-server localhost:9092 > --all-topic-partitions --election-type pre
[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk
[ https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino updated KAFKA-15495: -- Summary: Partition truncated when the only ISR member restarts with an empty disk (was: KRaft partition truncated when the only ISR member restarts with an empty disk) > Partition truncated when the only ISR member restarts with an empty disk > > > Key: KAFKA-15495 > URL: https://issues.apache.org/jira/browse/KAFKA-15495 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.3.2, 3.4.1, 3.6.0, 3.5.1 >Reporter: Ron Dagostino >Priority: Critical > > Assume a topic-partition in KRaft has just a single leader replica in the > ISR. Assume next that this replica goes offline. This replica's log will > define the contents of that partition when the replica restarts, which is > correct behavior. However, assume now that the replica has a disk failure, > and we then replace the failed disk with a new, empty disk that we also > format with the storage tool so it has the correct cluster ID. If we then > restart the broker, the topic-partition will have no data in it, and any > other replicas that might exist will truncate their logs to match, which > results in data loss. See below for a step-by-step demo of how to reproduce > this. > [KIP-858: Handle JBOD broker disk failure in > KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] > introduces the concept of a Disk UUID that we can use to solve this problem. > Specifically, when the leader restarts with an empty (but > correctly-formatted) disk, the actual UUID associated with the disk will be > different. The controller will notice upon broker re-registration that its > disk UUID differs from what was previously registered. Right now we have no > way of detecting this situation, but the disk UUID gives us that capability. > STEPS TO REPRODUCE: > Create a single broker cluster with single controller. 
The standard files > under config/kraft work well: > bin/kafka-storage.sh random-uuid > J8qXRwI-Qyi2G0guFTiuYw > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/controller.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker.properties > bin/kafka-server-start.sh config/kraft/controller.properties > bin/kafka-server-start.sh config/kraft/broker.properties > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 > --partitions 1 --replication-factor 1 > #create __consumer-offsets topics > bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 > --from-beginning > ^C > #confirm that __consumer_offsets topic partitions are all created and on > broker with node id 2 > bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe > Now create 2 more brokers, with node IDs 11 and 12 > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed > 's/localhost:9092/localhost:9011/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > > config/kraft/broker11.properties > cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed > 's/localhost:9092/localhost:9012/g' | sed > 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > > config/kraft/broker12.properties > #ensure we start clean > /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12 > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker11.properties > bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config > config/kraft/broker12.properties > bin/kafka-server-start.sh config/kraft/broker11.properties > bin/kafka-server-start.sh config/kraft/broker12.properties > #create a topic with a single partition replicated on two brokers > bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 > --partitions 1 --replication-factor 2 > #reassign partitions onto brokers with Node IDs 11 and 12 > echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], > "version":1}' > /tmp/reassign.json > bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 > --reassignment-json-file /tmp/reassign.json --execute > bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 > --reassignment-json-file /tmp/reassign.json --verify > #make preferred leader 11 the actual leader if it not > bin/kafka-leader-election.sh --bootstrap-server localhost:9092 > --all-topic-partitions --election-type preferred > #Confirm both brokers are in ISR and 11 is the leader > bin/kafka-topic
[jira] [Commented] (KAFKA-16051) Deadlock on connector initialization
[ https://issues.apache.org/jira/browse/KAFKA-16051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800815#comment-17800815 ] Octavian Ciubotaru commented on KAFKA-16051: I do not have the privileges to assign Jira ticket to myself nor assign you as reviewer in the PR on GitHub. > Deadlock on connector initialization > > > Key: KAFKA-16051 > URL: https://issues.apache.org/jira/browse/KAFKA-16051 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 2.6.3, 3.6.1 >Reporter: Octavian Ciubotaru >Priority: Major > > > Tested with Kafka 3.6.1 and 2.6.3. > The only plugin installed is confluentinc-kafka-connect-jdbc-10.7.4. > Stack trace for Kafka 3.6.1: > {noformat} > Found one Java-level deadlock: > = > "pool-3-thread-1": > waiting to lock monitor 0x7fbc88006300 (object 0x91002aa0, a > org.apache.kafka.connect.runtime.standalone.StandaloneHerder), > which is held by "Thread-9" > "Thread-9": > waiting to lock monitor 0x7fbc88008800 (object 0x9101ccd8, a > org.apache.kafka.connect.storage.MemoryConfigBackingStore), > which is held by "pool-3-thread-1"Java stack information for the threads > listed above: > === > "pool-3-thread-1": > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder$ConfigUpdateListener.onTaskConfigUpdate(StandaloneHerder.java:516) > - waiting to lock <0x91002aa0> (a > org.apache.kafka.connect.runtime.standalone.StandaloneHerder) > at > org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:137) > - locked <0x9101ccd8> (a > org.apache.kafka.connect.storage.MemoryConfigBackingStore) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.lambda$null$2(StandaloneHerder.java:229) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder$$Lambda$692/0x000840557440.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.21/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.21/FutureTask.java:264) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.21/ScheduledThreadPoolExecutor.java:304) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.21/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.21/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.21/Thread.java:829) > "Thread-9": > at > org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:129) > - waiting to lock <0x9101ccd8> (a > org.apache.kafka.connect.storage.MemoryConfigBackingStore) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.requestTaskReconfiguration(StandaloneHerder.java:255) > - locked <0x91002aa0> (a > org.apache.kafka.connect.runtime.standalone.StandaloneHerder) > at > org.apache.kafka.connect.runtime.HerderConnectorContext.requestTaskReconfiguration(HerderConnectorContext.java:50) > at > org.apache.kafka.connect.runtime.WorkerConnector$WorkerConnectorContext.requestTaskReconfiguration(WorkerConnector.java:548) > at > io.confluent.connect.jdbc.source.TableMonitorThread.run(TableMonitorThread.java:86) > Found 1 deadlock. 
> {noformat} > The jdbc source connector is loading tables from the database and updates the > configuration once the list is available. The deadlock is very consistent in > my environment, probably because the database is on the same machine. > Maybe it is possible to avoid this situation by always locking the herder > first and the config backing store second. From what I see, > updateConnectorTasks sometimes is called before locking on herder and other > times it is not. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16051) Deadlock on connector initialization
[ https://issues.apache.org/jira/browse/KAFKA-16051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800814#comment-17800814 ] Octavian Ciubotaru commented on KAFKA-16051: Hi [~gharris1727] , Thank you for you insights, I for sure still have a lot to learn about the project. In meantime I created the PR. The fix is to always lock on herder first and on the config backing store second. Tested in my environment and Kafka Connect starts successfully with the fix applied. > Deadlock on connector initialization > > > Key: KAFKA-16051 > URL: https://issues.apache.org/jira/browse/KAFKA-16051 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 2.6.3, 3.6.1 >Reporter: Octavian Ciubotaru >Priority: Major > > > Tested with Kafka 3.6.1 and 2.6.3. > The only plugin installed is confluentinc-kafka-connect-jdbc-10.7.4. > Stack trace for Kafka 3.6.1: > {noformat} > Found one Java-level deadlock: > = > "pool-3-thread-1": > waiting to lock monitor 0x7fbc88006300 (object 0x91002aa0, a > org.apache.kafka.connect.runtime.standalone.StandaloneHerder), > which is held by "Thread-9" > "Thread-9": > waiting to lock monitor 0x7fbc88008800 (object 0x9101ccd8, a > org.apache.kafka.connect.storage.MemoryConfigBackingStore), > which is held by "pool-3-thread-1"Java stack information for the threads > listed above: > === > "pool-3-thread-1": > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder$ConfigUpdateListener.onTaskConfigUpdate(StandaloneHerder.java:516) > - waiting to lock <0x91002aa0> (a > org.apache.kafka.connect.runtime.standalone.StandaloneHerder) > at > org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:137) > - locked <0x9101ccd8> (a > org.apache.kafka.connect.storage.MemoryConfigBackingStore) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.lambda$null$2(StandaloneHerder.java:229) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder$$Lambda$692/0x000840557440.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.21/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.21/FutureTask.java:264) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.21/ScheduledThreadPoolExecutor.java:304) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.21/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.21/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.21/Thread.java:829) > "Thread-9": > at > org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:129) > - waiting to lock <0x9101ccd8> (a > org.apache.kafka.connect.storage.MemoryConfigBackingStore) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483) > at > org.apache.kafka.connect.runtime.standalone.StandaloneHerder.requestTaskReconfiguration(StandaloneHerder.java:255) > - locked <0x91002aa0> (a > org.apache.kafka.connect.runtime.standalone.StandaloneHerder) > at > org.apache.kafka.connect.runtime.HerderConnectorContext.requestTaskReconfiguration(HerderConnectorContext.java:50) > at > 
org.apache.kafka.connect.runtime.WorkerConnector$WorkerConnectorContext.requestTaskReconfiguration(WorkerConnector.java:548) > at > io.confluent.connect.jdbc.source.TableMonitorThread.run(TableMonitorThread.java:86) > Found 1 deadlock. > {noformat} > The jdbc source connector is loading tables from the database and updates the > configuration once the list is available. The deadlock is very consistent in > my environment, probably because the database is on the same machine. > Maybe it is possible to avoid this situation by always locking the herder > first and the config backing store second. From what I see, > updateConnectorTasks sometimes is called before locking on herder and other > times it is not. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] KAFKA-16051: Fixed deadlock in StandaloneHerder [kafka]
developster opened a new pull request, #15080: URL: https://github.com/apache/kafka/pull/15080 *Description of the change* Changed StandaloneHerder to always synchronize on itself before invoking any methods on MemoryConfigBackingStore. This fixes the deadlock because the locks are now always acquired in the same order: first on the herder, then on the config backing store. *Testing strategy* Manually tested StandaloneHerder on Kafka 2.6.3 and Kafka 3.6.1; both exhibit the same issue, and 9 times out of 10 startup ends in a deadlock. Tested after the fix (3.8.0-SNAPSHOT) and the deadlock no longer occurs. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
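For readers following the fix: the idea is a single lock-ordering rule, so the two code paths that previously interleaved (herder -> store and store -> herder) can never hold the two monitors in opposite orders. The sketch below is a minimal, self-contained illustration of that rule only; the Herder/ConfigStore classes and method names are hypothetical stand-ins, not the actual StandaloneHerder or MemoryConfigBackingStore code from the patch.

{code:java}
// Minimal sketch of consistent lock ordering, assuming hypothetical Herder/ConfigStore classes.
// Rule: any thread that needs both monitors must lock the herder first, then the store.
public class LockOrderingSketch {

    static class ConfigStore {
        // Only ever called while the herder lock is already held.
        synchronized void putTaskConfigs(String connector) {
            System.out.println("stored task configs for " + connector);
        }
    }

    static class Herder {
        private final ConfigStore store = new ConfigStore();

        // Path 1: external reconfiguration request. Herder lock, then store lock.
        synchronized void requestTaskReconfiguration(String connector) {
            store.putTaskConfigs(connector);
        }

        // Path 2: callback driven from the store layer. It also takes the herder lock
        // first, so the two paths can never deadlock on opposite acquisition orders.
        void onTaskConfigUpdate(String connector) {
            synchronized (this) {
                store.putTaskConfigs(connector);
            }
        }
    }

    public static void main(String[] args) {
        Herder herder = new Herder();
        herder.requestTaskReconfiguration("jdbc-source");
        herder.onTaskConfigUpdate("jdbc-source");
    }
}
{code}

With this rule in place, the callback path can no longer wait for the herder lock while another thread holds it and waits for the store lock, which is exactly the cycle shown in the deadlock report quoted above.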
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800797#comment-17800797 ] Divij Vaidya commented on KAFKA-16052: -- I am now running a single threaded test execution after disabling GroupCoordinatorConcurrencyTest. Once we have verified that this is the culprit test, we can try to find a fix. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15904) Downgrade tests are failing with directory.id
[ https://issues.apache.org/jira/browse/KAFKA-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Proven Provenzano resolved KAFKA-15904. --- Resolution: Fixed This was merged into trunk a month ago, long before the 3.7 branch was cut. I just forgot to close the ticket. > Downgrade tests are failing with directory.id > -- > > Key: KAFKA-15904 > URL: https://issues.apache.org/jira/browse/KAFKA-15904 > Project: Kafka > Issue Type: Bug >Reporter: Manikumar >Assignee: Proven Provenzano >Priority: Major > Fix For: 3.7.0 > > > {{kafkatest.tests.core.downgrade_test.TestDowngrade}} tests are failing after > [https://github.com/apache/kafka/pull/14628.] > We have added {{directory.id}} to metadata.properties. This means > {{metadata.properties}} will be different for different log directories. > Cluster downgrades will fail with below error if we have multiple log > directories . This looks blocker or requires additional downgrade steps from > AK 3.7. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15904) Downgrade tests are failing with directory.id
[ https://issues.apache.org/jira/browse/KAFKA-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Proven Provenzano updated KAFKA-15904: -- Fix Version/s: 3.7.0 (was: 3.8.0) > Downgrade tests are failing with directory.id > -- > > Key: KAFKA-15904 > URL: https://issues.apache.org/jira/browse/KAFKA-15904 > Project: Kafka > Issue Type: Bug >Reporter: Manikumar >Assignee: Proven Provenzano >Priority: Major > Fix For: 3.7.0 > > > {{kafkatest.tests.core.downgrade_test.TestDowngrade}} tests are failing after > [https://github.com/apache/kafka/pull/14628.] > We have added {{directory.id}} to metadata.properties. This means > {{metadata.properties}} will be different for different log directories. > Cluster downgrades will fail with below error if we have multiple log > directories . This looks blocker or requires additional downgrade steps from > AK 3.7. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2023-12-27 at 15.31.09.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800788#comment-17800788 ] Divij Vaidya commented on KAFKA-16052: -- ok, I might have found the offending test. It is probably GroupCoordinatorConcurrencyTest. This test alone uses 400MB of heap and the heap dump has a pattern consistent with the leaked objects we found above. !Screenshot 2023-12-27 at 15.31.09.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800772#comment-17800772 ] Divij Vaidya edited comment on KAFKA-16052 at 12/27/23 2:01 PM: -Let's try to find the test which was running at the time of this heap dump capture by analysing the rest of the objects reachable by GC at this time.- Scratch that. It wouldn't work since the leaks could have occurred in previous tests and not necessarily in the ones that are running at the time of heap dump. I need to open the heap dump with a better tool than Intellij's profiler and see if I can explore the mocks to see what the mock is and what invocations is it storing. was (Author: divijvaidya): Let's try to find the test which was running at the time of this heap dump capture by analysing the rest of the objects reachable by GC at this time. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16054) Sudden 100% CPU on a broker
[ https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleksandr Shulgin updated KAFKA-16054: -- Description: We have observed now for the 3rd time in production the issue where a Kafka broker will suddenly jump to 100% CPU usage and will not recover on its own until manually restarted. After a deeper investigation, we now believe that this is an instance of the infamous epoll bug. See: [https://github.com/netty/netty/issues/327] [https://github.com/netty/netty/pull/565] (original workaround) [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632] (same workaround in the current Netty code) The first occurrence in our production environment was on 2023-08-26 and the other two — on 2023-12-10 and 2023-12-20. Each time the high CPU issue is also resulting in this other issue (misplaced messages) I was asking about on the users mailing list in September, but to date got not a single reply, unfortunately: [https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy] We still do not know how this other issue is happening. When the high CPU happens, top(1) reports a number of "data-plane-kafka..." threads consuming ~60% user and ~40% system CPU, and the thread dump contains a lot of stack traces like the following one: "data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" #76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0xa12d7690 nid=0x20c runnable [0xfffed87fe000] java.lang.Thread.State: RUNNABLE #011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method) #011at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118) #011at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129) #011- locked <0x0006c1246410> (a sun.nio.ch.Util$2) #011- locked <0x0006c1246318> (a sun.nio.ch.EPollSelectorImpl) #011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141) #011at org.apache.kafka.common.network.Selector.select(Selector.java:874) #011at org.apache.kafka.common.network.Selector.poll(Selector.java:465) #011at kafka.network.Processor.poll(SocketServer.scala:1107) #011at kafka.network.Processor.run(SocketServer.scala:1011) #011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840) At the same time the Linux kernel reports repeatedly "TCP: out of memory – consider tuning tcp_mem". We are running relatively big machines in production — c6g.4xlarge with 30 GB RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB memory pages. We were able to reproduce the issue in our test environment (which is using 4x smaller machines), simply by tuning the tcp_mem down by a factor of 10: "sudo sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". 
The strace of one of the busy Kafka threads shows the following syscalls repeating constantly: epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=468381628432382}}], 1024, 300, NULL, 8) = 1 fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0 fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0 sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable) Resetting the "tcp_mem" parameters back to the auto-configured values in the test environment removes the pressure from the broker and it can continue normally without restart. We have found a bug report here that suggests that an issue may be partially due to a kernel bug: [https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] (they are using version 5.15) We have updated our kernel from 6.1.29 to 6.1.66 and that made it harder to reproduce the issue, but we can still do it by reducing all of the "tcp_mem" parameters by a factor of 1,000. The JVM behavior is the same under these conditions. A similar issue is reported here, affecting Kafka Connect: https://issues.apache.org/jira/browse/KAFKA-4739 Our production Kafka is running version 3.3.2, and test — 3.6.1. The issue is present on both systems. The issue is also reproducible on JDK 11 (as you can see from the stack trace, we are using 17). was: We have observed now for the 3rd time in production the issue where a Kafka broker will suddenly jump to 100% CPU usage and will not recover on its own until manually restarted. After a deeper investigation, we now believe that this is an instance of the infamous epoll bug. See: [https://github.com/netty/netty/issues/327] [https://github.com/netty/netty/pull/565] (original workaround) [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632] (same workaround in the current Netty co
[jira] [Updated] (KAFKA-16054) Sudden 100% CPU on a broker
[ https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleksandr Shulgin updated KAFKA-16054: -- Description: We have observed now for the 3rd time in production the issue where a Kafka broker will suddenly jump to 100% CPU usage and will not recover on its own until manually restarted. After a deeper investigation, we now believe that this is an instance of the infamous epoll bug. See: [https://github.com/netty/netty/issues/327] [https://github.com/netty/netty/pull/565] (original workaround) [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632] (same workaround in the current Netty code) The first occurrence in our production environment was on 2023-08-26 and the other two — on 2023-12-10 and 2023-12-20. Each time the high CPU issue is also resulting in this other issue (misplaced messages) I was asking about on the users mailing list in September, but to date got not a single reply, unfortunately: [https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy] We still do not know how this other issue is happening. When the high CPU happens, top(1) reports a number of "data-plane-kafka..." threads consuming ~60% user and ~40% system CPU, and the thread dump contains a lot of stack traces like the following one: "data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" #76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0xa12d7690 nid=0x20c runnable [0xfffed87fe000] java.lang.Thread.State: RUNNABLE #011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method) #011at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118) #011at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129) #011- locked <0x0006c1246410> (a sun.nio.ch.Util$2) #011- locked <0x0006c1246318> (a sun.nio.ch.EPollSelectorImpl) #011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141) #011at org.apache.kafka.common.network.Selector.select(Selector.java:874) #011at org.apache.kafka.common.network.Selector.poll(Selector.java:465) #011at kafka.network.Processor.poll(SocketServer.scala:1107) #011at kafka.network.Processor.run(SocketServer.scala:1011) #011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840) At the same time the Linux kernel reports repeatedly "TCP: out of memory – consider tuning tcp_mem". We are running relatively big machines in production — c6g.4xlarge with 30 GB RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB memory pages. We were able to reproduce the issue in our test environment (which is using 4x smaller machines), simply by tuning the tcp_mem down by a factor of 10: "sudo sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". 
The strace of one of the busy Kafka threads shows the following syscalls repeating constantly: epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=468381628432382}}], 1024, 300, NULL, 8) = 1 fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0 fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0 sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable) Resetting the "tcp_mem" parameters back to the auto-configured values in the test environment removes the pressure from the broker and it can continue normally without restart. We have found a bug report here that suggests that an issue may be partially due to a kernel bug: [https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] (they are using version 5.15) We have updated our kernel from 6.1.29 to 6.1.66 and it made it harder to reproduce the issue, but we can still do it by reducing all of the "tcp_mem" parameters by a factor of 1,000. The JVM behavior is the same under these conditions. A similar issue is reported here, affecting Kafka Connect: https://issues.apache.org/jira/browse/KAFKA-4739 Our production Kafka is running version 3.3.2, and test — 3.6.1. The issue is present on both systems. The issue is also reproducible on JDK 11 (as you can see from the stack trace, we are using 17). was: We have observed now for the 3rd time in production the issue where a Kafka broker will suddenly jump to 100% CPU usage and will not recover on its own until manually restarted. After a deeper investigation, we now believe that this is an instance of the infamous epoll bug. See: [https://github.com/netty/netty/issues/327] [https://github.com/netty/netty/pull/565] (original workaround) [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632] (same workaround in the current Netty code)
[jira] [Updated] (KAFKA-16054) Sudden 100% CPU on a broker
[ https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleksandr Shulgin updated KAFKA-16054: -- Description: We have observed now for the 3rd time in production the issue where a Kafka broker will suddenly jump to 100% CPU usage and will not recover on its own until manually restarted. After a deeper investigation, we now believe that this is an instance of the infamous epoll bug. See: [https://github.com/netty/netty/issues/327] [https://github.com/netty/netty/pull/565] (original workaround) [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632] (same workaround in the current Netty code) The first occurrence in our production environment was on 2023-08-26 and the other two — on 2023-12-10 and 2023-12-20. Each time the high CPU issue is also resulting in this other issue (misplaced messages) I was asking about on the users mailing list in September, but to date got not a single reply, unfortunately: [https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy] We still do not know how this other issue is happening. When the high CPU happens, top(1) reports a number of "data-plane-kafka..." threads consuming ~60% user and ~40% system CPU, and the thread dump contains a lot of stack traces like the following one: "data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" #76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0xa12d7690 nid=0x20c runnable [0xfffed87fe000] java.lang.Thread.State: RUNNABLE #011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method) #011at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118) #011at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129) #011- locked <0x0006c1246410> (a sun.nio.ch.Util$2) #011- locked <0x0006c1246318> (a sun.nio.ch.EPollSelectorImpl) #011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141) #011at org.apache.kafka.common.network.Selector.select(Selector.java:874) #011at org.apache.kafka.common.network.Selector.poll(Selector.java:465) #011at kafka.network.Processor.poll(SocketServer.scala:1107) #011at kafka.network.Processor.run(SocketServer.scala:1011) #011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840) At the same time the Linux kernel reports repeatedly "TCP: out of memory – consider tuning tcp_mem". We are running relatively big machines in production — c6g.4xlarge with 30 GB RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB memory pages. We were able to reproduce the issue in our test environment (which is using 4x smaller machines), simply by tuning the tcp_mem down by a factor of 10: "sudo sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". 
The strace of one of the busy Kafka threads shows the following syscalls repeating constantly: epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=468381628432382}}], 1024, 300, NULL, 8) = 1 fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0 fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0 sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable) Resetting the "tcp_mem" parameters back to the auto-configured values in the test environment removes the pressure from the broker and it can continue normally without restart. We have found a bug report here that suggests that an issue may be partially due to a kernel bug: [https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] (they are using version 5.15) We have updated our kernel from 6.1.29 to 6.1.66 and it made it harder to reproduce the issue, but we can still do it by reducing all of the "tcp_mem" parameters by a factor of 1,000. The JVM behavior is the same under these conditions. A similar issue is reported here, affecting Kafka Connect: https://issues.apache.org/jira/browse/KAFKA-4739 Our production Kafka is running version 3.3.2, and test — 3.6.1. The issue is present on both systems. was: We have observed now for the 3rd time in production the issue where a Kafka broker will suddenly jump to 100% CPU usage and will not recover on its own until manually restarted. After a deeper investigation, we now believe that this is an instance of the infamous epoll bug. See: [https://github.com/netty/netty/issues/327] [https://github.com/netty/netty/pull/565] (original workaround) [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632] (same workaround in the current Netty code) The first occurrence in our production environment was on 2023-08-26 and the other two — on 2023-
[jira] [Updated] (KAFKA-16054) Sudden 100% CPU on a broker
[ https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleksandr Shulgin updated KAFKA-16054: -- Description: We have observed now for the 3rd time in production the issue where a Kafka broker will suddenly jump to 100% CPU usage and will not recover on its own until manually restarted. After a deeper investigation, we now believe that this is an instance of the infamous epoll bug. See: [https://github.com/netty/netty/issues/327] [https://github.com/netty/netty/pull/565] (original workaround) [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632] (same workaround in the current Netty code) The first occurrence in our production environment was on 2023-08-26 and the other two — on 2023-12-10 and 2023-12-20. Each time the high CPU issue is also resulting in this other issue (misplaced messages) I was asking about on the users mailing list in September, but to date got not a single reply, unfortunately: [https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy] We still do not know how this other issue is happening. When the high CPU happens, "top -H" reports a number of "data-plane-kafka-..." threads consuming ~60% user and ~40% system CPU, and the thread dump contains a lot of stack traces like the following one: "data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" #76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0xa12d7690 nid=0x20c runnable [0xfffed87fe000] java.lang.Thread.State: RUNNABLE #011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method) #011at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118) #011at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129) #011- locked <0x0006c1246410> (a sun.nio.ch.Util$2) #011- locked <0x0006c1246318> (a sun.nio.ch.EPollSelectorImpl) #011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141) #011at org.apache.kafka.common.network.Selector.select(Selector.java:874) #011at org.apache.kafka.common.network.Selector.poll(Selector.java:465) #011at kafka.network.Processor.poll(SocketServer.scala:1107) #011at kafka.network.Processor.run(SocketServer.scala:1011) #011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840) At the same time the Linux kernel reports repeatedly "TCP: out of memory – consider tuning tcp_mem". We are running relatively big machines in production — c6g.4xlarge with 30 GB RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB memory pages. We were able to reproduce the issue in our test environment (which is using 4x smaller machines), simply by tuning the tcp_mem down by a factor of 10: "sudo sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". The strace of one of the busy Kafka threads shows the following syscalls repeating constantly: epoll_pwait(15558, [\\{events=EPOLLOUT, data={u32=12286, u64=468381628432382}}], 1024, 300, NULL, 8) = 1 fstat(12019,\{st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0 fstat(12019, \{st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0 sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable) Resetting the "tcp_mem" parameters back to the auto-configured values in the test environment removes the pressure from the broker and it can continue normally without restart. 
We have found a bug report here that suggests that an issue may be partially due to a kernel bug: [https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] (they are using version 5.15) We have updated our kernel from 6.1.29 to 6.1.66 and it made it harder to reproduce the issue, but we can still do it by reducing all the of "tcp_mem" parameters by a factor of 1,000. The JVM behavior is the same under these conditions. A similar issue is reported here, affecting Kafka Connect: https://issues.apache.org/jira/browse/KAFKA-4739 Our production Kafka is running version 3.3.2, and test — 3.6.1. The issue is present on both systems. was: We have observed now for the 3rd time in production the issue where a Kafka broker will suddenly jump to 100% CPU usage and will not recover on its own until manually restarted. After a deeper investigation, we now believe that this is an instance of the infamous epoll bug. See: [https://github.com/netty/netty/issues/327] [https://github.com/netty/netty/pull/565] (original workaround) [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632] (same workaround in the current Netty code) The first occurrence in our production environment was on 2023-08-26 and the other two — on 2023-12-10 and 2023-12-20. Each time the high CPU issue is also resulting in
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800772#comment-17800772 ] Divij Vaidya commented on KAFKA-16052: -- Let's try to find the test which was running at the time of this heap dump capture by analysing the rest of the objects reachable by GC at this time. > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16054) Sudden 100% CPU on a broker
Oleksandr Shulgin created KAFKA-16054: - Summary: Sudden 100% CPU on a broker Key: KAFKA-16054 URL: https://issues.apache.org/jira/browse/KAFKA-16054 Project: Kafka Issue Type: Bug Components: network Affects Versions: 3.6.1, 3.3.2 Environment: Amazon AWS, c6g.4xlarge arm64 16 vCPUs + 30 GB, Amazon Linux Reporter: Oleksandr Shulgin We have observed now for the 3rd time in production the issue where a Kafka broker will suddenly jump to 100% CPU usage and will not recover on its own until manually restarted. After a deeper investigation, we now believe that this is an instance of the infamous epoll bug. See: [https://github.com/netty/netty/issues/327] [https://github.com/netty/netty/pull/565] (original workaround) [https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632] (same workaround in the current Netty code) The first occurrence in our production environment was on 2023-08-26 and the other two — on 2023-12-10 and 2023-12-20. Each time the high CPU issue is also resulting in this other issue (misplaced messages) I was asking about on the users mailing list in September, but to date got not a single reply, unfortunately: [https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy] We still do not know how this other issue is happening. When the high CPU happens, "top {-}H" reports a number of "data-plane-kafka{-}..." threads consuming ~60% user and ~40% system CPU, and the thread dump contains a lot of stack traces like the following one: "data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" #76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0xa12d7690 nid=0x20c runnable [0xfffed87fe000] java.lang.Thread.State: RUNNABLE #011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method) #011at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118) #011at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129) #011- locked <0x0006c1246410> (a sun.nio.ch.Util$2) #011- locked <0x0006c1246318> (a sun.nio.ch.EPollSelectorImpl) #011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141) #011at org.apache.kafka.common.network.Selector.select(Selector.java:874) #011at org.apache.kafka.common.network.Selector.poll(Selector.java:465) #011at kafka.network.Processor.poll(SocketServer.scala:1107) #011at kafka.network.Processor.run(SocketServer.scala:1011) #011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840) At the same time the Linux kernel reports repeatedly "TCP: out of memory – consider tuning tcp_mem". We are running relatively big machines in production — c6g.4xlarge with 30 GB RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB memory pages. We were able to reproduce the issue in our test environment (which is using 4x smaller machines), simply by tuning the tcp_mem down by a factor of 10: "sudo sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". 
The strace of one of the busy Kafka threads shows the following syscalls repeating constantly: epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=468381628432382}}], 1024, 300, NULL, 8) = 1 fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0 fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0 sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable) Resetting the "tcp_mem" parameters back to the auto-configured values in the test environment removes the pressure from the broker and it can continue normally without restart. We have found a bug report here that suggests that an issue may be partially due to a kernel bug: [https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] (they are using version 5.15) We have updated our kernel from 6.1.29 to 6.1.66 and it made it harder to reproduce the issue, but we can still do it by reducing all of the "tcp_mem" parameters by a factor of 1,000. The JVM behavior is the same under these conditions. A similar issue is reported here, affecting Kafka Connect: https://issues.apache.org/jira/browse/KAFKA-4739 Our production Kafka is running version 3.3.2, and test — 3.6.1. The issue is present on both systems. -- This message was sent by Atlassian Jira (v8.20.10#820010)
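For context on the Netty workaround referenced above: Netty counts select() calls that return immediately with zero ready keys and, past a threshold, rebuilds the selector by re-registering every channel with a fresh one. The sketch below only illustrates that pattern; the class, the threshold, and the rebuild details are made up for the example, and this is not Kafka's network Selector code.

{code:java}
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Simplified illustration of the Netty-style epoll-bug workaround: if select() keeps
// waking up with nothing ready, assume the selector is spinning and rebuild it.
public class SelectorRebuildSketch {
    private static final int SPIN_THRESHOLD = 512; // arbitrary value for the sketch

    private Selector selector;
    private int emptyWakeups = 0;

    public SelectorRebuildSketch() throws IOException {
        this.selector = Selector.open();
    }

    void pollOnce(long timeoutMs) throws IOException {
        int ready = selector.select(timeoutMs);
        if (ready == 0 /* and, in the real workaround, the timeout did not actually elapse */) {
            emptyWakeups++;
            if (emptyWakeups >= SPIN_THRESHOLD) {
                rebuildSelector();
                emptyWakeups = 0;
            }
        } else {
            emptyWakeups = 0;
            // ... process selector.selectedKeys() here ...
        }
    }

    private void rebuildSelector() throws IOException {
        Selector newSelector = Selector.open();
        for (SelectionKey key : selector.keys()) {
            if (key.isValid()) {
                // Re-register each channel (and its attachment) with the new selector.
                key.channel().register(newSelector, key.interestOps(), key.attachment());
                key.cancel();
            }
        }
        selector.close();
        selector = newSelector;
    }

    public static void main(String[] args) throws IOException {
        SelectorRebuildSketch sketch = new SelectorRebuildSketch();
        sketch.pollOnce(10); // with no channels registered this simply times out
    }
}
{code}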
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800770#comment-17800770 ] Divij Vaidya commented on KAFKA-16052: -- We have a org.mockito.internal.verification.DefaultRegisteredInvocations class using 590MB of heap. This class maintains a map of all mocks with their associated invocations. If we have a mock that is called many many times, the map will be very large to track all invocations and hence, cause this heap utilizations. Now we need to find which test has this mock that has around 767K invocations on a mock and that is our culprit. !Screenshot 2023-12-27 at 14.45.20.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
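Once the offending test is identified, one possible mitigation (only if the hot mock is purely stubbed and its individual calls never need to be verified) is to stop Mockito from recording invocations at all via stubOnly(), or to drop the recorded invocations periodically with clearInvocations(). A minimal sketch, assuming a hypothetical Callback interface rather than whatever mock the test actually uses:

{code:java}
import static org.mockito.Mockito.clearInvocations;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.withSettings;

// Sketch: keep a heavily-called mock from accumulating hundreds of thousands of
// recorded invocations (the pattern seen in the heap dump described above).
public class StubOnlyMockSketch {

    // Hypothetical collaborator interface, just for the example.
    public interface Callback {
        void onComplete(long offset);
    }

    public static void main(String[] args) {
        // Option 1: a stub-only mock never records invocations. It cannot be verify()-ed,
        // but it also cannot pin huge numbers of Invocation objects in memory.
        Callback stubOnly = mock(Callback.class, withSettings().stubOnly());
        for (long i = 0; i < 1_000_000; i++) {
            stubOnly.onComplete(i);
        }

        // Option 2: keep a normal mock but periodically drop what it has recorded so far.
        Callback normal = mock(Callback.class);
        for (long i = 0; i < 200_000; i++) {
            normal.onComplete(i);
            if (i % 50_000 == 0) {
                clearInvocations(normal);
            }
        }
    }
}
{code}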
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2023-12-27 at 14.45.20.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Description: *Problem* Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] *Setup* To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} *Result of experiment* This is how the heap memory utilized looks like, starting from tens of MB to ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. Note that the total number of threads also increases but it does not correlate with sharp increase in heap memory usage. The heap dump is available at [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] !Screenshot 2023-12-27 at 14.22.21.png! was: Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} This is how the heap mempry utilized looks like, starting from tens of MB to ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. Note that the total number of threads also increases but it does not correlate with sharp increase in heap memory usage. The heap dump is available at [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] !Screenshot 2023-12-27 at 14.22.21.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Description: Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} This is how the heap memory utilization looks, starting from tens of MB and ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. Note that the total number of threads also increases but it does not correlate with sharp increase in heap memory usage. The heap dump is available at [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] !Screenshot 2023-12-27 at 14.22.21.png! was: Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} This is how the heap memory utilization looks, starting from tens of MB and ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. Note that the total number of threads also increases but it does not correlate with sharp increase in heap memory usage. The heap dump is attached to this JIRA. !Screenshot 2023-12-27 at 14.22.21.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png > > > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Description: Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} This is how the heap memory utilization looks, starting from tens of MB and ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. Note that the total number of threads also increases but it does not correlate with sharp increase in heap memory usage. The heap dump is attached to this JIRA. !Screenshot 2023-12-27 at 14.22.21.png! was: Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} This is how the heap memory utilization looks, starting from tens of MB and ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. 
Note that the total number of threads also increases but it does not correlate with sharp increase in heap memory usage. !Screenshot 2023-12-27 at 14.22.21.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png > > > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ?
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Description: Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} This is how the heap memory utilization looks, starting from tens of MB and ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. Note that the total number of threads also increases but it does not correlate with sharp increase in heap memory usage. !Screenshot 2023-12-27 at 14.22.21.png! was: Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} This is how the heap memory utilization looks, starting from tens of MB and ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. !Screenshot 2023-12-27 at 14.22.21.png! 
> OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png > > > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toIn
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Description: Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} This is how the heap memory utilization looks, starting from tens of MB and ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. !Screenshot 2023-12-27 at 14.22.21.png! was: Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png > > > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2023-12-27 at 14.22.21.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png > > > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
divijvaidya commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1437034332 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -2704,28 +2710,30 @@ class ReplicaManagerTest { time = time, scheduler = time.scheduler, logManager = logManager, - quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""), + quotaManagers = quotaManager, metadataCache = MetadataCache.zkMetadataCache(config.brokerId, config.interBrokerProtocolVersion), logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size), alterPartitionManager = alterPartitionManager, threadNamePrefix = Option(this.getClass.getName)) -logManager.startup(Set.empty[String]) - -// Create a hosted topic, a hosted topic that will become stray, and a stray topic -val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet -createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet -createStrayLogs(10, logManager) +try { + logManager.startup(Set.empty[String]) -val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) + // Create a hosted topic, a hosted topic that will become stray, and a stray topic + val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet + createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet + createStrayLogs(10, logManager) - replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) + val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) -assertEquals(validLogs, logManager.allLogs.toSet) -assertEquals(validLogs.size, replicaManager.partitionCount.value) + replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) -replicaManager.shutdown() -logManager.shutdown() + assertEquals(validLogs, logManager.allLogs.toSet) + assertEquals(validLogs.size, replicaManager.partitionCount.value) +} finally { + replicaManager.shutdown() + logManager.shutdown() Review Comment: That is fair. In that case, we should perhaps use `public static void closeQuietly(AutoCloseable closeable, String name, AtomicReference firstException)` and end the finally with something similar to ``` if (firstException != null) { throw new KafkaException("Failed to close", exception); } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
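For illustration, here is a minimal Scala sketch of the teardown pattern suggested above. It is not the merged change: it assumes `replicaManager` and `logManager` are the resources built earlier in the test, and it uses `Utils.closeQuietly` with an `AtomicReference` so that a failure to close the first resource does not prevent closing the second. Note that because `firstException` is an `AtomicReference`, the final check and rethrow go through `firstException.get()`.

```scala
import java.util.concurrent.atomic.AtomicReference

import org.apache.kafka.common.KafkaException
import org.apache.kafka.common.utils.Utils

val firstException = new AtomicReference[Throwable]()
try {
  // ... test body exercising replicaManager and logManager ...
} finally {
  // AutoCloseable is a single-method interface, so a Scala lambda converts to it.
  // Each close failure is recorded instead of aborting the remaining closes.
  Utils.closeQuietly(() => replicaManager.shutdown(checkpointHW = false), "replicaManager", firstException)
  Utils.closeQuietly(() => logManager.shutdown(), "logManager", firstException)
  if (firstException.get() != null)
    throw new KafkaException("Failed to close test resources", firstException.get())
}
```

This keeps the loud failure the reviewers want (a broken close still fails the test) while guaranteeing that every resource gets a chance to close.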
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800764#comment-17800764 ] Divij Vaidya commented on KAFKA-16052: -- The tests are still running but as of now the root cause looks like a leak caused by mockito since it occupies ~950MB of heap. !Screenshot 2023-12-27 at 14.04.52.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png > > > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
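The comment above points at Mockito as the largest consumer of retained heap. One common mitigation for that pattern, shown here only as a hedged sketch and not as the fix eventually applied in Kafka, is to clear the inline mock maker's state after each test so that mocks created in one test method are not retained for the rest of the run. The class name below is a placeholder.

```scala
import org.junit.jupiter.api.AfterEach
import org.mockito.Mockito

class SomeMockHeavyTest { // placeholder name, not an actual Kafka test class

  @AfterEach
  def releaseMocks(): Unit = {
    // Drops the references held by Mockito's inline mock maker between test methods.
    Mockito.framework().clearInlineMocks()
  }
}
```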
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2023-12-27 at 14.04.52.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png > > > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] KAFKA-16053: Fix leaks of KDC server in tests [kafka]
divijvaidya commented on PR #15079: URL: https://github.com/apache/kafka/pull/15079#issuecomment-1870288475 @dajac @ableegoldman please review. This is not the root cause of our recent OOM errors, but it is a contributor to them. (cc: @stanislavkozlovski as a potential backport to 3.7) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (KAFKA-16053) Fix leaked Default DirectoryService
[ https://issues.apache.org/jira/browse/KAFKA-16053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16053: - Attachment: Screenshot 2023-12-27 at 13.18.33.png > Fix leaked Default DirectoryService > --- > > Key: KAFKA-16053 > URL: https://issues.apache.org/jira/browse/KAFKA-16053 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 13.18.33.png > > > Heap dump hinted towards a leaked DefaultDirectoryService while running > :core:test. It used 123MB of retained memory. > This Jira fixes the leak. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16053) Fix leaked Default DirectoryService
[ https://issues.apache.org/jira/browse/KAFKA-16053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16053: - Description: Heap dump hinted towards a leaked DefaultDirectoryService while running :core:test. It used 123MB of retained memory. This Jira fixes the leak. was: Heap dump hinted towards a leaked DefaultDirectoryService while running :core:test. It used 240MB of retained memory. This Jira fixes the leak. > Fix leaked Default DirectoryService > --- > > Key: KAFKA-16053 > URL: https://issues.apache.org/jira/browse/KAFKA-16053 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 13.18.33.png > > > Heap dump hinted towards a leaked DefaultDirectoryService while running > :core:test. It used 123MB of retained memory. > This Jira fixes the leak. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
showuon commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1437026129 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -2704,28 +2710,30 @@ class ReplicaManagerTest { time = time, scheduler = time.scheduler, logManager = logManager, - quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""), + quotaManagers = quotaManager, metadataCache = MetadataCache.zkMetadataCache(config.brokerId, config.interBrokerProtocolVersion), logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size), alterPartitionManager = alterPartitionManager, threadNamePrefix = Option(this.getClass.getName)) -logManager.startup(Set.empty[String]) - -// Create a hosted topic, a hosted topic that will become stray, and a stray topic -val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet -createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet -createStrayLogs(10, logManager) +try { + logManager.startup(Set.empty[String]) -val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) + // Create a hosted topic, a hosted topic that will become stray, and a stray topic + val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet + createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet + createStrayLogs(10, logManager) - replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) + val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) -assertEquals(validLogs, logManager.allLogs.toSet) -assertEquals(validLogs.size, replicaManager.partitionCount.value) + replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) -replicaManager.shutdown() -logManager.shutdown() + assertEquals(validLogs, logManager.allLogs.toSet) + assertEquals(validLogs.size, replicaManager.partitionCount.value) +} finally { + replicaManager.shutdown() + logManager.shutdown() Review Comment: So, I think I should only swallow in more than 1 instances to get close in finally block. Is that what you expect @satishd ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
showuon commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1437024347 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -2704,28 +2710,30 @@ class ReplicaManagerTest { time = time, scheduler = time.scheduler, logManager = logManager, - quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""), + quotaManagers = quotaManager, metadataCache = MetadataCache.zkMetadataCache(config.brokerId, config.interBrokerProtocolVersion), logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size), alterPartitionManager = alterPartitionManager, threadNamePrefix = Option(this.getClass.getName)) -logManager.startup(Set.empty[String]) - -// Create a hosted topic, a hosted topic that will become stray, and a stray topic -val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet -createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet -createStrayLogs(10, logManager) +try { + logManager.startup(Set.empty[String]) -val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) + // Create a hosted topic, a hosted topic that will become stray, and a stray topic + val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet + createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet + createStrayLogs(10, logManager) - replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) + val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) -assertEquals(validLogs, logManager.allLogs.toSet) -assertEquals(validLogs.size, replicaManager.partitionCount.value) + replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) -replicaManager.shutdown() -logManager.shutdown() + assertEquals(validLogs, logManager.allLogs.toSet) + assertEquals(validLogs.size, replicaManager.partitionCount.value) +} finally { + replicaManager.shutdown() + logManager.shutdown() Review Comment: I think what Satish wants is we close both of them without throwing any exception to not closing the 2nd instance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
divijvaidya commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1437022630 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -2704,28 +2710,30 @@ class ReplicaManagerTest { time = time, scheduler = time.scheduler, logManager = logManager, - quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""), + quotaManagers = quotaManager, metadataCache = MetadataCache.zkMetadataCache(config.brokerId, config.interBrokerProtocolVersion), logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size), alterPartitionManager = alterPartitionManager, threadNamePrefix = Option(this.getClass.getName)) -logManager.startup(Set.empty[String]) - -// Create a hosted topic, a hosted topic that will become stray, and a stray topic -val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet -createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet -createStrayLogs(10, logManager) +try { + logManager.startup(Set.empty[String]) -val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) + // Create a hosted topic, a hosted topic that will become stray, and a stray topic + val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet + createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet + createStrayLogs(10, logManager) - replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) + val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) -assertEquals(validLogs, logManager.allLogs.toSet) -assertEquals(validLogs.size, replicaManager.partitionCount.value) + replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) -replicaManager.shutdown() -logManager.shutdown() + assertEquals(validLogs, logManager.allLogs.toSet) + assertEquals(validLogs.size, replicaManager.partitionCount.value) +} finally { + replicaManager.shutdown() + logManager.shutdown() Review Comment: Do we want to close them quietly? This will suppress memory leaks since we won't get to know if the close is not happening correctly. It's better to fail when close has an error so that we get to know that something is wrong with tests, -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] KAFKA-16053: Fix leaks of KDC server in tests [kafka]
divijvaidya opened a new pull request, #15079: URL: https://github.com/apache/kafka/pull/15079 # Problem We are facing OOM while running the test suite for Apache Kafka, as discussed in https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4 # Changes This PR fixes ~240MB of leaked heap caused by DefaultDirectoryService objects. The `DefaultDirectoryService` is used by `MiniKdc.scala`. This change fixes a couple of leaks where the kdc service is not closed at the end of a test run. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
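As a rough illustration of the kind of cleanup this PR describes — the names below are placeholders, and the real change edits existing test classes rather than adding a new one — the idea is simply to stop the `MiniKdc` instance in an `@AfterEach` hook so its `DefaultDirectoryService` is not retained after the test finishes.

```scala
import kafka.security.minikdc.MiniKdc
import org.junit.jupiter.api.AfterEach

class SomeSaslTest { // placeholder class; the PR touches existing SASL/Kerberos tests

  private var kdc: MiniKdc = _

  @AfterEach
  def tearDown(): Unit = {
    // Stop the KDC even if the test body failed, so the directory service is released.
    if (kdc != null)
      kdc.stop()
  }
}
```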
[jira] [Created] (KAFKA-16053) Fix leaked Default DirectoryService
Divij Vaidya created KAFKA-16053: Summary: Fix leaked Default DirectoryService Key: KAFKA-16053 URL: https://issues.apache.org/jira/browse/KAFKA-16053 Project: Kafka Issue Type: Sub-task Reporter: Divij Vaidya Assignee: Divij Vaidya Heap dump hinted towards a leaked DefaultDirectoryService while running :core:test. It used 240MB of retained memory. This Jira fixes the leak. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16052) OOM in Kafka test suite
Divij Vaidya created KAFKA-16052: Summary: OOM in Kafka test suite Key: KAFKA-16052 URL: https://issues.apache.org/jira/browse/KAFKA-16052 Project: Kafka Issue Type: Bug Affects Versions: 3.7.0 Reporter: Divij Vaidya Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis. How to run tests using a single thread: {code:java} diff --git a/build.gradle b/build.gradle index f7abbf4f0b..81df03f1ee 100644 --- a/build.gradle +++ b/build.gradle @@ -74,9 +74,8 @@ ext { "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" )- maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() - maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() : - Math.min(Runtime.runtime.availableProcessors(), 8) + maxTestForks = 1 + maxScalacThreads = 1 userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 diff --git a/gradle.properties b/gradle.properties index 4880248cac..ee4b6e3bc1 100644 --- a/gradle.properties +++ b/gradle.properties @@ -30,4 +30,4 @@ scalaVersion=2.13.12 swaggerVersion=2.2.8 task=build org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC -org.gradle.parallel=true +org.gradle.parallel=false {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
showuon commented on PR #15077: URL: https://github.com/apache/kafka/pull/15077#issuecomment-1870263212 @satishd , I've closed all the instances in finally block quietly in this commit: https://github.com/apache/kafka/pull/15077/commits/dd913a8668cf773a51403e482cc704b44cb0e8a1 . Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: New year code cleanup - include final keyword [kafka]
vamossagar12 commented on PR #15072: URL: https://github.com/apache/kafka/pull/15072#issuecomment-1870244078 Thanks @divijvaidya , I think Ismael's point is correct. From what I recall, the Streams final rule also expects method arguments to be final (apart from local variables within methods). Enabling that rule would be a big change, I guess. Also, I see only variables and parameters [here](https://checkstyle.sourceforge.io/checks/coding/finallocalvariable.html#FinalLocalVariable) as tokens to check. I think we can leave it at this, but the only problem is that there won't be any enforcement of using final variables, so it might need cleanups in the future. That should be ok IMO. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
satishd commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1436993064 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -4020,6 +4031,7 @@ class ReplicaManagerTest { doneLatch.countDown() } finally { replicaManager.shutdown(checkpointHW = false) + remoteLogManager.close() Review Comment: Can you close/shutdown safely by swallowing any intermediate exceptions occur as mentioned below? ``` CoreUtils.swallow(replicaManager.shutdown(checkpointHW = false), this) CoreUtils.swallow(remoteLogManager.close(), this) ``` ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -2704,28 +2710,30 @@ class ReplicaManagerTest { time = time, scheduler = time.scheduler, logManager = logManager, - quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""), + quotaManagers = quotaManager, metadataCache = MetadataCache.zkMetadataCache(config.brokerId, config.interBrokerProtocolVersion), logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size), alterPartitionManager = alterPartitionManager, threadNamePrefix = Option(this.getClass.getName)) -logManager.startup(Set.empty[String]) - -// Create a hosted topic, a hosted topic that will become stray, and a stray topic -val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet -createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet -createStrayLogs(10, logManager) +try { + logManager.startup(Set.empty[String]) -val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) + // Create a hosted topic, a hosted topic that will become stray, and a stray topic + val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet + createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet + createStrayLogs(10, logManager) - replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) + val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1)) -assertEquals(validLogs, logManager.allLogs.toSet) -assertEquals(validLogs.size, replicaManager.partitionCount.value) + replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR)) -replicaManager.shutdown() -logManager.shutdown() + assertEquals(validLogs, logManager.allLogs.toSet) + assertEquals(validLogs.size, replicaManager.partitionCount.value) +} finally { + replicaManager.shutdown() + logManager.shutdown() Review Comment: Can you close both of them safely like below ignoring any intermediate failures? ``` CoreUtils.swallow(replicaManager.shutdown(checkpointHW = false), this) CoreUtils.swallow(remoteLogManager.close(), this) ``` ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -4116,6 +4128,7 @@ class ReplicaManagerTest { latch.countDown() } finally { replicaManager.shutdown(checkpointHW = false) + remoteLogManager.close() Review Comment: Please close safely as mentioned in the earlier comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]
showuon commented on code in PR #15077: URL: https://github.com/apache/kafka/pull/15077#discussion_r1436972102 ## core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala: ## @@ -2693,6 +2693,9 @@ class ReplicaManagerTest { } else { assertTrue(stray0.isInstanceOf[HostedPartition.Online]) } + +replicaManager.shutdown() +logManager.shutdown() Review Comment: You're right! Updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] MINOR: New year code cleanup - include final keyword [kafka]
divijvaidya commented on PR #15072: URL: https://github.com/apache/kafka/pull/15072#issuecomment-1870156605 Checkstyle doesn't have a rule [1] available to enforce that fields which are not changing are marked as final. Hence, I am not changing anything in the checkstyle here in this PR. [1] https://checkstyle.sourceforge.io/checks.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org