[jira] [Created] (KAFKA-16056) Worker poll timeout expiry can lead to Duplicate task assignments.

2023-12-27 Thread Sagar Rao (Jira)
Sagar Rao created KAFKA-16056:
-

 Summary: Worker poll timeout expiry can lead to Duplicate task 
assignments.
 Key: KAFKA-16056
 URL: https://issues.apache.org/jira/browse/KAFKA-16056
 Project: Kafka
  Issue Type: Bug
  Components: KafkaConnect
Reporter: Sagar Rao
Assignee: Sagar Rao


When a worker's poll timeout expires, it proactively leaves the group, which triggers a rebalance. Under normal circumstances, this departure starts a scheduled rebalance delay. One thing to note, however, is that the worker which left the group temporarily does not give up its assignments, and whatever connectors and tasks were running on it keep running as-is. When the scheduled rebalance delay elapses, the worker simply gets its assignments back, and since there are no revocations, everything works out fine.

But there is an edge case here. Assume a scheduled rebalance delay is already active on the group, and one worker's poll timeout expires just before the follow-up rebalance that is due once that delay elapses. At this point a rebalance is imminent, and the leader tracks the assignments of the transiently departed worker as lost 
[here|https://github.com/apache/kafka/blob/d582d5aff517879b150bc2739bad99df07e15e2b/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/IncrementalCooperativeAssignor.java#L255].
When 
[handleLostAssignments|https://github.com/apache/kafka/blob/d582d5aff517879b150bc2739bad99df07e15e2b/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/IncrementalCooperativeAssignor.java#L441]
 runs, the scheduled rebalance delay has not been reset yet, so if 
[this|https://github.com/apache/kafka/blob/d582d5aff517879b150bc2739bad99df07e15e2b/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/IncrementalCooperativeAssignor.java#L473]
 condition passes, the leader concludes that it needs to reassign all of the lost assignments, and it does.

But because the worker whose poll timeout expired does not rescind its assignments, we end up with duplicate assignments: one set on the original worker, which is still running the tasks and connectors, and another set on the remaining workers, which received the redistributed work. This can lead to task failures if the connector is written with the expectation that no duplicate tasks run across the cluster.

This edge case is also more likely to be hit if the `rebalance.timeout.ms` config is set to a low value.

One possible approach is to do something similar to https://issues.apache.org/jira/browse/KAFKA-9184, where, upon coordinator discovery failure, the worker gives up all its assignments and rejoins with an empty assignment. We could do something similar in this case as well.
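
For illustration only, a rough sketch of that direction. None of the names below are the real Connect runtime APIs (DistributedHerder/WorkerCoordinator); they only show the worker dropping its local assignment before rejoining, mirroring what KAFKA-9184 does on coordinator discovery failure:

{code:java}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the actual Connect runtime code.
public class PollTimeoutSketch {

    private final Set<String> assignedConnectors = new HashSet<>();
    private final Set<String> assignedTasks = new HashSet<>();

    /** Invoked when the worker notices its poll timeout expired and it has left the group. */
    public void onPollTimeoutExpired() {
        // Stop the running work and rescind the local assignment so that, once the
        // scheduled rebalance delay elapses, the leader can redistribute these
        // connectors/tasks without a second copy still running on this worker.
        assignedConnectors.clear();
        assignedTasks.clear();
    }

    /** The subsequent rejoin then advertises an empty assignment, as in KAFKA-9184. */
    public Set<String> assignmentToAdvertise() {
        return Collections.emptySet();
    }
}
{code}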




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [KAFKA-16015] Fix custom timeouts overwritten by defaults [kafka]

2023-12-27 Thread via GitHub


sciclon2 commented on code in PR #15030:
URL: https://github.com/apache/kafka/pull/15030#discussion_r1437137364


##
tools/src/main/java/org/apache/kafka/tools/LeaderElectionCommand.java:
##
@@ -99,8 +99,12 @@ static void run(Duration timeoutMs, String... args) throws Exception {
 
props.putAll(Utils.loadProps(commandOptions.getAdminClientConfig()));
 }
 props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, commandOptions.getBootstrapServer());
-props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis()));
-props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis() / 2));
+if (!props.containsKey(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG)) {
+props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis()));
+}
+if (!props.containsKey(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG)) {
+props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs.toMillis() / 2));

Review Comment:
   @divijvaidya, I will need to convert the value of `timeoutMs.toMillis()` to an int, since it returns a long; this comes from 
[here](https://github.com/apache/kafka/pull/15030/files#diff-7f2ce49cbfada7fbfff8453254f686d835710e2ee620b8905670e69ceaaa1809R66).
I will convert it like this: 
`props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, Integer.toString((int) timeoutMs.toMillis()));` 
Agree? The change will be made for both configs.
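
   For reference, a minimal sketch of the guarded defaults under discussion, with a bounds-checked long-to-int conversion. This is only an illustration; the helper name and the use of `Math.toIntExact` are assumptions, not necessarily what the PR ends up doing:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClientConfig;

public class TimeoutDefaults {

    // Apply the command-level timeout only when the user has not already set the configs.
    static void applyDefaultTimeouts(Properties props, Duration timeout) {
        if (!props.containsKey(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG)) {
            // Duration.toMillis() returns a long; both admin configs are ints,
            // so convert with an explicit range check.
            props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG,
                Integer.toString(Math.toIntExact(timeout.toMillis())));
        }
        if (!props.containsKey(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG)) {
            props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG,
                Integer.toString(Math.toIntExact(timeout.toMillis() / 2)));
        }
    }
}
```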



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15556: Remove NetworkClientDelegate methods isUnavailable, maybeThrowAuthFailure, and tryConnect [kafka]

2023-12-27 Thread via GitHub


Phuc-Hong-Tran commented on PR #15020:
URL: https://github.com/apache/kafka/pull/15020#issuecomment-1870808021

   @philipnee @kirktrue, Please take a look if you guys have some free time. 
Thanks in advance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (KAFKA-15948) Refactor AsyncKafkaConsumer shutdown

2023-12-27 Thread Philip Nee (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Nee resolved KAFKA-15948.

  Assignee: Philip Nee  (was: Phuc Hong Tran)
Resolution: Fixed

> Refactor AsyncKafkaConsumer shutdown
> 
>
> Key: KAFKA-15948
> URL: https://issues.apache.org/jira/browse/KAFKA-15948
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, consumer
>Reporter: Philip Nee
>Assignee: Philip Nee
>Priority: Major
>  Labels: consumer-threading-refactor
> Fix For: 3.8.0
>
>
> Upon closing we need a round trip from the network thread to the application 
> thread and then back to the network thread to complete the callback 
> invocation.  Currently, we don't have any of that.  I think we need to 
> refactor our closing mechanism.  There are a few points to the refactor:
>  # The network thread should know if there's a custom user callback to 
> trigger or not.  If there is, it should wait for the callback completion to 
> send a leave group.  If not, it should proceed with the shutdown.
>  # The application thread sends a closing signal to the network thread and 
> continuously polls the background event handler until time runs out.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15948) Refactor AsyncKafkaConsumer shutdown

2023-12-27 Thread Phuc Hong Tran (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phuc Hong Tran reassigned KAFKA-15948:
--

Assignee: Phuc Hong Tran

> Refactor AsyncKafkaConsumer shutdown
> 
>
> Key: KAFKA-15948
> URL: https://issues.apache.org/jira/browse/KAFKA-15948
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, consumer
>Reporter: Philip Nee
>Assignee: Phuc Hong Tran
>Priority: Major
>  Labels: consumer-threading-refactor
> Fix For: 3.8.0
>
>
> Upon closing we need a round trip from the network thread to the application 
> thread and then back to the network thread to complete the callback 
> invocation.  Currently, we don't have any of that.  I think we need to 
> refactor our closing mechanism.  There are a few points to the refactor:
>  # The network thread should know if there's a custom user callback to 
> trigger or not.  If there is, it should wait for the callback completion to 
> send a leave group.  If not, it should proceed with the shutdown.
>  # The application thread sends a closing signal to the network thread and 
> continuously polls the background event handler until time runs out.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] KAFKA-15344: Commit leader epoch where possible [kafka]

2023-12-27 Thread via GitHub


github-actions[bot] commented on PR #14454:
URL: https://github.com/apache/kafka/pull/14454#issuecomment-1870790976

   This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: Range assignor changes [kafka]

2023-12-27 Thread via GitHub


github-actions[bot] commented on PR #14345:
URL: https://github.com/apache/kafka/pull/14345#issuecomment-1870791020

   This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


showuon commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1437350376


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -2704,28 +2710,30 @@ class ReplicaManagerTest {
   time = time,
   scheduler = time.scheduler,
   logManager = logManager,
-  quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""),
+  quotaManagers = quotaManager,
   metadataCache = MetadataCache.zkMetadataCache(config.brokerId, config.interBrokerProtocolVersion),
   logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size),
   alterPartitionManager = alterPartitionManager,
   threadNamePrefix = Option(this.getClass.getName))
 
-logManager.startup(Set.empty[String])
-
-// Create a hosted topic, a hosted topic that will become stray, and a stray topic
-val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet
-createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
-createStrayLogs(10, logManager)
+try {
+  logManager.startup(Set.empty[String])
 
-val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1))
+  // Create a hosted topic, a hosted topic that will become stray, and a stray topic
+  val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet
+  createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
+  createStrayLogs(10, logManager)
 
-replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
+  val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1))
 
-assertEquals(validLogs, logManager.allLogs.toSet)
-assertEquals(validLogs.size, replicaManager.partitionCount.value)
+  replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
 
-replicaManager.shutdown()
-logManager.shutdown()
+  assertEquals(validLogs, logManager.allLogs.toSet)
+  assertEquals(validLogs.size, replicaManager.partitionCount.value)
+} finally {
+  replicaManager.shutdown()
+  logManager.shutdown()

Review Comment:
   I've updated the PR. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


showuon commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1437350213


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -2639,59 +2639,65 @@ class ReplicaManagerTest {
   time = time,
   scheduler = time.scheduler,
   logManager = logManager,
-  quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""),

Review Comment:
   I agree we should not change any test semantics here. I updated the PR to close the newly created `quotaManager` explicitly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


satishd commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1437344252


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -167,7 +167,7 @@ class ReplicaManagerTest {
 .foreach(checkpointFile => assertTrue(Files.exists(checkpointFile),
   s"checkpoint file does not exist at $checkpointFile"))
 } finally {
-  rm.shutdown(checkpointHW = false)
+  CoreUtils.swallow(rm.shutdown(checkpointHW = false), this)

Review Comment:
   Why is the exception swallowed here and in all the other places with the latest change? I did not mean to suggest this in my review; I clarified that in another 
[comment](https://github.com/apache/kafka/pull/15077#discussion_r1437341409).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800911#comment-17800911
 ] 

Justine Olshan commented on KAFKA-16052:


What do we think about lowering the number of sequences we test (say from 10 to 3 or 4) and the number of groups from 10 to 5 or so?
That would lower the number of operations from 35,000 to about 7,000. This is used in both the GroupCoordinator and the TransactionCoordinator tests (the transaction coordinator doesn't have groups, but transactions). We could also look at some of the other tests in the suite. (To be honest, I never really understood how this suite was actually testing concurrency with all the mocks and redefined operations.)

We probably need to figure out why we store all of these mock invocations, since we are just executing the methods anyway. But I think lowering the number of operations should give us some breathing room for the OOMs.

What do you think [~divijvaidya] [~ijuma]?
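
For reference, one way to keep Mockito from retaining invocations on a heavily exercised mock, independent of lowering the operation count, is to create the mock as stub-only. This is just an illustrative sketch with a made-up interface, not a claim about how TestReplicaManager or these tests are currently built:

{code:java}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;
import static org.mockito.Mockito.withSettings;

public class StubOnlyMockExample {

    // Illustrative stand-in for a heavily invoked dependency.
    interface Coordinator {
        String handle(String key);
    }

    public static void main(String[] args) {
        // stubOnly() tells Mockito not to record invocations, so calling the mock
        // millions of times does not accumulate InterceptedInvocation objects.
        // The trade-off is that verify(...) can no longer be used on this mock.
        Coordinator coordinator = mock(Coordinator.class, withSettings().stubOnly());
        when(coordinator.handle("k")).thenReturn("v");

        for (int i = 0; i < 1_000_000; i++) {
            coordinator.handle("k");
        }
        System.out.println("done: " + coordinator.handle("k"));
    }
}
{code}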

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


satishd commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1437341409


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -2704,28 +2710,30 @@ class ReplicaManagerTest {
   time = time,
   scheduler = time.scheduler,
   logManager = logManager,
-  quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""),
+  quotaManagers = quotaManager,
   metadataCache = MetadataCache.zkMetadataCache(config.brokerId, config.interBrokerProtocolVersion),
   logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size),
   alterPartitionManager = alterPartitionManager,
   threadNamePrefix = Option(this.getClass.getName))
 
-logManager.startup(Set.empty[String])
-
-// Create a hosted topic, a hosted topic that will become stray, and a stray topic
-val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet
-createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
-createStrayLogs(10, logManager)
+try {
+  logManager.startup(Set.empty[String])
 
-val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1))
+  // Create a hosted topic, a hosted topic that will become stray, and a stray topic
+  val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet
+  createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
+  createStrayLogs(10, logManager)
 
-replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
+  val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1))
 
-assertEquals(validLogs, logManager.allLogs.toSet)
-assertEquals(validLogs.size, replicaManager.partitionCount.value)
+  replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
 
-replicaManager.shutdown()
-logManager.shutdown()
+  assertEquals(validLogs, logManager.allLogs.toSet)
+  assertEquals(validLogs.size, replicaManager.partitionCount.value)
+} finally {
+  replicaManager.shutdown()
+  logManager.shutdown()

Review Comment:
   Right @showuon, what I meant here was to try closing the second one even if the earlier invocation has thrown an error, to avoid leaks caused by not closing the second instance. We should not change any of the other places, where there is only a single invocation of shutting down the ReplicaManager or LogManager, to swallow exceptions.
   
   One way to do that is what @divijvaidya suggested. 
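   
   For illustration, a minimal Java sketch of the "close the second resource even if the first shutdown throws" pattern being discussed. The names are placeholders; this is not the actual Scala test code:
   
```java
public class NestedShutdownSketch {

    // Placeholder for anything with a shutdown() method, e.g. the test's
    // replicaManager and logManager instances.
    interface Stoppable {
        void shutdown() throws Exception;
    }

    static void runTest(Stoppable replicaManager, Stoppable logManager) throws Exception {
        try {
            // ... test body ...
        } finally {
            try {
                replicaManager.shutdown();
            } finally {
                logManager.shutdown(); // still runs if replicaManager.shutdown() threw
            }
        }
    }
}
```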



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800909#comment-17800909
 ] 

Justine Olshan edited comment on KAFKA-16052 at 12/28/23 2:35 AM:
--

So I realized that every test runs thousands of operations that call not only 
replicaManager.maybeStartTransactionVerificationForPartition but also other 
replicaManager methods, like replicaManager.appendForGroup, thousands of times. 

I'm trying to figure out why this is causing regressions now. 

For the testConcurrentRandomSequence test, we call these methods approximately 
35,000 times.

We do 10 random sequences and 10 ordered sequences. Each sequence consists of 7 
operations for each of the 250 group members (nthreads(5) * 5 * 10). 
So 20 * 7 * 250 is 35,000. Not sure if we need this many operations!


was (Author: jolshan):
So I realized that every test runs thousands of operations that call not only 
replicaManager.maybeStartTransactionVerificationForPartition but also other 
replicaManager methods like replicaManager appendForGroup thousands of times. 

I'm trying to figure out why this is causing regressions now. 

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800909#comment-17800909
 ] 

Justine Olshan commented on KAFKA-16052:


So I realized that every test runs thousands of operations that call not only 
replicaManager.maybeStartTransactionVerificationForPartition but also other 
replicaManager methods, like replicaManager.appendForGroup, thousands of times. 

I'm trying to figure out why this is causing regressions now. 

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800903#comment-17800903
 ] 

Justine Olshan edited comment on KAFKA-16052 at 12/28/23 2:05 AM:
--

It looks like the method that is actually being called is – 
replicaManager.maybeStartTransactionVerificationForPartition.

This is overridden in TestReplicaManager to just do the callback. I'm not sure 
why it is storing all of these though. I will look into that next.


was (Author: jolshan):
It looks like the method that is actually being called is – 
replicaManager.maybeStartTransactionVerificationForPartition there.

This is overridden to just do the callback. I'm not sure why it is storing all 
of these though. I will look into that next.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800903#comment-17800903
 ] 

Justine Olshan commented on KAFKA-16052:


It looks like the method that is actually being called is – 
replicaManager.maybeStartTransactionVerificationForPartition there.

This is overridden to just do the callback. I'm not sure why it is storing all 
of these though. I will look into that next.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800902#comment-17800902
 ] 

Justine Olshan commented on KAFKA-16052:


Thanks Divij. This might be a few things then, since handleTxnCommitOffsets is not 
in the TransactionCoordinator test. 
I will look into it, though. 

I believe the OOMs started before I changed the transactional offset commit 
flow, but perhaps it is still related to something else I changed.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-15997) Ensure fairness in the uniform assignor

2023-12-27 Thread Ritika Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800900#comment-17800900
 ] 

Ritika Reddy edited comment on KAFKA-15997 at 12/28/23 1:41 AM:


Hey, from what I understand, you are trying to test the scenario where 8 members 
are subscribed to the same topic with 16 partitions. We want to test whether 
each member gets a balanced allocation, which would be two partitions each in 
this case. I have a few follow-up questions:
 * I see the log +16/32 partitions assigned+: are there 16 partitions or 32?
 * Is this the first assignment, or is it testing a rebalance scenario after some 
subscription changes, as the test name suggests? If it's the latter, what changes 
were made to trigger a rebalance?
 * Is this topic the only subscription, or are the members subscribed to other 
topics as well?


was (Author: JIRAUSER300287):
Hey, from what I understand you are trying to test the scenario where 8 members 
are subscribed to the same topic with 16 partitions. We want to test whether 
each member gets a balanced allocation which would be two partitions each in 
this case. I have a few follow up questions:
 * I see the log +16/32 partitions assigned+ : Are there 16 partitions or 32 
partitions?
 * Is this for the first assignment or is this testing a rebalance scenario 
after some subscription changes like the test name suggests?
 * Is this topic the only subscription or are there other topics that the 
members are subscribed to?

> Ensure fairness in the uniform assignor
> ---
>
> Key: KAFKA-15997
> URL: https://issues.apache.org/jira/browse/KAFKA-15997
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Emanuele Sabellico
>Assignee: Ritika Reddy
>Priority: Minor
>
>  
>  
> Fairness has to be ensured in uniform assignor as it was in 
> cooperative-sticky one.
> There's this test 0113 subtest u_multiple_subscription_changes in librdkafka 
> where 8 consumers are subscribing to the same topic, and it's verifying that 
> all of them are getting 2 partitions assigned. But with new protocol it seems 
> two consumers get assigned 3 partitions and 1 has zero partitions. The test 
> doesn't configure any client.rack.
> {code:java}
> [0113_cooperative_rebalance  /478.183s] Consumer assignments 
> (subscription_variation 0) (stabilized) (no rebalance cb):
> [0113_cooperative_rebalance  /478.183s] Consumer C_0#consumer-3 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [5] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [8] (4000msgs)
> [0113_cooperative_rebalance  /478.183s] Consumer C_1#consumer-4 assignment 
> (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [0] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [3] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [13] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_2#consumer-5 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [6] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [10] (2000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_3#consumer-6 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [7] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [9] (2000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_4#consumer-7 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [11] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [14] (3000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_5#consumer-8 assignment 
> (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [1] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [2] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [4] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_6#consumer-9 assignment 
> (0): 
> [0113_cooperative_rebalance  /478.184s] Consumer C_7#consumer-10 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [12] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [15] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] 16/32 partitions assigned
> [0113_cooperative_rebalance  /478.184s] Consumer C_0#consumer-3 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_1#consumer-4 has 3 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_2#consumer-5 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_3#consumer-6 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_4#consumer-7 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2

[jira] [Commented] (KAFKA-15997) Ensure fairness in the uniform assignor

2023-12-27 Thread Ritika Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800901#comment-17800901
 ] 

Ritika Reddy commented on KAFKA-15997:
--

For the first-assignment case, I have written a test for this scenario and it 
produces a balanced assignment:

public void testFirstAssignmentEightMembersOneTopicNoMemberRacks() {
    Map<Uuid, TopicMetadata> topicMetadata = new HashMap<>();
    topicMetadata.put(topic1Uuid, new TopicMetadata(
        topic1Uuid,
        topic1Name,
        16,
        mkMapOfPartitionRacks(3)
    ));

    Map<String, AssignmentMemberSpec> members = new TreeMap<>();
    for (int i = 0; i < 8; i++) {
        members.put("member" + i, new AssignmentMemberSpec(
            Optional.empty(),
            Optional.empty(),
            Arrays.asList(topic1Uuid),
            Collections.emptyMap()
        ));
    }

    AssignmentSpec assignmentSpec = new AssignmentSpec(members);
    SubscribedTopicMetadata subscribedTopicMetadata = new SubscribedTopicMetadata(topicMetadata);

    GroupAssignment computedAssignment = assignor.assign(assignmentSpec, subscribedTopicMetadata);
    System.out.println(computedAssignment);

    // 16 partitions over 8 members: every member should end up with exactly 2 partitions.
    computedAssignment.members().forEach((member, assignment) ->
        assertEquals(2, assignment.targetPartitions().get(topic1Uuid).size()));

    checkValidityAndBalance(members, computedAssignment);
}

> Ensure fairness in the uniform assignor
> ---
>
> Key: KAFKA-15997
> URL: https://issues.apache.org/jira/browse/KAFKA-15997
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Emanuele Sabellico
>Assignee: Ritika Reddy
>Priority: Minor
>
>  
>  
> Fairness has to be ensured in uniform assignor as it was in 
> cooperative-sticky one.
> There's this test 0113 subtest u_multiple_subscription_changes in librdkafka 
> where 8 consumers are subscribing to the same topic, and it's verifying that 
> all of them are getting 2 partitions assigned. But with new protocol it seems 
> two consumers get assigned 3 partitions and 1 has zero partitions. The test 
> doesn't configure any client.rack.
> {code:java}
> [0113_cooperative_rebalance  /478.183s] Consumer assignments 
> (subscription_variation 0) (stabilized) (no rebalance cb):
> [0113_cooperative_rebalance  /478.183s] Consumer C_0#consumer-3 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [5] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [8] (4000msgs)
> [0113_cooperative_rebalance  /478.183s] Consumer C_1#consumer-4 assignment 
> (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [0] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [3] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [13] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_2#consumer-5 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [6] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [10] (2000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_3#consumer-6 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [7] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [9] (2000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_4#consumer-7 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [11] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [14] (3000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_5#consumer-8 assignment 
> (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [1] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [2] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [4] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_6#consumer-9 assignment 
> (0): 
> [0113_cooperative_rebalance  /478.184s] Consumer C_7#consumer-10 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [12] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [15] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] 16/32 partitions assigned
> [0113_cooperative_rebalance  /478.184s] Consumer C_0#consumer-3 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_1#consumer-4 has 3 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_2#consumer-5 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_3#consumer-6 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_4#consumer-7 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_5#consumer-8 has 3 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_6#consumer-9 has 0 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.1

[jira] [Comment Edited] (KAFKA-15997) Ensure fairness in the uniform assignor

2023-12-27 Thread Ritika Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800900#comment-17800900
 ] 

Ritika Reddy edited comment on KAFKA-15997 at 12/28/23 1:39 AM:


Hey, from what I understand you are trying to test the scenario where 8 members 
are subscribed to the same topic with 16 partitions. We want to test whether 
each member gets a balanced allocation which would be two partitions each in 
this case. I have a few follow up questions:
 * I see the log +16/32 partitions assigned+ : Are there 16 partitions or 32 
partitions?
 * Is this for the first assignment or is this testing a rebalance scenario 
after some subscription changes like the test name suggests?
 * Is this topic the only subscription or are there other topics that the 
members are subscribed to?


was (Author: JIRAUSER300287):
Hey, from what I understand you are trying to test the scenario where 8 members 
are subscribed to the same topic with 16 partitions. We want to test whether 
each member gets a balanced allocation which would be two partitions each in 
this case. I have a few follow up questions:
 * 
16/32 partitions assigned

> Ensure fairness in the uniform assignor
> ---
>
> Key: KAFKA-15997
> URL: https://issues.apache.org/jira/browse/KAFKA-15997
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Emanuele Sabellico
>Assignee: Ritika Reddy
>Priority: Minor
>
>  
>  
> Fairness has to be ensured in uniform assignor as it was in 
> cooperative-sticky one.
> There's this test 0113 subtest u_multiple_subscription_changes in librdkafka 
> where 8 consumers are subscribing to the same topic, and it's verifying that 
> all of them are getting 2 partitions assigned. But with new protocol it seems 
> two consumers get assigned 3 partitions and 1 has zero partitions. The test 
> doesn't configure any client.rack.
> {code:java}
> [0113_cooperative_rebalance  /478.183s] Consumer assignments 
> (subscription_variation 0) (stabilized) (no rebalance cb):
> [0113_cooperative_rebalance  /478.183s] Consumer C_0#consumer-3 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [5] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [8] (4000msgs)
> [0113_cooperative_rebalance  /478.183s] Consumer C_1#consumer-4 assignment 
> (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [0] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [3] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [13] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_2#consumer-5 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [6] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [10] (2000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_3#consumer-6 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [7] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [9] (2000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_4#consumer-7 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [11] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [14] (3000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_5#consumer-8 assignment 
> (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [1] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [2] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [4] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_6#consumer-9 assignment 
> (0): 
> [0113_cooperative_rebalance  /478.184s] Consumer C_7#consumer-10 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [12] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [15] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] 16/32 partitions assigned
> [0113_cooperative_rebalance  /478.184s] Consumer C_0#consumer-3 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_1#consumer-4 has 3 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_2#consumer-5 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_3#consumer-6 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_4#consumer-7 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_5#consumer-8 has 3 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_6#consumer-9 has 0 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] 

[jira] [Commented] (KAFKA-15997) Ensure fairness in the uniform assignor

2023-12-27 Thread Ritika Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800900#comment-17800900
 ] 

Ritika Reddy commented on KAFKA-15997:
--

Hey, from what I understand you are trying to test the scenario where 8 members 
are subscribed to the same topic with 16 partitions. We want to test whether 
each member gets a balanced allocation which would be two partitions each in 
this case. I have a few follow up questions:
 * 
16/32 partitions assigned

> Ensure fairness in the uniform assignor
> ---
>
> Key: KAFKA-15997
> URL: https://issues.apache.org/jira/browse/KAFKA-15997
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Emanuele Sabellico
>Assignee: Ritika Reddy
>Priority: Minor
>
>  
>  
> Fairness has to be ensured in uniform assignor as it was in 
> cooperative-sticky one.
> There's this test 0113 subtest u_multiple_subscription_changes in librdkafka 
> where 8 consumers are subscribing to the same topic, and it's verifying that 
> all of them are getting 2 partitions assigned. But with new protocol it seems 
> two consumers get assigned 3 partitions and 1 has zero partitions. The test 
> doesn't configure any client.rack.
> {code:java}
> [0113_cooperative_rebalance  /478.183s] Consumer assignments 
> (subscription_variation 0) (stabilized) (no rebalance cb):
> [0113_cooperative_rebalance  /478.183s] Consumer C_0#consumer-3 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [5] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [8] (4000msgs)
> [0113_cooperative_rebalance  /478.183s] Consumer C_1#consumer-4 assignment 
> (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [0] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [3] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [13] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_2#consumer-5 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [6] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [10] (2000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_3#consumer-6 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [7] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [9] (2000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_4#consumer-7 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [11] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [14] (3000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_5#consumer-8 assignment 
> (3): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [1] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [2] (2000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [4] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] Consumer C_6#consumer-9 assignment 
> (0): 
> [0113_cooperative_rebalance  /478.184s] Consumer C_7#consumer-10 assignment 
> (2): rdkafkatest_rnd24419cc75e59d8de_0113u_1 [12] (1000msgs), 
> rdkafkatest_rnd24419cc75e59d8de_0113u_1 [15] (1000msgs)
> [0113_cooperative_rebalance  /478.184s] 16/32 partitions assigned
> [0113_cooperative_rebalance  /478.184s] Consumer C_0#consumer-3 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_1#consumer-4 has 3 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_2#consumer-5 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_3#consumer-6 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_4#consumer-7 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_5#consumer-8 has 3 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_6#consumer-9 has 0 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [0113_cooperative_rebalance  /478.184s] Consumer C_7#consumer-10 has 2 
> assigned partitions (1 subscribed topic(s)), expecting 2 assigned partitions
> [                      /479.057s] 1 test(s) running: 
> 0113_cooperative_rebalance
> [                      /480.057s] 1 test(s) running: 
> 0113_cooperative_rebalance
> [                      /481.057s] 1 test(s) running: 
> 0113_cooperative_rebalance
> [0113_cooperative_rebalance  /482.498s] TEST FAILURE
> ### Test "0113_cooperative_rebalance (u_multiple_subscription_changes:2390: 
> use_rebalance_cb: 0, subscription_variation: 0)" failed at 
> test.c:1243:check_test_timeouts() at Thu Dec  7 15:52:15 2023: ###
> Test 0113_cooperative_rebalance (u_multiple_subscription_chang

[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800893#comment-17800893
 ] 

Divij Vaidya edited comment on KAFKA-16052 at 12/27/23 11:23 PM:
-

Also, when we look at what is inside the InterceptedInvocation objects, they 
point at the 
"groupCoordinator.handleTxnCommitOffsets(member.group.groupId, "dummy-txn-id", 
producerId, producerEpoch,
JoinGroupRequest.UNKNOWN_MEMBER_ID, Option.empty, 
JoinGroupRequest.UNKNOWN_GENERATION_ID,
offsets, callbackWithTxnCompletion)" line. Note that there are ~700L 
invocations recorded by Mockito for this.

(You can also play with the heap dump I linked above to find this information)

!Screenshot 2023-12-28 at 00.18.56.png!


was (Author: divijvaidya):
Also, when we look at what is inside the InterceptedInvocation objects, it 
hints at "
groupCoordinator.handleTxnCommitOffsets(member.group.groupId, "dummy-txn-id", 
producerId, producerEpoch,
JoinGroupRequest.UNKNOWN_MEMBER_ID, Option.empty, 
JoinGroupRequest.UNKNOWN_GENERATION_ID,
offsets, callbackWithTxnCompletion)" line. Note that there are ~700L 
invocations in mockito for this.


!Screenshot 2023-12-28 at 00.18.56.png!

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800893#comment-17800893
 ] 

Divij Vaidya commented on KAFKA-16052:
--

Also, when we look at what is inside the InterceptedInvocation objects, they 
point at the 
"groupCoordinator.handleTxnCommitOffsets(member.group.groupId, "dummy-txn-id", 
producerId, producerEpoch,
JoinGroupRequest.UNKNOWN_MEMBER_ID, Option.empty, 
JoinGroupRequest.UNKNOWN_GENERATION_ID,
offsets, callbackWithTxnCompletion)" line. Note that there are ~700L 
invocations recorded by Mockito for this.


!Screenshot 2023-12-28 at 00.18.56.png!
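
As a rough illustration of why that many invocations stay on the heap (using a made-up 
Coordinator interface, not Kafka's GroupCoordinator or the actual test code): Mockito 
records every call on a mock as an invocation object and retains it until the mock 
itself is released or its state is cleared.

{code:java}
import org.mockito.Mockito;

public class MockInvocationRetentionSketch {

    // Hypothetical collaborator, only for this sketch.
    interface Coordinator {
        void handleCommit(String groupId, long producerId);
    }

    public static void main(String[] args) {
        Coordinator coordinator = Mockito.mock(Coordinator.class);

        // Each call below is recorded by Mockito and kept on the heap,
        // so a hot loop against a mock retains one invocation object per call.
        for (int i = 0; i < 700_000; i++) {
            coordinator.handleCommit("group-" + (i % 10), i);
        }

        System.out.println("recorded invocations: "
                + Mockito.mockingDetails(coordinator).getInvocations().size());
    }
}
{code}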

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Attachment: Screenshot 2023-12-28 at 00.18.56.png

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800892#comment-17800892
 ] 

Divij Vaidya edited comment on KAFKA-16052 at 12/27/23 11:19 PM:
-

I don't think that clearing the mocks is helping here. I ran "./gradlew 
:core:test --tests GroupCoordinatorConcurrencyTest --rerun" and took a heap 
dump. The test's heap usage spikes to 778MB at peak. Here's the heap dump 
[https://www.dropbox.com/scl/fi/dttxe7e8mzifaaw57er1d/GradleWorkerMain_79331_28_12_2023_00_10_47.hprof.zip?rlkey=cg6w2e26vzprbrz92gcdhwfkd&dl=0]
 

For reference, this is how I was clearing the mocks
{code:java}
@@ -89,11 +91,12 @@ class GroupCoordinatorConcurrencyTest extends 
AbstractCoordinatorConcurrencyTest
   @AfterEach
   override def tearDown(): Unit = {
     try {
-      if (groupCoordinator != null)
-        groupCoordinator.shutdown()
+      CoreUtils.swallow(groupCoordinator.shutdown(), this)
+      CoreUtils.swallow(metrics.close(), this)
     } finally {
       super.tearDown()
     }
+    Mockito.framework().clearInlineMocks()
   } {code}
!Screenshot 2023-12-28 at 00.13.06.png!


was (Author: divijvaidya):
I don't think that clearing the mocks is helping here. I ran "./gradlew 
:core:test --tests GroupCoordinatorConcurrencyTest --rerun" and took a heap 
dump. The test spikes to take 778MB heap at peak. Here's the heap dump 
[https://www.dropbox.com/scl/fi/dttxe7e8mzifaaw57er1d/GradleWorkerMain_79331_28_12_2023_00_10_47.hprof.zip?rlkey=cg6w2e26vzprbrz92gcdhwfkd&dl=0]
 

For reference, this is how I was clearing the mocks


{code:java}
@@ -89,11 +91,12 @@ class GroupCoordinatorConcurrencyTest extends 
AbstractCoordinatorConcurrencyTest
   @AfterEach
   override def tearDown(): Unit = {
     try {
-      if (groupCoordinator != null)
-        groupCoordinator.shutdown()
+      CoreUtils.swallow(groupCoordinator.shutdown(), this)
+      CoreUtils.swallow(metrics.close(), this)
     } finally {
       super.tearDown()
     }
+    Mockito.framework().clearInlineMocks()
   } {code}


!Screenshot 2023-12-28 at 00.13.06.png!

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800892#comment-17800892
 ] 

Divij Vaidya commented on KAFKA-16052:
--

I don't think that clearing the mocks is helping here. I ran "./gradlew 
:core:test --tests GroupCoordinatorConcurrencyTest --rerun" and took a heap 
dump. The test's heap usage spikes to 778MB at peak. Here's the heap dump 
[https://www.dropbox.com/scl/fi/dttxe7e8mzifaaw57er1d/GradleWorkerMain_79331_28_12_2023_00_10_47.hprof.zip?rlkey=cg6w2e26vzprbrz92gcdhwfkd&dl=0]
 

For reference, this is how I was clearing the mocks


{code:java}
@@ -89,11 +91,12 @@ class GroupCoordinatorConcurrencyTest extends 
AbstractCoordinatorConcurrencyTest
   @AfterEach
   override def tearDown(): Unit = {
     try {
-      if (groupCoordinator != null)
-        groupCoordinator.shutdown()
+      CoreUtils.swallow(groupCoordinator.shutdown(), this)
+      CoreUtils.swallow(metrics.close(), this)
     } finally {
       super.tearDown()
     }
+    Mockito.framework().clearInlineMocks()
   } {code}


!Screenshot 2023-12-28 at 00.13.06.png!

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Attachment: Screenshot 2023-12-28 at 00.13.06.png

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16055) Thread unsafe use of HashMap stored in QueryableStoreProvider#storeProviders

2023-12-27 Thread Kohei Nozaki (Jira)
Kohei Nozaki created KAFKA-16055:


 Summary: Thread unsafe use of HashMap stored in 
QueryableStoreProvider#storeProviders
 Key: KAFKA-16055
 URL: https://issues.apache.org/jira/browse/KAFKA-16055
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.6.1
Reporter: Kohei Nozaki


This was originally raised in [a kafka-users 
post|https://lists.apache.org/thread/gpct1275bfqovlckptn3lvf683qpoxol].

There is a HashMap stored in QueryableStoreProvider#storeProviders ([code 
link|https://github.com/apache/kafka/blob/3.6.1/streams/src/main/java/org/apache/kafka/streams/state/internals/QueryableStoreProvider.java#L39])
 which can be mutated by a KafkaStreams#removeStreamThread() call. This can be 
problematic when KafkaStreams#store is called from a separate thread.

We need to make this part of the code thread-safe by replacing the map with a 
ConcurrentHashMap and/or using an existing locking mechanism.
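
A minimal sketch of the ConcurrentHashMap direction, using a simplified hypothetical 
registry rather than the actual QueryableStoreProvider code:

{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for a per-thread store provider registry; illustration only.
class StoreProviderRegistry<P> {

    private final Map<String, P> storeProviders = new ConcurrentHashMap<>();

    // Called when a stream thread is added.
    void add(String threadName, P provider) {
        storeProviders.put(threadName, provider);
    }

    // Called from the removeStreamThread() path; safe to run concurrently with lookups.
    void remove(String threadName) {
        storeProviders.remove(threadName);
    }

    // Called from the store() path on a different thread; iteration over a
    // ConcurrentHashMap is weakly consistent and never throws
    // ConcurrentModificationException.
    List<P> all() {
        return List.copyOf(storeProviders.values());
    }
}
{code}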



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] Duplicate method; The QuotaUtils one is used. [kafka]

2023-12-27 Thread via GitHub


jolshan commented on PR #15066:
URL: https://github.com/apache/kafka/pull/15066#issuecomment-1870661247

   @afshing -- can you link the commit that moved the code in your description?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] KAFKA-16045 Fix flaky testMigrateTopicDeletions [kafka]

2023-12-27 Thread via GitHub


mumrah opened a new pull request, #15082:
URL: https://github.com/apache/kafka/pull/15082

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800880#comment-17800880
 ] 

Justine Olshan commented on KAFKA-16052:


There are only four tests in the group coordinator suite, so it is surprising 
to hit OOM on those. But trying clearInlineMocks should be low-hanging 
fruit that could help a bit while we figure out why the mocks are causing so 
many issues in the first place.

I have a draft PR with the change and we can look at how the builds do: 
https://github.com/apache/kafka/pull/15081

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] WIP -- clear mocks in AbstractCoordinatorConcurrencyTest [kafka]

2023-12-27 Thread via GitHub


jolshan opened a new pull request, #15081:
URL: https://github.com/apache/kafka/pull/15081

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800877#comment-17800877
 ] 

Justine Olshan commented on KAFKA-16052:


Ah – good point [~ijuma]. I will look into that as well.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800875#comment-17800875
 ] 

Ismael Juma commented on KAFKA-16052:
-

It can be done via Mockito.framework().clearInlineMocks().
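
For example, in a JUnit 5 test it is a one-liner in the per-test teardown (sketch only; 
the affected Kafka test is Scala, but the call is the same):

{code:java}
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.Test;
import org.mockito.Mockito;

class MockHeavyTest {

    @Test
    void someMockHeavyTest() {
        // ... test body using Mockito mocks ...
    }

    @AfterEach
    void tearDown() {
        // Releases state retained by the inline mock maker so recorded
        // invocations do not accumulate across tests in the same JVM.
        Mockito.framework().clearInlineMocks();
    }
}
{code}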

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800874#comment-17800874
 ] 

Ismael Juma commented on KAFKA-16052:
-

Have you tried clearing the mocks during tearDown?

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800865#comment-17800865
 ] 

Justine Olshan commented on KAFKA-16052:


Taking a look at removing the mock as well. There are quite a few parts of the 
initialization that we don't need to do for the tests. 

For example:


{code:java}
val replicaFetcherManager = createReplicaFetcherManager(metrics, time, threadNamePrefix, quotaManagers.follower)
private[server] val replicaAlterLogDirsManager = createReplicaAlterLogDirsManager(quotaManagers.alterLogDirs, brokerTopicStats)
{code}

We can create these if necessary, but they also start other threads that we probably 
don't want running during the test.
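
A hedged sketch of that idea with made-up class names (not Kafka's ReplicaManager): 
rather than mocking the whole component, a test-only subclass can override just the 
factory hooks that would start background threads.

{code:java}
// Hypothetical classes for illustration only.
class FetcherManager {
    void start() {
        // In the real component this would spawn fetcher threads.
        new Thread(() -> { /* fetch loop */ }).start();
    }
}

class Manager {
    protected FetcherManager createFetcherManager() {
        return new FetcherManager();
    }
}

// Test-only subclass: no mock involved, and no background threads are started.
class TestManager extends Manager {
    @Override
    protected FetcherManager createFetcherManager() {
        return new FetcherManager() {
            @Override
            void start() {
                // no-op for tests
            }
        };
    }
}
{code}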

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-16045) ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky

2023-12-27 Thread David Arthur (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Arthur reassigned KAFKA-16045:


Assignee: David Arthur

> ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky
> -
>
> Key: KAFKA-16045
> URL: https://issues.apache.org/jira/browse/KAFKA-16045
> Project: Kafka
>  Issue Type: Test
>Reporter: Justine Olshan
>Assignee: David Arthur
>Priority: Major
>  Labels: flaky-test
>
> I'm seeing ZkMigrationIntegrationTest.testMigrateTopicDeletion fail for many 
> builds. I believe it is also causing a thread leak, because on most runs where 
> it fails I also see ReplicaManager tests fail with extra threads. 
> The test always fails with 
> `org.opentest4j.AssertionFailedError: Timed out waiting for topics to be 
> deleted`
> gradle enterprise link:
> [https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTim[…]lues=trunk&tests.container=kafka.zk.ZkMigrationIntegrationTest|https://ge.apache.org/scans/tests?search.names=Git%20branch&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.timeZoneId=America%2FLos_Angeles&search.values=trunk&tests.container=kafka.zk.ZkMigrationIntegrationTest]
> recent pr: 
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15023/18/tests/]
> trunk builds: 
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2502/tests],
>  
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2501/tests]
>  (edited) 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800855#comment-17800855
 ] 

Justine Olshan edited comment on KAFKA-16052 at 12/27/23 7:20 PM:
--

Doing a quick scan of the InterceptedInvocations, I see many with the method 
numDelayedDelete, which is a method on DelayedOperations. I will look into that 
avenue for a bit.


was (Author: jolshan):
Doing a quick scan at the InterceptedInvocations, I see many with the method 
numDelayedDelete which is a method on DelayedOperations. I will look into that 
avenue for a bit.

EDIT: It doesn't seem like we mock DelayedOperations in these tests, so not 
sure if this is something else.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800855#comment-17800855
 ] 

Justine Olshan edited comment on KAFKA-16052 at 12/27/23 7:19 PM:
--

Doing a quick scan of the InterceptedInvocations, I see many with the method 
numDelayedDelete, which is a method on DelayedOperations. I will look into that 
avenue for a bit.

EDIT: It doesn't seem like we mock DelayedOperations in these tests, so not 
sure if this is something else.


was (Author: jolshan):
Doing a quick scan at the InterceptedInvocations, I see many with the method 
numDelayedDelete which is a method on DelayedOperations. I will look into that 
avenue for a bit.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800855#comment-17800855
 ] 

Justine Olshan commented on KAFKA-16052:


Doing a quick scan of the InterceptedInvocations, I see many with the method 
numDelayedDelete, which is a method on DelayedOperations. I will look into that 
avenue for a bit.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800853#comment-17800853
 ] 

Divij Vaidya commented on KAFKA-16052:
--

Yes Justine, that is my current line of thought as well. The mock of 
TestReplicaManager has many invocations. I am now trying to use 
TestReplicaManager directly instead of the mock in the test. I think that we are 
mocking it just to avoid calling the constructor of ReplicaManager (so that we can 
use our default purgatory). If you can get that working, it would be awesome!

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the utilized heap memory looks: it grows from tens of MB to 
> 1.5GB (with spikes of 2GB) as the test executes. Note that the total number 
> of threads also increases, but this does not correlate with the sharp 
> increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800852#comment-17800852
 ] 

Justine Olshan commented on KAFKA-16052:


I can also take a look at the heap dump if that is helpful.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the utilized heap memory looks: it grows from tens of MB to 
> 1.5GB (with spikes of 2GB) as the test executes. Note that the total number 
> of threads also increases, but this does not correlate with the sharp 
> increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800851#comment-17800851
 ] 

Justine Olshan commented on KAFKA-16052:


Thanks Divij for the digging. These tests share the common 
AbstractCoordinatorConcurrencyTest. I wonder if the mock in question is there.

We do mock the replica manager, but in an interesting way:


replicaManager = mock(classOf[TestReplicaManager], withSettings().defaultAnswer(CALLS_REAL_METHODS))

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the utilized heap memory looks: it grows from tens of MB to 
> 1.5GB (with spikes of 2GB) as the test executes. Note that the total number 
> of threads also increases, but this does not correlate with the sharp 
> increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] KAFKA-16047: Leverage the fenceProducers timeout in the InitProducerId [kafka]

2023-12-27 Thread via GitHub


gharris1727 commented on code in PR #15078:
URL: https://github.com/apache/kafka/pull/15078#discussion_r1437200184


##
clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java:
##
@@ -82,9 +86,10 @@ InitProducerIdRequest.Builder buildSingleRequest(int 
brokerId, CoordinatorKey ke
 .setProducerEpoch(ProducerIdAndEpoch.NONE.epoch)
 .setProducerId(ProducerIdAndEpoch.NONE.producerId)
 .setTransactionalId(key.idValue)
-// Set transaction timeout to 1 since it's only being initialized 
to fence out older producers with the same transactional ID,
-// and shouldn't be used for any actual record writes
-.setTransactionTimeoutMs(1);
+// Set transaction timeout to the equivalent as the fenceProducers 
request since it's only being initialized to fence out older producers with the 
same transactional ID,
+// and shouldn't be used for any actual record writes. This has 
been changed to match the fenceProducers request timeout from one as some 
brokers may be slower than expected
+// and we need a safe timeout that allows the transaction init to 
finish.
+.setTransactionTimeoutMs(this.options.timeoutMs());

Review Comment:
   The timeoutMs is nullable, and is always null when using the default 
FenceProducersOptions.
   When it is null we should use the KafkaAdminClient#defaultApiTimeoutMs, 
similar to deleteRecords or calcDeadlineMs.
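
As a hedged sketch of the fallback being suggested (the names optionTimeoutMs and 
DEFAULT_API_TIMEOUT_MS below are illustrative stand-ins, not the actual 
KafkaAdminClient fields or the FenceProducersOptions API):

{code:java}
public class TimeoutFallbackSketch {

    // Illustrative stand-in for FenceProducersOptions#timeoutMs(), which may be null.
    static Integer optionTimeoutMs = null;

    // Illustrative stand-in for the admin client's configured default API timeout.
    static final int DEFAULT_API_TIMEOUT_MS = 60_000;

    static int effectiveTimeoutMs() {
        // Use the per-call option when present, otherwise fall back to the default,
        // mirroring what the review comment suggests for setTransactionTimeoutMs.
        return optionTimeoutMs != null ? optionTimeoutMs : DEFAULT_API_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        System.out.println(effectiveTimeoutMs()); // 60000 when no option is set
        optionTimeoutMs = 15_000;
        System.out.println(effectiveTimeoutMs()); // 15000 when a timeout is supplied
    }
}
{code}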



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk

2023-12-27 Thread Ron Dagostino (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800845#comment-17800845
 ] 

Ron Dagostino commented on KAFKA-15495:
---

Thanks, [~jsancio].  I've updated the title and description to make it clear 
this is a general problem as opposed to being KRaft-specific, and I've 
indicated it affects all released versions back to 1.0.0.  I've also linked it 
to the ELR ticket at https://issues.apache.org/jira/browse/KAFKA-15332.




> Partition truncated when the only ISR member restarts with an empty disk
> 
>
> Key: KAFKA-15495
> URL: https://issues.apache.org/jira/browse/KAFKA-15495
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 2.0.0, 2.0.1, 2.1.0, 
> 2.2.0, 2.1.1, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1, 2.5.0, 2.4.1, 2.6.0, 2.5.1, 
> 2.7.0, 2.6.1, 2.8.0, 2.7.1, 2.6.2, 3.1.0, 2.6.3, 2.7.2, 2.8.1, 3.0.0, 3.0.1, 
> 2.8.2, 3.2.0, 3.1.1, 3.3.0, 3.0.2, 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, 
> 3.3.2, 3.5.0, 3.4.1, 3.6.0, 3.5.1, 3.5.2, 3.6.1
>Reporter: Ron Dagostino
>Priority: Critical
>
> Assume a topic-partition has just a single leader replica in the ISR.  Assume 
> next that this replica goes offline.  This replica's log will define the 
> contents of that partition when the replica restarts, which is correct 
> behavior.  However, assume now that the replica has a disk failure, and we 
> then replace the failed disk with a new, empty disk that we also format with 
> the storage tool so it has the correct cluster ID.  If we then restart the 
> broker, the topic-partition will have no data in it, and any other replicas 
> that might exist will truncate their logs to match, which results in data 
> loss.  See below for a step-by-step demo of how to reproduce this using KRaft 
> (the issue impacts ZK-based implementations as well, but we supply only a 
> KRaft-based reproduce case here):
> Note that implementing Eligible Leader Replicas 
> (https://issues.apache.org/jira/browse/KAFKA-15332) will resolve this issue.
> STEPS TO REPRODUCE:
> Create a single broker cluster with single controller.  The standard files 
> under config/kraft work well:
> bin/kafka-storage.sh random-uuid
> J8qXRwI-Qyi2G0guFTiuYw
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/controller.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker.properties
> bin/kafka-server-start.sh config/kraft/controller.properties
> bin/kafka-server-start.sh config/kraft/broker.properties
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 
> --partitions 1 --replication-factor 1
> #create __consumer-offsets topics
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 
> --from-beginning
> ^C
> #confirm that __consumer_offsets topic partitions are all created and on 
> broker with node id 2
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
> Now create 2 more brokers, with node IDs 11 and 12
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 
> 's/localhost:9092/localhost:9011/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > 
> config/kraft/broker11.properties
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 
> 's/localhost:9092/localhost:9012/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > 
> config/kraft/broker12.properties
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker11.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker12.properties
> bin/kafka-server-start.sh config/kraft/broker11.properties
> bin/kafka-server-start.sh config/kraft/broker12.properties
> #create a topic with a single partition replicated on two brokers
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 
> --partitions 1 --replication-factor 2
> #reassign partitions onto brokers with Node IDs 11 and 12
> echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], 
> "version":1}' > /tmp/reassign.json
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 
> --reassignment-json-file /tmp/reassign.json --execute
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 
> --reassignment-json-file /tmp/reassign.json --verify
> #make preferred leader 11 the actual leader if it not
> bin/kafka-leader-election.sh --bootstrap-server localhos

[jira] [Commented] (KAFKA-14132) Remaining PowerMock to Mockito tests

2023-12-27 Thread Christo Lolov (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800833#comment-17800833
 ] 

Christo Lolov commented on KAFKA-14132:
---

Heya [~enether], thanks for checking in on this one! I need to do another pass 
to confirm that everything using PowerMock has indeed been migrated. I still 
have a couple of classes using EasyMock which need to be moved, and given the 
timeline that won't happen for 3.7. What is left here, which I haven't 
documented, is that the PowerMock dependency needs to be removed from Kafka and 
the build script needs to be changed!

> Remaining PowerMock to Mockito tests
> 
>
> Key: KAFKA-14132
> URL: https://issues.apache.org/jira/browse/KAFKA-14132
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Christo Lolov
>Assignee: Christo Lolov
>Priority: Major
> Fix For: 3.8.0
>
>
> {color:#de350b}Some of the tests below use EasyMock as well. For those 
> migrate both PowerMock and EasyMock to Mockito.{color}
> Unless stated in brackets the tests are in the connect module.
> A list of tests which still require to be moved from PowerMock to Mockito as 
> of 2nd of August 2022 which do not have a Jira issue and do not have pull 
> requests I am aware of which are opened:
> {color:#ff8b00}InReview{color}
> {color:#00875a}Merged{color}
>  # {color:#00875a}ErrorHandlingTaskTest{color} (owner: [~shekharrajak])
>  # {color:#00875a}SourceTaskOffsetCommiterTest{color} (owner: Christo)
>  # {color:#00875a}WorkerMetricsGroupTest{color} (owner: Divij)
>  # {color:#00875a}WorkerTaskTest{color} (owner: [~yash.mayya])
>  # {color:#00875a}ErrorReporterTest{color} (owner: [~yash.mayya])
>  # {color:#00875a}RetryWithToleranceOperatorTest{color} (owner: [~yash.mayya])
>  # {color:#00875a}WorkerErrantRecordReporterTest{color} (owner: [~yash.mayya])
>  # {color:#00875a}ConnectorsResourceTest{color} (owner: [~mdedetrich-aiven])
>  # {color:#00875a}StandaloneHerderTest{color} (owner: [~mdedetrich-aiven])
>  # KafkaConfigBackingStoreTest (owner: [~bachmanity1])
>  # {color:#00875a}KafkaOffsetBackingStoreTest{color} (owner: Christo) 
> ([https://github.com/apache/kafka/pull/12418])
>  # {color:#00875a}KafkaBasedLogTest{color} (owner: @bachmanity ])
>  # {color:#00875a}RetryUtilTest{color} (owner: [~yash.mayya])
>  # {color:#00875a}RepartitionTopicTest{color} (streams) (owner: Christo)
>  # {color:#00875a}StateManagerUtilTest{color} (streams) (owner: Christo)
> *The coverage report for the above tests after the change should be >= to 
> what the coverage is now.*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15147) Measure pending and outstanding Remote Segment operations

2023-12-27 Thread Christo Lolov (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800832#comment-17800832
 ] 

Christo Lolov commented on KAFKA-15147:
---

Heya [~enether]! I believe all the work that we wanted to do in order to close 
KIP-963 is done. The only remaining bit is to update the JMX documentation on 
the Apache Kafka website, but as far as I am aware that is in a different 
repository and I was planning on doing it after New Year's. Is this suitable for 
the 3.7 release or should I try to find time in the next couple of days to get 
the documentation sorted?

> Measure pending and outstanding Remote Segment operations
> -
>
> Key: KAFKA-15147
> URL: https://issues.apache.org/jira/browse/KAFKA-15147
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Jorge Esteban Quilcate Otoya
>Assignee: Christo Lolov
>Priority: Major
>  Labels: tiered-storage
> Fix For: 3.7.0
>
>
>  
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-963%3A+Upload+and+delete+lag+metrics+in+Tiered+Storage
>  
> KAFKA-15833: RemoteCopyLagBytes 
> KAFKA-16002: RemoteCopyLagSegments, RemoteDeleteLagBytes, 
> RemoteDeleteLagSegments
> KAFKA-16013: ExpiresPerSec
> KAFKA-16014: RemoteLogSizeComputationTime, RemoteLogSizeBytes, 
> RemoteLogMetadataCount
> KAFKA-15158: RemoteDeleteRequestsPerSec, RemoteDeleteErrorsPerSec, 
> BuildRemoteLogAuxStateRequestsPerSec, BuildRemoteLogAuxStateErrorsPerSec
> 
> Remote Log Segment operations (copy/delete) are executed by the Remote 
> Storage Manager, and recorded by Remote Log Metadata Manager (e.g. default 
> TopicBasedRLMM writes to the internal Kafka topic state changes on remote log 
> segments).
> As executions run, fail, and retry; it will be important to know how many 
> operations are pending and outstanding over time to alert operators.
> Pending operations are not enough to alert, as values can oscillate closer to 
> zero. An additional condition needs to apply (running time > threshold) to 
> consider an operation outstanding.
> Proposal:
> RemoteLogManager could be extended with 2 concurrent maps 
> (pendingSegmentCopies, pendingSegmentDeletes) `Map[Uuid, Long]` to measure 
> segmentId time when operation started, and based on this expose 2 metrics per 
> operation:
>  * pendingSegmentCopies: gauge of pendingSegmentCopies map
>  * outstandingSegmentCopies: loop over pending ops, and if now - startedTime 
> > timeout, then outstanding++ (maybe on debug level?)
> Is this a valuable metric to add to Tiered Storage? or better to solve on a 
> custom RLMM implementation?
> Also, does it require a KIP?
> Thanks!
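
A hedged Java sketch of the bookkeeping proposed in the quoted description, using 
java.util.UUID in place of Kafka's Uuid and made-up method names; it is purely 
illustrative and not the RemoteLogManager API:

{code:java}
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class SegmentOpMetricsSketch {

    // segmentId -> timestamp (ms) at which the copy operation started
    private final Map<UUID, Long> pendingSegmentCopies = new ConcurrentHashMap<>();
    private final long outstandingThresholdMs;

    SegmentOpMetricsSketch(long outstandingThresholdMs) {
        this.outstandingThresholdMs = outstandingThresholdMs;
    }

    void copyStarted(UUID segmentId, long nowMs) {
        pendingSegmentCopies.put(segmentId, nowMs);
    }

    void copyFinished(UUID segmentId) {
        pendingSegmentCopies.remove(segmentId);
    }

    // Gauge value: how many copy operations are currently pending.
    int pendingSegmentCopyCount() {
        return pendingSegmentCopies.size();
    }

    // Gauge value: pending copies whose running time exceeds the threshold.
    long outstandingSegmentCopyCount(long nowMs) {
        return pendingSegmentCopies.values().stream()
                .filter(startedMs -> nowMs - startedMs > outstandingThresholdMs)
                .count();
    }

    public static void main(String[] args) {
        SegmentOpMetricsSketch metrics = new SegmentOpMetricsSketch(30_000L);
        UUID segment = UUID.randomUUID();
        metrics.copyStarted(segment, 0L);
        System.out.println(metrics.pendingSegmentCopyCount());            // 1
        System.out.println(metrics.outstandingSegmentCopyCount(60_000L)); // 1 (ran > 30s)
        metrics.copyFinished(segment);
        System.out.println(metrics.pendingSegmentCopyCount());            // 0
    }
}
{code}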



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


jolshan commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1437157116


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -2639,59 +2639,65 @@ class ReplicaManagerTest {
   time = time,
   scheduler = time.scheduler,
   logManager = logManager,
-  quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""),

Review Comment:
   Was there a reason we created a separate one here? I noticed that we change 
the config above if zkMigration is enabled, but I'm not sure whether that makes 
a difference.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800828#comment-17800828
 ] 

Divij Vaidya commented on KAFKA-16052:
--

Digging into the new heap dump after disabling GroupCoordinatorConcurrencyTest, 
I found another culprit: the sister test of GroupCoordinatorConcurrencyTest, 
i.e. TransactionCoordinatorConcurrencyTest. 

!Screenshot 2023-12-27 at 17.44.09.png!

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the utilized heap memory looks: it grows from tens of MB to 
> 1.5GB (with spikes of 2GB) as the test executes. Note that the total number 
> of threads also increases, but this does not correlate with the sharp 
> increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Attachment: Screenshot 2023-12-27 at 17.44.09.png

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the utilized heap memory looks: it grows from tens of MB to 
> 1.5GB (with spikes of 2GB) as the test executes. Note that the total number 
> of threads also increases, but this does not correlate with the sharp 
> increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [KAFKA-16015] Fix custom timeouts overwritten by defaults [kafka]

2023-12-27 Thread via GitHub


sciclon2 commented on code in PR #15030:
URL: https://github.com/apache/kafka/pull/15030#discussion_r1437137364


##
tools/src/main/java/org/apache/kafka/tools/LeaderElectionCommand.java:
##
@@ -99,8 +99,12 @@ static void run(Duration timeoutMs, String... args) throws 
Exception {
 
props.putAll(Utils.loadProps(commandOptions.getAdminClientConfig()));
 }
 props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, 
commandOptions.getBootstrapServer());
-props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 
Long.toString(timeoutMs.toMillis()));
-props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, 
Long.toString(timeoutMs.toMillis() / 2));
+if 
(!props.containsKey(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG)) {
+props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 
Long.toString(timeoutMs.toMillis()));
+}
+if (!props.containsKey(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG)) {
+props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, 
Long.toString(timeoutMs.toMillis() / 2));

Review Comment:
   I will need to convert timeoutMs.toMillis() to an int, as the return value 
is a long; this comes from 
[here](https://github.com/apache/kafka/pull/15030/files#diff-7f2ce49cbfada7fbfff8453254f686d835710e2ee620b8905670e69ceaaa1809R66).
   I will then convert it to an int like this: 
`props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 
Integer.toString((int) timeoutMs.toMillis()));` Agree? The change will be made for 
both configs.
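
As a hedged side note on the proposed cast, a minimal sketch of the two conversion 
options: a plain (int) cast silently wraps on overflow, while Math.toIntExact fails 
fast with an ArithmeticException.

{code:java}
import java.time.Duration;

public class TimeoutCastSketch {
    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(30);

        // Plain cast: fine for sane timeouts, but silently wraps if the value ever
        // exceeds Integer.MAX_VALUE milliseconds (roughly 24.8 days).
        String viaCast = Integer.toString((int) timeout.toMillis());

        // Math.toIntExact: same result for valid values, but throws instead of
        // silently truncating on overflow.
        String viaExact = Integer.toString(Math.toIntExact(timeout.toMillis()));

        System.out.println(viaCast);  // 30000
        System.out.println(viaExact); // 30000
    }
}
{code}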



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [KAFKA-16015] Fix custom timeouts overwritten by defaults [kafka]

2023-12-27 Thread via GitHub


sciclon2 commented on code in PR #15030:
URL: https://github.com/apache/kafka/pull/15030#discussion_r1437137364


##
tools/src/main/java/org/apache/kafka/tools/LeaderElectionCommand.java:
##
@@ -99,8 +99,12 @@ static void run(Duration timeoutMs, String... args) throws 
Exception {
 
props.putAll(Utils.loadProps(commandOptions.getAdminClientConfig()));
 }
 props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, 
commandOptions.getBootstrapServer());
-props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 
Long.toString(timeoutMs.toMillis()));
-props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, 
Long.toString(timeoutMs.toMillis() / 2));
+if 
(!props.containsKey(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG)) {
+props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 
Long.toString(timeoutMs.toMillis()));
+}
+if (!props.containsKey(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG)) {
+props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, 
Long.toString(timeoutMs.toMillis() / 2));

Review Comment:
   I will need to convert timeoutMs.toMillis() to an int, as the return value 
is a long; this comes from 
[here](https://github.com/apache/kafka/pull/15030/files#diff-7f2ce49cbfada7fbfff8453254f686d835710e2ee620b8905670e69ceaaa1809R66).
   I will then convert it to an int like this: 
`props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 
Integer.toString((int) timeoutMs.toMillis()));` Agree?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (KAFKA-16043) Add Quota configuration for topics

2023-12-27 Thread Afshin Moazami (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Afshin Moazami reassigned KAFKA-16043:
--

Assignee: Afshin Moazami

> Add Quota configuration for topics
> --
>
> Key: KAFKA-16043
> URL: https://issues.apache.org/jira/browse/KAFKA-16043
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Afshin Moazami
>Assignee: Afshin Moazami
>Priority: Major
>
> To be able to have a topic-partition quota, we need to introduce two topic 
> configurations, for the producer-byte-rate and the consumer-byte-rate. 
> The assumption is that all partitions of the same topic get the same quota, 
> so we define one config per topic. 
> These configurations should work with both ZooKeeper and KRaft setups. 
> Also, we should define a default quota value (to be discussed) and 
> potentially use the same format as user/client default configuration using 
> `` as the value. 
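
A hedged sketch of what two such topic-level configs could look like using Kafka's 
ConfigDef; the config names and default values here are illustrative only, not what 
the KIP will necessarily settle on.

{code:java}
import org.apache.kafka.common.config.ConfigDef;

public class TopicQuotaConfigSketch {
    public static void main(String[] args) {
        ConfigDef topicConfigs = new ConfigDef()
            // Hypothetical per-topic produce quota in bytes/sec; -1 means "unlimited".
            .define("producer.byte.rate", ConfigDef.Type.LONG, -1L,
                    ConfigDef.Importance.MEDIUM,
                    "Maximum bytes per second producers may write to each partition of this topic.")
            // Hypothetical per-topic fetch quota in bytes/sec; -1 means "unlimited".
            .define("consumer.byte.rate", ConfigDef.Type.LONG, -1L,
                    ConfigDef.Importance.MEDIUM,
                    "Maximum bytes per second consumers may read from each partition of this topic.");

        // Prints the two defined config names.
        System.out.println(topicConfigs.names());
    }
}
{code}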



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-16044) Throttling using Topic Partition Quota

2023-12-27 Thread Afshin Moazami (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Afshin Moazami reassigned KAFKA-16044:
--

Assignee: Afshin Moazami

> Throttling using Topic Partition Quota 
> ---
>
> Key: KAFKA-16044
> URL: https://issues.apache.org/jira/browse/KAFKA-16044
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Afshin Moazami
>Assignee: Afshin Moazami
>Priority: Major
>
> With KAFKA-16042 introducing the topic-partition byte rate and metrics, and 
> KAFKA-16043 introducing the quota limit configuration at the topic level, we 
> can enforce quotas at the topic-partition level for configured topics. 
> More details in the 
> [KIP-1010|https://cwiki.apache.org/confluence/display/KAFKA/KIP-1010%3A+Topic+Partition+Quota]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk

2023-12-27 Thread Ron Dagostino (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Dagostino updated KAFKA-15495:
--
Description: 
Assume a topic-partition has just a single leader replica in the ISR.  Assume 
next that this replica goes offline.  This replica's log will define the 
contents of that partition when the replica restarts, which is correct 
behavior.  However, assume now that the replica has a disk failure, and we then 
replace the failed disk with a new, empty disk that we also format with the 
storage tool so it has the correct cluster ID.  If we then restart the broker, 
the topic-partition will have no data in it, and any other replicas that might 
exist will truncate their logs to match, which results in data loss.  See below 
for a step-by-step demo of how to reproduce this using KRaft (the issue impacts 
ZK-based implementations as well, but we supply only a KRaft-based reproduce 
case here):

Note that implementing Eligible Leader Replicas 
(https://issues.apache.org/jira/browse/KAFKA-15332) will resolve this issue.

STEPS TO REPRODUCE:

Create a single broker cluster with single controller.  The standard files 
under config/kraft work well:

bin/kafka-storage.sh random-uuid
J8qXRwI-Qyi2G0guFTiuYw

#ensure we start clean
/bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs

bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
config/kraft/controller.properties
bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
config/kraft/broker.properties

bin/kafka-server-start.sh config/kraft/controller.properties
bin/kafka-server-start.sh config/kraft/broker.properties

bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 
--partitions 1 --replication-factor 1

#create __consumer-offsets topics
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 
--from-beginning
^C

#confirm that __consumer_offsets topic partitions are all created and on broker 
with node id 2
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe

Now create 2 more brokers, with node IDs 11 and 12

cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 
's/localhost:9092/localhost:9011/g' |  sed 
's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > 
config/kraft/broker11.properties
cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 
's/localhost:9092/localhost:9012/g' |  sed 
's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > 
config/kraft/broker12.properties

#ensure we start clean
/bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12

bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
config/kraft/broker11.properties
bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
config/kraft/broker12.properties

bin/kafka-server-start.sh config/kraft/broker11.properties
bin/kafka-server-start.sh config/kraft/broker12.properties

#create a topic with a single partition replicated on two brokers
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 
--partitions 1 --replication-factor 2

#reassign partitions onto brokers with Node IDs 11 and 12
echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], 
"version":1}' > /tmp/reassign.json

bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 
--reassignment-json-file /tmp/reassign.json --execute
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 
--reassignment-json-file /tmp/reassign.json --verify

#make preferred leader 11 the actual leader if it not
bin/kafka-leader-election.sh --bootstrap-server localhost:9092 
--all-topic-partitions --election-type preferred

#Confirm both brokers are in ISR and 11 is the leader
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic foo2
Topic: foo2 TopicId: pbbQZ23UQ5mQqmZpoSRCLQ PartitionCount: 1   ReplicationFactor: 2    Configs: segment.bytes=1073741824
Topic: foo2 Partition: 0    Leader: 11  Replicas: 11,12 Isr: 12,11


#Emit some messages to the topic
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic foo2
1
2
3
4
5
^C

#confirm we see the messages
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo2 
--from-beginning
1
2
3
4
5
^C

#Again confirm both brokers are in ISR, leader is 11
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic foo2
Topic: foo2 TopicId: pbbQZ23UQ5mQqmZpoSRCLQ PartitionCount: 1   ReplicationFactor: 2    Configs: segment.bytes=1073741824
Topic: foo2 Partition: 0    Leader: 11  Replicas: 11,12 Isr: 12,11

#kill non-leader broker
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic foo2
Topic: foo2 TopicId: pbbQZ23UQ5mQqmZpoSRCLQ PartitionCount: 1   
ReplicationFactor: 2 

[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk

2023-12-27 Thread Ron Dagostino (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Dagostino updated KAFKA-15495:
--
Affects Version/s: 2.2.1
   2.3.0
   2.1.1
   2.2.0
   2.1.0
   2.0.1
   2.0.0
   1.1.1
   1.1.0
   1.0.2
   1.0.1
   1.0.0

> Partition truncated when the only ISR member restarts with an empty disk
> 
>
> Key: KAFKA-15495
> URL: https://issues.apache.org/jira/browse/KAFKA-15495
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 2.0.0, 2.0.1, 2.1.0, 
> 2.2.0, 2.1.1, 2.3.0, 2.2.1, 2.2.2, 2.4.0, 2.3.1, 2.5.0, 2.4.1, 2.6.0, 2.5.1, 
> 2.7.0, 2.6.1, 2.8.0, 2.7.1, 2.6.2, 3.1.0, 2.6.3, 2.7.2, 2.8.1, 3.0.0, 3.0.1, 
> 2.8.2, 3.2.0, 3.1.1, 3.3.0, 3.0.2, 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, 
> 3.3.2, 3.5.0, 3.4.1, 3.6.0, 3.5.1, 3.5.2, 3.6.1
>Reporter: Ron Dagostino
>Priority: Critical
>
> Assume a topic-partition in KRaft has just a single leader replica in the 
> ISR.  Assume next that this replica goes offline.  This replica's log will 
> define the contents of that partition when the replica restarts, which is 
> correct behavior.  However, assume now that the replica has a disk failure, 
> and we then replace the failed disk with a new, empty disk that we also 
> format with the storage tool so it has the correct cluster ID.  If we then 
> restart the broker, the topic-partition will have no data in it, and any 
> other replicas that might exist will truncate their logs to match, which 
> results in data loss.  See below for a step-by-step demo of how to reproduce 
> this.
> [KIP-858: Handle JBOD broker disk failure in 
> KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft]
>  introduces the concept of a Disk UUID that we can use to solve this problem. 
>  Specifically, when the leader restarts with an empty (but 
> correctly-formatted) disk, the actual UUID associated with the disk will be 
> different.  The controller will notice upon broker re-registration that its 
> disk UUID differs from what was previously registered.  Right now we have no 
> way of detecting this situation, but the disk UUID gives us that capability.
> STEPS TO REPRODUCE:
> Create a single broker cluster with single controller.  The standard files 
> under config/kraft work well:
> bin/kafka-storage.sh random-uuid
> J8qXRwI-Qyi2G0guFTiuYw
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/controller.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker.properties
> bin/kafka-server-start.sh config/kraft/controller.properties
> bin/kafka-server-start.sh config/kraft/broker.properties
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 
> --partitions 1 --replication-factor 1
> #create __consumer-offsets topics
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 
> --from-beginning
> ^C
> #confirm that __consumer_offsets topic partitions are all created and on 
> broker with node id 2
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
> Now create 2 more brokers, with node IDs 11 and 12
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 
> 's/localhost:9092/localhost:9011/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > 
> config/kraft/broker11.properties
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 
> 's/localhost:9092/localhost:9012/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > 
> config/kraft/broker12.properties
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker11.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker12.properties
> bin/kafka-server-start.sh config/kraft/broker11.properties
> bin/kafka-server-start.sh config/kraft/broker12.properties
> #create a topic with a single partition replicated on two brokers
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 
> --partitions 1 --replication-factor 2
> #reassign partitions onto brokers with Node IDs 11 and 12
> echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], 
> "version":1}' > /tmp/reassign.json
> bi

[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk

2023-12-27 Thread Ron Dagostino (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Dagostino updated KAFKA-15495:
--
Affects Version/s: 2.7.2
   2.6.3
   3.1.0
   2.6.2
   2.7.1
   2.8.0
   2.6.1
   2.7.0
   2.5.1
   2.6.0
   2.4.1
   2.5.0
   2.3.1
   2.4.0
   2.2.2

> Partition truncated when the only ISR member restarts with an empty disk
> 
>
> Key: KAFKA-15495
> URL: https://issues.apache.org/jira/browse/KAFKA-15495
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.2.2, 2.4.0, 2.3.1, 2.5.0, 2.4.1, 2.6.0, 2.5.1, 2.7.0, 
> 2.6.1, 2.8.0, 2.7.1, 2.6.2, 3.1.0, 2.6.3, 2.7.2, 2.8.1, 3.0.0, 3.0.1, 2.8.2, 
> 3.2.0, 3.1.1, 3.3.0, 3.0.2, 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, 3.3.2, 
> 3.5.0, 3.4.1, 3.6.0, 3.5.1, 3.5.2, 3.6.1
>Reporter: Ron Dagostino
>Priority: Critical
>
> Assume a topic-partition in KRaft has just a single leader replica in the 
> ISR.  Assume next that this replica goes offline.  This replica's log will 
> define the contents of that partition when the replica restarts, which is 
> correct behavior.  However, assume now that the replica has a disk failure, 
> and we then replace the failed disk with a new, empty disk that we also 
> format with the storage tool so it has the correct cluster ID.  If we then 
> restart the broker, the topic-partition will have no data in it, and any 
> other replicas that might exist will truncate their logs to match, which 
> results in data loss.  See below for a step-by-step demo of how to reproduce 
> this.
> [KIP-858: Handle JBOD broker disk failure in 
> KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft]
>  introduces the concept of a Disk UUID that we can use to solve this problem. 
>  Specifically, when the leader restarts with an empty (but 
> correctly-formatted) disk, the actual UUID associated with the disk will be 
> different.  The controller will notice upon broker re-registration that its 
> disk UUID differs from what was previously registered.  Right now we have no 
> way of detecting this situation, but the disk UUID gives us that capability.
> STEPS TO REPRODUCE:
> Create a single broker cluster with single controller.  The standard files 
> under config/kraft work well:
> bin/kafka-storage.sh random-uuid
> J8qXRwI-Qyi2G0guFTiuYw
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/controller.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker.properties
> bin/kafka-server-start.sh config/kraft/controller.properties
> bin/kafka-server-start.sh config/kraft/broker.properties
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 
> --partitions 1 --replication-factor 1
> #create __consumer-offsets topics
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 
> --from-beginning
> ^C
> #confirm that __consumer_offsets topic partitions are all created and on 
> broker with node id 2
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
> Now create 2 more brokers, with node IDs 11 and 12
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 
> 's/localhost:9092/localhost:9011/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > 
> config/kraft/broker11.properties
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 
> 's/localhost:9092/localhost:9012/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > 
> config/kraft/broker12.properties
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker11.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker12.properties
> bin/kafka-server-start.sh config/kraft/broker11.properties
> bin/kafka-server-start.sh config/kraft/broker12.properties
> #create a topic with a single partition replicated on two brokers
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 
> --partitions 1 --replication-factor 2
> #reassign partitions onto brokers with Node IDs 11 and 12
> echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], 
> "version":1}' > /tmp/reassign.json
> bi

[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk

2023-12-27 Thread Ron Dagostino (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Dagostino updated KAFKA-15495:
--
Affects Version/s: 3.2.1
   3.1.2
   3.0.2
   3.3.0
   3.1.1
   3.2.0
   2.8.2
   3.0.1
   3.0.0
   2.8.1

> Partition truncated when the only ISR member restarts with an empty disk
> 
>
> Key: KAFKA-15495
> URL: https://issues.apache.org/jira/browse/KAFKA-15495
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.8.1, 3.0.0, 3.0.1, 2.8.2, 3.2.0, 3.1.1, 3.3.0, 3.0.2, 
> 3.1.2, 3.2.1, 3.4.0, 3.2.2, 3.2.3, 3.3.1, 3.3.2, 3.5.0, 3.4.1, 3.6.0, 3.5.1, 
> 3.5.2, 3.6.1
>Reporter: Ron Dagostino
>Priority: Critical
>
> Assume a topic-partition in KRaft has just a single leader replica in the 
> ISR.  Assume next that this replica goes offline.  This replica's log will 
> define the contents of that partition when the replica restarts, which is 
> correct behavior.  However, assume now that the replica has a disk failure, 
> and we then replace the failed disk with a new, empty disk that we also 
> format with the storage tool so it has the correct cluster ID.  If we then 
> restart the broker, the topic-partition will have no data in it, and any 
> other replicas that might exist will truncate their logs to match, which 
> results in data loss.  See below for a step-by-step demo of how to reproduce 
> this.
> [KIP-858: Handle JBOD broker disk failure in 
> KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft]
>  introduces the concept of a Disk UUID that we can use to solve this problem. 
>  Specifically, when the leader restarts with an empty (but 
> correctly-formatted) disk, the actual UUID associated with the disk will be 
> different.  The controller will notice upon broker re-registration that its 
> disk UUID differs from what was previously registered.  Right now we have no 
> way of detecting this situation, but the disk UUID gives us that capability.
> STEPS TO REPRODUCE:
> Create a single broker cluster with single controller.  The standard files 
> under config/kraft work well:
> bin/kafka-storage.sh random-uuid
> J8qXRwI-Qyi2G0guFTiuYw
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/controller.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker.properties
> bin/kafka-server-start.sh config/kraft/controller.properties
> bin/kafka-server-start.sh config/kraft/broker.properties
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 
> --partitions 1 --replication-factor 1
> #create __consumer-offsets topics
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 
> --from-beginning
> ^C
> #confirm that __consumer_offsets topic partitions are all created and on 
> broker with node id 2
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
> Now create 2 more brokers, with node IDs 11 and 12
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 
> 's/localhost:9092/localhost:9011/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > 
> config/kraft/broker11.properties
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 
> 's/localhost:9092/localhost:9012/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > 
> config/kraft/broker12.properties
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker11.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker12.properties
> bin/kafka-server-start.sh config/kraft/broker11.properties
> bin/kafka-server-start.sh config/kraft/broker12.properties
> #create a topic with a single partition replicated on two brokers
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 
> --partitions 1 --replication-factor 2
> #reassign partitions onto brokers with Node IDs 11 and 12
> echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], 
> "version":1}' > /tmp/reassign.json
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 
> --reassignment-json-file /tmp/reassign.json --execute
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 
> --reassignment-json-file /tmp/reassign.json --verify
> #mak

[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk

2023-12-27 Thread Ron Dagostino (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Dagostino updated KAFKA-15495:
--
Affects Version/s: 3.6.1
   3.5.2
   3.5.0
   3.3.1
   3.2.3
   3.2.2
   3.4.0

> Partition truncated when the only ISR member restarts with an empty disk
> 
>
> Key: KAFKA-15495
> URL: https://issues.apache.org/jira/browse/KAFKA-15495
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.4.0, 3.2.2, 3.2.3, 3.3.1, 3.3.2, 3.5.0, 3.4.1, 3.6.0, 
> 3.5.1, 3.5.2, 3.6.1
>Reporter: Ron Dagostino
>Priority: Critical
>
> Assume a topic-partition in KRaft has just a single leader replica in the 
> ISR.  Assume next that this replica goes offline.  This replica's log will 
> define the contents of that partition when the replica restarts, which is 
> correct behavior.  However, assume now that the replica has a disk failure, 
> and we then replace the failed disk with a new, empty disk that we also 
> format with the storage tool so it has the correct cluster ID.  If we then 
> restart the broker, the topic-partition will have no data in it, and any 
> other replicas that might exist will truncate their logs to match, which 
> results in data loss.  See below for a step-by-step demo of how to reproduce 
> this.
> [KIP-858: Handle JBOD broker disk failure in 
> KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft]
>  introduces the concept of a Disk UUID that we can use to solve this problem. 
>  Specifically, when the leader restarts with an empty (but 
> correctly-formatted) disk, the actual UUID associated with the disk will be 
> different.  The controller will notice upon broker re-registration that its 
> disk UUID differs from what was previously registered.  Right now we have no 
> way of detecting this situation, but the disk UUID gives us that capability.
> STEPS TO REPRODUCE:
> Create a single broker cluster with single controller.  The standard files 
> under config/kraft work well:
> bin/kafka-storage.sh random-uuid
> J8qXRwI-Qyi2G0guFTiuYw
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/controller.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker.properties
> bin/kafka-server-start.sh config/kraft/controller.properties
> bin/kafka-server-start.sh config/kraft/broker.properties
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 
> --partitions 1 --replication-factor 1
> #create __consumer-offsets topics
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 
> --from-beginning
> ^C
> #confirm that __consumer_offsets topic partitions are all created and on 
> broker with node id 2
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
> Now create 2 more brokers, with node IDs 11 and 12
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 
> 's/localhost:9092/localhost:9011/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > 
> config/kraft/broker11.properties
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 
> 's/localhost:9092/localhost:9012/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > 
> config/kraft/broker12.properties
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker11.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker12.properties
> bin/kafka-server-start.sh config/kraft/broker11.properties
> bin/kafka-server-start.sh config/kraft/broker12.properties
> #create a topic with a single partition replicated on two brokers
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 
> --partitions 1 --replication-factor 2
> #reassign partitions onto brokers with Node IDs 11 and 12
> echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], 
> "version":1}' > /tmp/reassign.json
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 
> --reassignment-json-file /tmp/reassign.json --execute
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 
> --reassignment-json-file /tmp/reassign.json --verify
> #make preferred leader 11 the actual leader if it not
> bin/kafka-leader-election.sh --bootstrap-server localhost:9092 
> --all-topic-partitions --election-type pre

[jira] [Updated] (KAFKA-15495) Partition truncated when the only ISR member restarts with an empty disk

2023-12-27 Thread Ron Dagostino (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Dagostino updated KAFKA-15495:
--
Summary: Partition truncated when the only ISR member restarts with an 
empty disk  (was: KRaft partition truncated when the only ISR member restarts 
with an empty disk)

> Partition truncated when the only ISR member restarts with an empty disk
> 
>
> Key: KAFKA-15495
> URL: https://issues.apache.org/jira/browse/KAFKA-15495
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.3.2, 3.4.1, 3.6.0, 3.5.1
>Reporter: Ron Dagostino
>Priority: Critical
>
> Assume a topic-partition in KRaft has just a single leader replica in the 
> ISR.  Assume next that this replica goes offline.  This replica's log will 
> define the contents of that partition when the replica restarts, which is 
> correct behavior.  However, assume now that the replica has a disk failure, 
> and we then replace the failed disk with a new, empty disk that we also 
> format with the storage tool so it has the correct cluster ID.  If we then 
> restart the broker, the topic-partition will have no data in it, and any 
> other replicas that might exist will truncate their logs to match, which 
> results in data loss.  See below for a step-by-step demo of how to reproduce 
> this.
> [KIP-858: Handle JBOD broker disk failure in 
> KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft]
>  introduces the concept of a Disk UUID that we can use to solve this problem. 
>  Specifically, when the leader restarts with an empty (but 
> correctly-formatted) disk, the actual UUID associated with the disk will be 
> different.  The controller will notice upon broker re-registration that its 
> disk UUID differs from what was previously registered.  Right now we have no 
> way of detecting this situation, but the disk UUID gives us that capability.
> STEPS TO REPRODUCE:
> Create a single broker cluster with single controller.  The standard files 
> under config/kraft work well:
> bin/kafka-storage.sh random-uuid
> J8qXRwI-Qyi2G0guFTiuYw
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/controller.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker.properties
> bin/kafka-server-start.sh config/kraft/controller.properties
> bin/kafka-server-start.sh config/kraft/broker.properties
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 
> --partitions 1 --replication-factor 1
> #create the __consumer_offsets topic
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 
> --from-beginning
> ^C
> #confirm that __consumer_offsets topic partitions are all created and on 
> broker with node id 2
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
> Now create 2 more brokers, with node IDs 11 and 12
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 
> 's/localhost:9092/localhost:9011/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > 
> config/kraft/broker11.properties
> cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 
> 's/localhost:9092/localhost:9012/g' |  sed 
> 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > 
> config/kraft/broker12.properties
> #ensure we start clean
> /bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker11.properties
> bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config 
> config/kraft/broker12.properties
> bin/kafka-server-start.sh config/kraft/broker11.properties
> bin/kafka-server-start.sh config/kraft/broker12.properties
> #create a topic with a single partition replicated on two brokers
> bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 
> --partitions 1 --replication-factor 2
> #reassign partitions onto brokers with Node IDs 11 and 12
> echo '{"partitions":[{"topic": "foo2","partition": 0,"replicas": [11,12]}], 
> "version":1}' > /tmp/reassign.json
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 
> --reassignment-json-file /tmp/reassign.json --execute
> bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 
> --reassignment-json-file /tmp/reassign.json --verify
> #make preferred leader 11 the actual leader if it is not
> bin/kafka-leader-election.sh --bootstrap-server localhost:9092 
> --all-topic-partitions --election-type preferred
> #Confirm both brokers are in ISR and 11 is the leader
> bin/kafka-topic

[jira] [Commented] (KAFKA-16051) Deadlock on connector initialization

2023-12-27 Thread Octavian Ciubotaru (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800815#comment-17800815
 ] 

Octavian Ciubotaru commented on KAFKA-16051:


I do not have the privileges to assign the Jira ticket to myself, nor to assign you 
as a reviewer on the PR in GitHub.

> Deadlock on connector initialization
> 
>
> Key: KAFKA-16051
> URL: https://issues.apache.org/jira/browse/KAFKA-16051
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.6.3, 3.6.1
>Reporter: Octavian Ciubotaru
>Priority: Major
>
>  
> Tested with Kafka 3.6.1 and 2.6.3.
> The only plugin installed is confluentinc-kafka-connect-jdbc-10.7.4.
> Stack trace for Kafka 3.6.1:
> {noformat}
> Found one Java-level deadlock:
> =
> "pool-3-thread-1":
>   waiting to lock monitor 0x7fbc88006300 (object 0x91002aa0, a 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder),
>   which is held by "Thread-9"
> "Thread-9":
>   waiting to lock monitor 0x7fbc88008800 (object 0x9101ccd8, a 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore),
>   which is held by "pool-3-thread-1"
> Java stack information for the threads listed above:
> ===
> "pool-3-thread-1":
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder$ConfigUpdateListener.onTaskConfigUpdate(StandaloneHerder.java:516)
>     - waiting to lock <0x91002aa0> (a 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder)
>     at 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:137)
>     - locked <0x9101ccd8> (a 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.lambda$null$2(StandaloneHerder.java:229)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder$$Lambda$692/0x000840557440.run(Unknown
>  Source)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.21/Executors.java:515)
>     at 
> java.util.concurrent.FutureTask.run(java.base@11.0.21/FutureTask.java:264)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.21/ScheduledThreadPoolExecutor.java:304)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.21/ThreadPoolExecutor.java:1128)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.21/ThreadPoolExecutor.java:628)
>     at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
> "Thread-9":
>     at 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:129)
>     - waiting to lock <0x9101ccd8> (a 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.requestTaskReconfiguration(StandaloneHerder.java:255)
>     - locked <0x91002aa0> (a 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder)
>     at 
> org.apache.kafka.connect.runtime.HerderConnectorContext.requestTaskReconfiguration(HerderConnectorContext.java:50)
>     at 
> org.apache.kafka.connect.runtime.WorkerConnector$WorkerConnectorContext.requestTaskReconfiguration(WorkerConnector.java:548)
>     at 
> io.confluent.connect.jdbc.source.TableMonitorThread.run(TableMonitorThread.java:86)
> Found 1 deadlock.
> {noformat}
> The JDBC source connector loads tables from the database and updates the 
> configuration once the list is available. The deadlock is very consistent in 
> my environment, probably because the database is on the same machine.
> Maybe it is possible to avoid this situation by always locking the herder 
> first and the config backing store second. From what I see, 
> updateConnectorTasks is sometimes called before locking on the herder and other 
> times it is not.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16051) Deadlock on connector initialization

2023-12-27 Thread Octavian Ciubotaru (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800814#comment-17800814
 ] 

Octavian Ciubotaru commented on KAFKA-16051:


Hi [~gharris1727], thank you for your insights; I certainly still have a lot to 
learn about the project.

In the meantime I created the PR. The fix is to always lock on the herder first and on 
the config backing store second. I tested it in my environment and Kafka Connect 
starts successfully with the fix applied.
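
For illustration, here is a minimal sketch of that lock-ordering idea. It is not the actual Connect code; the class and field names are hypothetical, and the two Object monitors merely stand in for the StandaloneHerder and MemoryConfigBackingStore monitors seen in the stack traces.

{code:java}
// Minimal sketch of consistent lock ordering (hypothetical names, not the real Connect classes).
// Both call paths take the "herder" lock first and the "store" lock second,
// so the two monitors can never be acquired in opposite orders and the deadlock goes away.
public class LockOrderingSketch {
    private final Object herderLock = new Object(); // stands in for the StandaloneHerder monitor
    private final Object storeLock = new Object();  // stands in for the MemoryConfigBackingStore monitor

    // Path 1: a connector thread requests task reconfiguration.
    public void requestTaskReconfiguration() {
        synchronized (herderLock) {        // herder first
            synchronized (storeLock) {     // store second
                // write the new task configs to the backing store
            }
        }
    }

    // Path 2: the store-side update callback.
    // Taking the herder lock before touching store state keeps the order identical to path 1.
    public void onTaskConfigUpdate() {
        synchronized (herderLock) {
            synchronized (storeLock) {
                // notify the herder about the updated task configs
            }
        }
    }
}
{code}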

> Deadlock on connector initialization
> 
>
> Key: KAFKA-16051
> URL: https://issues.apache.org/jira/browse/KAFKA-16051
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.6.3, 3.6.1
>Reporter: Octavian Ciubotaru
>Priority: Major
>
>  
> Tested with Kafka 3.6.1 and 2.6.3.
> The only plugin installed is confluentinc-kafka-connect-jdbc-10.7.4.
> Stack trace for Kafka 3.6.1:
> {noformat}
> Found one Java-level deadlock:
> =
> "pool-3-thread-1":
>   waiting to lock monitor 0x7fbc88006300 (object 0x91002aa0, a 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder),
>   which is held by "Thread-9"
> "Thread-9":
>   waiting to lock monitor 0x7fbc88008800 (object 0x9101ccd8, a 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore),
>   which is held by "pool-3-thread-1"
> Java stack information for the threads listed above:
> ===
> "pool-3-thread-1":
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder$ConfigUpdateListener.onTaskConfigUpdate(StandaloneHerder.java:516)
>     - waiting to lock <0x91002aa0> (a 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder)
>     at 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:137)
>     - locked <0x9101ccd8> (a 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.lambda$null$2(StandaloneHerder.java:229)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder$$Lambda$692/0x000840557440.run(Unknown
>  Source)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.21/Executors.java:515)
>     at 
> java.util.concurrent.FutureTask.run(java.base@11.0.21/FutureTask.java:264)
>     at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.21/ScheduledThreadPoolExecutor.java:304)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.21/ThreadPoolExecutor.java:1128)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.21/ThreadPoolExecutor.java:628)
>     at java.lang.Thread.run(java.base@11.0.21/Thread.java:829)
> "Thread-9":
>     at 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore.putTaskConfigs(MemoryConfigBackingStore.java:129)
>     - waiting to lock <0x9101ccd8> (a 
> org.apache.kafka.connect.storage.MemoryConfigBackingStore)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.updateConnectorTasks(StandaloneHerder.java:483)
>     at 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder.requestTaskReconfiguration(StandaloneHerder.java:255)
>     - locked <0x91002aa0> (a 
> org.apache.kafka.connect.runtime.standalone.StandaloneHerder)
>     at 
> org.apache.kafka.connect.runtime.HerderConnectorContext.requestTaskReconfiguration(HerderConnectorContext.java:50)
>     at 
> org.apache.kafka.connect.runtime.WorkerConnector$WorkerConnectorContext.requestTaskReconfiguration(WorkerConnector.java:548)
>     at 
> io.confluent.connect.jdbc.source.TableMonitorThread.run(TableMonitorThread.java:86)
> Found 1 deadlock.
> {noformat}
> The JDBC source connector loads tables from the database and updates the 
> configuration once the list is available. The deadlock is very consistent in 
> my environment, probably because the database is on the same machine.
> Maybe it is possible to avoid this situation by always locking the herder 
> first and the config backing store second. From what I see, 
> updateConnectorTasks is sometimes called before locking on the herder and other 
> times it is not.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] KAFKA-16051: Fixed deadlock in StandaloneHerder [kafka]

2023-12-27 Thread via GitHub


developster opened a new pull request, #15080:
URL: https://github.com/apache/kafka/pull/15080

   *Description of the change*
   Changed StandaloneHerder to always synchronize on itself before invoking any 
methods on MemoryConfigBackingStore. This resolves the deadlock because the order of 
acquiring the locks is now always the same: first on the herder and then on the config 
backing store.
   
   *Testing strategy*
   Manually tested StandaloneHerder from Kafka 2.6.3 and Kafka 3.6.1; both 
have the same issue, with startup ending in a deadlock 9 times out of 10. Tested after 
applying the fix (3.8.0-SNAPSHOT) and the deadlock no longer happens.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800797#comment-17800797
 ] 

Divij Vaidya commented on KAFKA-16052:
--

I am now running a single-threaded test execution after disabling 
GroupCoordinatorConcurrencyTest. Once we have verified that this is the culprit 
test, we can try to find a fix.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the utilized heap memory looks, starting from tens of MB and 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases, but it does 
> not correlate with the sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15904) Downgrade tests are failing with directory.id 

2023-12-27 Thread Proven Provenzano (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Proven Provenzano resolved KAFKA-15904.
---
Resolution: Fixed

This was merged into trunk a month ago, long before the 3.7 branch was cut. I 
just forgot to close the ticket.

> Downgrade tests are failing with directory.id 
> --
>
> Key: KAFKA-15904
> URL: https://issues.apache.org/jira/browse/KAFKA-15904
> Project: Kafka
>  Issue Type: Bug
>Reporter: Manikumar
>Assignee: Proven Provenzano
>Priority: Major
> Fix For: 3.7.0
>
>
> {{kafkatest.tests.core.downgrade_test.TestDowngrade}} tests are failing after 
> [https://github.com/apache/kafka/pull/14628.] 
> We have added {{directory.id}} to metadata.properties. This means 
> {{metadata.properties}} will be different for different log directories.
> Cluster downgrades will fail with the below error if we have multiple log 
> directories. This looks like a blocker, or it requires additional downgrade steps from 
> AK 3.7. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15904) Downgrade tests are failing with directory.id 

2023-12-27 Thread Proven Provenzano (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Proven Provenzano updated KAFKA-15904:
--
Fix Version/s: 3.7.0
   (was: 3.8.0)

> Downgrade tests are failing with directory.id 
> --
>
> Key: KAFKA-15904
> URL: https://issues.apache.org/jira/browse/KAFKA-15904
> Project: Kafka
>  Issue Type: Bug
>Reporter: Manikumar
>Assignee: Proven Provenzano
>Priority: Major
> Fix For: 3.7.0
>
>
> {{kafkatest.tests.core.downgrade_test.TestDowngrade}} tests are failing after 
> [https://github.com/apache/kafka/pull/14628.] 
> We have added {{directory.id}} to metadata.properties. This means 
> {{metadata.properties}} will be different for different log directories.
> Cluster downgrades will fail with the below error if we have multiple log 
> directories. This looks like a blocker, or it requires additional downgrade steps from 
> AK 3.7. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Attachment: Screenshot 2023-12-27 at 15.31.09.png

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the utilized heap memory looks, starting from tens of MB and 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases, but it does 
> not correlate with the sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800788#comment-17800788
 ] 

Divij Vaidya commented on KAFKA-16052:
--

OK, I might have found the offending test. It is probably 
GroupCoordinatorConcurrencyTest.

This test alone uses 400MB of heap, and the heap dump has a pattern consistent 
with the leaked objects we found above. 

 

!Screenshot 2023-12-27 at 15.31.09.png!

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the utilized heap memory looks, starting from tens of MB and 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases, but it does 
> not correlate with the sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800772#comment-17800772
 ] 

Divij Vaidya edited comment on KAFKA-16052 at 12/27/23 2:01 PM:


-Let's try to find the test which was running at the time of this heap dump 
capture by analysing the rest of the objects reachable by GC at this time.-

 

Scratch that. It wouldn't work, since the leaks could have occurred in previous 
tests and not necessarily in the ones that were running at the time of the heap 
dump. 

I need to open the heap dump with a better tool than IntelliJ's profiler and 
see if I can explore the mocks to find out what the mock is and what invocations 
it is storing.


was (Author: divijvaidya):
Let's try to find the test which was running at the time of this heap dump 
capture by analysing the rest of the objects reachable by GC at this time.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the utilized heap memory looks, starting from tens of MB and 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases, but it does 
> not correlate with the sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16054) Sudden 100% CPU on a broker

2023-12-27 Thread Oleksandr Shulgin (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleksandr Shulgin updated KAFKA-16054:
--
Description: 
We have observed now for the 3rd time in production the issue where a Kafka 
broker will suddenly jump to 100% CPU usage and will not recover on its own 
until manually restarted.

After a deeper investigation, we now believe that this is an instance of the 
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
 (same workaround in the current Netty code)

The first occurrence in our production environment was on 2023-08-26 and the 
other two — on 2023-12-10 and 2023-12-20.

Each time the high CPU issue is also resulting in this other issue (misplaced 
messages) I was asking about on the users mailing list in September, but to 
date got not a single reply, unfortunately: 
[https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy]

We still do not know how this other issue is happening.

When the high CPU happens, top(1) reports a number of "data-plane-kafka..." 
threads consuming ~60% user and ~40% system CPU, and the thread dump contains a 
lot of stack traces like the following one:

"data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" 
#76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0xa12d7690 
nid=0x20c runnable [0xfffed87fe000]
java.lang.Thread.State: RUNNABLE
#011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method)
#011at 
sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118)
#011at 
sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129)
#011- locked <0x0006c1246410> (a sun.nio.ch.Util$2)
#011- locked <0x0006c1246318> (a sun.nio.ch.EPollSelectorImpl)
#011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141)
#011at org.apache.kafka.common.network.Selector.select(Selector.java:874)
#011at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
#011at kafka.network.Processor.poll(SocketServer.scala:1107)
#011at kafka.network.Processor.run(SocketServer.scala:1011)
#011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)

At the same time the Linux kernel reports repeatedly "TCP: out of memory – 
consider tuning tcp_mem".

We are running relatively big machines in production — c6g.4xlarge with 30 GB 
RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 
753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB 
memory pages.

We were able to reproduce the issue in our test environment (which is using 4x 
smaller machines), simply by tuning the tcp_mem down by a factor of 10: "sudo 
sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". The strace of one of the busy 
Kafka threads shows the following syscalls repeating constantly:

epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=468381628432382}}], 1024, 300, NULL, 8) = 1
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)

Resetting the "tcp_mem" parameters back to the auto-configured values in the 
test environment removes the pressure from the broker and it can continue 
normally without restart.

We have found a bug report here that suggests that an issue may be partially 
due to a kernel bug: 
[https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] 
(they are using version 5.15)

We have updated our kernel from 6.1.29 to 6.1.66 and that made it harder to 
reproduce the issue, but we can still do it by reducing all of the "tcp_mem" 
parameters by a factor of 1,000. The JVM behavior is the same under these 
conditions.

A similar issue is reported here, affecting Kafka Connect:
https://issues.apache.org/jira/browse/KAFKA-4739

Our production Kafka is running version 3.3.2, and test — 3.6.1.  The issue is 
present on both systems.

The issue is also reproducible on JDK 11 (as you can see from the stack trace, 
we are using 17).

  was:
We have observed now for the 3rd time in production the issue where a Kafka 
broker will suddenly jump to 100% CPU usage and will not recover on its own 
until manually restarted.

After a deeper investigation, we now believe that this is an instance of the 
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
 (same workaround in the current Netty co

[jira] [Updated] (KAFKA-16054) Sudden 100% CPU on a broker

2023-12-27 Thread Oleksandr Shulgin (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleksandr Shulgin updated KAFKA-16054:
--
Description: 
We have observed now for the 3rd time in production the issue where a Kafka 
broker will suddenly jump to 100% CPU usage and will not recover on its own 
until manually restarted.

After a deeper investigation, we now believe that this is an instance of the 
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
 (same workaround in the current Netty code)

The first occurrence in our production environment was on 2023-08-26 and the 
other two — on 2023-12-10 and 2023-12-20.

Each time the high CPU issue is also resulting in this other issue (misplaced 
messages) I was asking about on the users mailing list in September, but to 
date got not a single reply, unfortunately: 
[https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy]

We still do not know how this other issue is happening.

When the high CPU happens, top(1) reports a number of "data-plane-kafka..." 
threads consuming ~60% user and ~40% system CPU, and the thread dump contains a 
lot of stack traces like the following one:

"data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" 
#76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0xa12d7690 
nid=0x20c runnable [0xfffed87fe000]
java.lang.Thread.State: RUNNABLE
#011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method)
#011at 
sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118)
#011at 
sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129)
#011- locked <0x0006c1246410> (a sun.nio.ch.Util$2)
#011- locked <0x0006c1246318> (a sun.nio.ch.EPollSelectorImpl)
#011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141)
#011at org.apache.kafka.common.network.Selector.select(Selector.java:874)
#011at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
#011at kafka.network.Processor.poll(SocketServer.scala:1107)
#011at kafka.network.Processor.run(SocketServer.scala:1011)
#011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)

At the same time the Linux kernel reports repeatedly "TCP: out of memory – 
consider tuning tcp_mem".

We are running relatively big machines in production — c6g.4xlarge with 30 GB 
RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 
753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB 
memory pages.

We were able to reproduce the issue in our test environment (which is using 4x 
smaller machines), simply by tuning the tcp_mem down by a factor of 10: "sudo 
sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". The strace of one of the busy 
Kafka threads shows the following syscalls repeating constantly:

epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=468381628432382}}], 1024, 300, NULL, 8) = 1
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)

Resetting the "tcp_mem" parameters back to the auto-configured values in the 
test environment removes the pressure from the broker and it can continue 
normally without restart.

We have found a bug report here that suggests that an issue may be partially 
due to a kernel bug: 
[https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] 
(they are using version 5.15)

We have updated our kernel from 6.1.29 to 6.1.66 and it made it harder to 
reproduce the issue, but we can still do it by reducing all of the "tcp_mem" 
parameters by a factor of 1,000. The JVM behavior is the same under these 
conditions.

A similar issue is reported here, affecting Kafka Connect:
https://issues.apache.org/jira/browse/KAFKA-4739

Our production Kafka is running version 3.3.2, and test — 3.6.1.  The issue is 
present on both systems.

The issue is also reproducible on JDK 11 (as you can see from the stack trace, 
we are using 17).

  was:
We have observed now for the 3rd time in production the issue where a Kafka 
broker will suddenly jump to 100% CPU usage and will not recover on its own 
until manually restarted.

After a deeper investigation, we now believe that this is an instance of the 
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
 (same workaround in the current Netty code)

[jira] [Updated] (KAFKA-16054) Sudden 100% CPU on a broker

2023-12-27 Thread Oleksandr Shulgin (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleksandr Shulgin updated KAFKA-16054:
--
Description: 
We have observed now for the 3rd time in production the issue where a Kafka 
broker will suddenly jump to 100% CPU usage and will not recover on its own 
until manually restarted.

After a deeper investigation, we now believe that this is an instance of the 
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
 (same workaround in the current Netty code)

The first occurrence in our production environment was on 2023-08-26 and the 
other two — on 2023-12-10 and 2023-12-20.

Each time the high CPU issue is also resulting in this other issue (misplaced 
messages) I was asking about on the users mailing list in September, but to 
date got not a single reply, unfortunately: 
[https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy]

We still do not know how this other issue is happening.

When the high CPU happens, top(1) reports a number of "data-plane-kafka..." 
threads consuming ~60% user and ~40% system CPU, and the thread dump contains a 
lot of stack traces like the following one:

"data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" 
#76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0xa12d7690 
nid=0x20c runnable [0xfffed87fe000]
java.lang.Thread.State: RUNNABLE
#011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method)
#011at 
sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118)
#011at 
sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129)
#011- locked <0x0006c1246410> (a sun.nio.ch.Util$2)
#011- locked <0x0006c1246318> (a sun.nio.ch.EPollSelectorImpl)
#011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141)
#011at org.apache.kafka.common.network.Selector.select(Selector.java:874)
#011at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
#011at kafka.network.Processor.poll(SocketServer.scala:1107)
#011at kafka.network.Processor.run(SocketServer.scala:1011)
#011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)

At the same time the Linux kernel reports repeatedly "TCP: out of memory – 
consider tuning tcp_mem".

We are running relatively big machines in production — c6g.4xlarge with 30 GB 
RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 
753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB 
memory pages.

We were able to reproduce the issue in our test environment (which is using 4x 
smaller machines), simply by tuning the tcp_mem down by a factor of 10: "sudo 
sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". The strace of one of the busy 
Kafka threads shows the following syscalls repeating constantly:

epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=468381628432382}}], 1024, 300, NULL, 8) = 1
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)

Resetting the "tcp_mem" parameters back to the auto-configured values in the 
test environment removes the pressure from the broker and it can continue 
normally without restart.

We have found a bug report here that suggests that an issue may be partially 
due to a kernel bug: 
[https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] 
(they are using version 5.15)

We have updated our kernel from 6.1.29 to 6.1.66 and it made it harder to 
reproduce the issue, but we can still do it by reducing all of the "tcp_mem" 
parameters by a factor of 1,000. The JVM behavior is the same under these 
conditions.

A similar issue is reported here, affecting Kafka Connect:
https://issues.apache.org/jira/browse/KAFKA-4739

Our production Kafka is running version 3.3.2, and test — 3.6.1.  The issue is 
present on both systems.

  was:
We have observed now for the 3rd time in production the issue where a Kafka 
broker will suddenly jump to 100% CPU usage and will not recover on its own 
until manually restarted.

After a deeper investigation, we now believe that this is an instance of the 
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
 (same workaround in the current Netty code)

The first occurrence in our production environment was on 2023-08-26 and the 
other two — on 2023-

[jira] [Updated] (KAFKA-16054) Sudden 100% CPU on a broker

2023-12-27 Thread Oleksandr Shulgin (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oleksandr Shulgin updated KAFKA-16054:
--
Description: 
We have observed now for the 3rd time in production the issue where a Kafka 
broker will suddenly jump to 100% CPU usage and will not recover on its own 
until manually restarted.

After a deeper investigation, we now believe that this is an instance of the 
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
 (same workaround in the current Netty code)

The first occurrence in our production environment was on 2023-08-26 and the 
other two — on 2023-12-10 and 2023-12-20.

Each time the high CPU issue is also resulting in this other issue (misplaced 
messages) I was asking about on the users mailing list in September, but to 
date got not a single reply, unfortunately: 
[https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy]

We still do not know how this other issue is happening.

When the high CPU happens, "top -H" reports a number of "data-plane-kafka-..." 
threads consuming ~60% user and ~40% system CPU, and the thread dump contains a 
lot of stack traces like the following one:

"data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" 
#76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0xa12d7690 
nid=0x20c runnable [0xfffed87fe000]
java.lang.Thread.State: RUNNABLE
#011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method)
#011at 
sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118)
#011at 
sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129)
#011- locked <0x0006c1246410> (a sun.nio.ch.Util$2)
#011- locked <0x0006c1246318> (a sun.nio.ch.EPollSelectorImpl)
#011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141)
#011at org.apache.kafka.common.network.Selector.select(Selector.java:874)
#011at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
#011at kafka.network.Processor.poll(SocketServer.scala:1107)
#011at kafka.network.Processor.run(SocketServer.scala:1011)
#011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)

At the same time the Linux kernel reports repeatedly "TCP: out of memory – 
consider tuning tcp_mem".

We are running relatively big machines in production — c6g.4xlarge with 30 GB 
RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 
753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB 
memory pages.

We were able to reproduce the issue in our test environment (which is using 4x 
smaller machines), simply by tuning the tcp_mem down by a factor of 10: "sudo 
sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". The strace of one of the busy 
Kafka threads shows the following syscalls repeating constantly:

epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=468381628432382}}], 1024, 300, NULL, 8) = 1
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)

Resetting the "tcp_mem" parameters back to the auto-configured values in the 
test environment removes the pressure from the broker and it can continue 
normally without restart.

We have found a bug report here that suggests that an issue may be partially 
due to a kernel bug: 
[https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] 
(they are using version 5.15)

We have updated our kernel from 6.1.29 to 6.1.66 and it made it harder to 
reproduce the issue, but we can still do it by reducing all of the "tcp_mem" 
parameters by a factor of 1,000. The JVM behavior is the same under these 
conditions.

A similar issue is reported here, affecting Kafka Connect:
https://issues.apache.org/jira/browse/KAFKA-4739

Our production Kafka is running version 3.3.2, and test — 3.6.1.  The issue is 
present on both systems.

  was:
We have observed now for the 3rd time in production the issue where a Kafka 
broker will suddenly jump to 100% CPU usage and will not recover on its own 
until manually restarted.

After a deeper investigation, we now believe that this is an instance of the 
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
 (same workaround in the current Netty code)

The first occurrence in our production environment was on 2023-08-26 and the 
other two — on 2023-12-10 and 2023-12-20.

Each time the high CPU issue is also resulting in 

[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800772#comment-17800772
 ] 

Divij Vaidya commented on KAFKA-16052:
--

Let's try to find the test which was running at the time of this heap dump 
capture by analysing the rest of the objects reachable by GC at this time.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the utilized heap memory looks, starting from tens of MB and 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases, but it does 
> not correlate with the sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16054) Sudden 100% CPU on a broker

2023-12-27 Thread Oleksandr Shulgin (Jira)
Oleksandr Shulgin created KAFKA-16054:
-

 Summary: Sudden 100% CPU on a broker
 Key: KAFKA-16054
 URL: https://issues.apache.org/jira/browse/KAFKA-16054
 Project: Kafka
  Issue Type: Bug
  Components: network
Affects Versions: 3.6.1, 3.3.2
 Environment: Amazon AWS, c6g.4xlarge arm64 16 vCPUs + 30 GB,  Amazon 
Linux
Reporter: Oleksandr Shulgin


We have observed now for the 3rd time in production the issue where a Kafka 
broker will suddenly jump to 100% CPU usage and will not recover on its own 
until manually restarted.

After a deeper investigation, we now believe that this is an instance of the 
infamous epoll bug. See:
[https://github.com/netty/netty/issues/327]
[https://github.com/netty/netty/pull/565] (original workaround)
[https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632]
 (same workaround in the current Netty code)

The first occurrence in our production environment was on 2023-08-26 and the 
other two — on 2023-12-10 and 2023-12-20.

Each time the high CPU issue is also resulting in this other issue (misplaced 
messages) I was asking about on the users mailing list in September, but to 
date got not a single reply, unfortunately: 
[https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy]

We still do not know how this other issue is happening.

When the high CPU happens, "top -H" reports a number of 
"data-plane-kafka-..." threads consuming ~60% user and ~40% system CPU, and 
the thread dump contains a lot of stack traces like the following one:

"data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" 
#76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0xa12d7690 
nid=0x20c runnable [0xfffed87fe000]
java.lang.Thread.State: RUNNABLE
#011at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method)
#011at 
sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118)
#011at 
sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129)
#011- locked <0x0006c1246410> (a sun.nio.ch.Util$2)
#011- locked <0x0006c1246318> (a sun.nio.ch.EPollSelectorImpl)
#011at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141)
#011at org.apache.kafka.common.network.Selector.select(Selector.java:874)
#011at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
#011at kafka.network.Processor.poll(SocketServer.scala:1107)
#011at kafka.network.Processor.run(SocketServer.scala:1011)
#011at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)

At the same time the Linux kernel reports repeatedly "TCP: out of memory – 
consider tuning tcp_mem".

We are running relatively big machines in production — c6g.4xlarge with 30 GB 
RAM and the auto-configured setting is: "net.ipv4.tcp_mem = 376608 502145 
753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB 
memory pages.

We were able to reproduce the issue in our test environment (which is using 4x 
smaller machines), simply by tuning the tcp_mem down by a factor of 10: "sudo 
sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". The strace of one of the busy 
Kafka threads shows the following syscalls repeating constantly:

epoll_pwait(15558, [{events=EPOLLOUT, data={u32=12286, u64=468381628432382}}], 1024, 300, NULL, 8) = 1
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)

Resetting the "tcp_mem" parameters back to the auto-configured values in the 
test environment removes the pressure from the broker and it can continue 
normally without restart.

We have found a bug report here that suggests that an issue may be partially 
due to a kernel bug: 
[https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335] 
(they are using version 5.15)

We have updated our kernel from 6.1.29 to 6.1.66 and it made it harder to 
reproduce the issue, but we can still do it by reducing all of the "tcp_mem" 
parameters by a factor of 1,000. The JVM behavior is the same under these 
conditions.

A similar issue is reported here, affecting Kafka Connect:
https://issues.apache.org/jira/browse/KAFKA-4739

Our production Kafka is running version 3.3.2, and test — 3.6.1.  The issue is 
present on both systems.
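
For reference, below is a rough sketch of the general idea behind the Netty workaround linked above: if select() keeps returning early with nothing ready, replace the Selector and re-register the channels. This is not Kafka's or Netty's actual code; the class name and the threshold are purely illustrative.

{code:java}
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Rough sketch of "rebuild the selector after repeated premature wakeups".
// The threshold and names are illustrative only.
public class SelectorRebuildSketch {
    private static final int SPURIOUS_WAKEUP_THRESHOLD = 512;
    private Selector selector;
    private int spuriousWakeups = 0;

    public SelectorRebuildSketch() throws IOException {
        this.selector = Selector.open();
    }

    // Poll once; if select() returned well before the timeout with no ready keys,
    // count it as a possible spurious wakeup and rebuild past the threshold.
    public void pollOnce(long timeoutMs) throws IOException {
        long start = System.nanoTime();
        int ready = selector.select(timeoutMs);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (ready == 0 && elapsedMs < timeoutMs) {
            if (++spuriousWakeups >= SPURIOUS_WAKEUP_THRESHOLD) {
                rebuildSelector();
                spuriousWakeups = 0;
            }
        } else {
            spuriousWakeups = 0;
            // ... process selector.selectedKeys() as usual ...
        }
    }

    // Move every valid registration onto a fresh selector and close the old one.
    private void rebuildSelector() throws IOException {
        Selector newSelector = Selector.open();
        for (SelectionKey key : selector.keys()) {
            if (key.isValid()) {
                key.channel().register(newSelector, key.interestOps(), key.attachment());
                key.cancel();
            }
        }
        selector.close();
        selector = newSelector;
    }
}
{code}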



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800770#comment-17800770
 ] 

Divij Vaidya commented on KAFKA-16052:
--

We have an org.mockito.internal.verification.DefaultRegisteredInvocations 
class using 590MB of heap. This class maintains a map of all mocks with their 
associated invocations. If a mock is called a very large number of times, the 
map grows very large in order to track all of those invocations, which causes 
this heap utilization.

Now we need to find the test whose mock has around 767K recorded invocations; 
that mock is our culprit.

!Screenshot 2023-12-27 at 14.45.20.png!
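
For illustration only (this is not code from the Kafka test suite), here is a 
minimal, self-contained sketch of the Mockito behaviour involved: a regular 
mock retains every invocation (which is what DefaultRegisteredInvocations 
holds in memory), while a stub-only mock does not, which keeps the heap flat 
for mocks that are hit hundreds of thousands of times, at the cost of losing 
verify() support on them.
{code:java}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.mockingDetails;
import static org.mockito.Mockito.withSettings;

import java.util.List;

public class StubOnlyMockSketch {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        // Regular mock: every call is retained so that verify() can inspect it later.
        List<String> recording = mock(List.class);
        for (int i = 0; i < 100_000; i++) {
            recording.size();
        }
        // Number of invocations Mockito kept in memory for this mock.
        System.out.println(mockingDetails(recording).getInvocations().size());

        // Stub-only mock: invocations are not retained, so memory stays flat,
        // but verify() can no longer be used against this mock.
        List<String> stubOnly = mock(List.class, withSettings().stubOnly());
        for (int i = 0; i < 100_000; i++) {
            stubOnly.size();
        }
        // Expected to be empty for a stub-only mock, since nothing is recorded.
        System.out.println(mockingDetails(stubOnly).getInvocations().size());
    }
}
{code}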

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilization looks, starting from tens of MB and 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Attachment: Screenshot 2023-12-27 at 14.45.20.png

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilization looks, starting from tens of MB and 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Description: 
*Problem*
Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 


*Setup*
To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:
{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}

*Result of experiment*
This is how the heap memory utilization looks, starting from tens of MB and 
ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. 
Note that the total number of threads also increases but it does not correlate 
with sharp increase in heap memory usage. The heap dump is available at 
[https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
 

!Screenshot 2023-12-27 at 14.22.21.png!

  was:
Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 

To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:
{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}
This is how the heap memory utilization looks, starting from tens of MB and 
ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. 
Note that the total number of threads also increases but it does not correlate 
with sharp increase in heap memory usage. The heap dump is available at 
[https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
 

!Screenshot 2023-12-27 at 14.22.21.png!


> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action 

[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Description: 
Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 

To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:
{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}
This is how the heap memory utilization looks, starting from tens of MB and 
ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. 
Note that the total number of threads also increases but it does not correlate 
with sharp increase in heap memory usage. The heap dump is available at 
[https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0]
 

!Screenshot 2023-12-27 at 14.22.21.png!

  was:
Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 

To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:
{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}
This is how the heap memory utilization looks, starting from tens of MB and 
ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. 
Note that the total number of threads also increases but it does not correlate 
with sharp increase in heap memory usage. The heap dump is attached to this 
JIRA.

!Screenshot 2023-12-27 at 14.22.21.png!


> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png
>
>
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/

[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Description: 
Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 

To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:
{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}
This is how the heap memory utilization looks, starting from tens of MB and 
ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. 
Note that the total number of threads also increases but it does not correlate 
with sharp increase in heap memory usage. The heap dump is attached to this 
JIRA.

!Screenshot 2023-12-27 at 14.22.21.png!

  was:
Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 

To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:
{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}
This is how the heap memory utilization looks, starting from tens of MB and 
ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. 
Note that the total number of threads also increases but it does not correlate 
with sharp increase in heap memory usage.

!Screenshot 2023-12-27 at 14.22.21.png!


> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png
>
>
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ?

[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Description: 
Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 

To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:
{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}
This is how the heap memory utilization looks, starting from tens of MB and 
ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes. 
Note that the total number of threads also increases but it does not correlate 
with sharp increase in heap memory usage.

!Screenshot 2023-12-27 at 14.22.21.png!

  was:
Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 

To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:
{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}

This is how the heap memory utilization looks, starting from tens of MB and 
ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes.

!Screenshot 2023-12-27 at 14.22.21.png!


> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png
>
>
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toIn

[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Description: 
Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 

To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:
{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}

This is how the heap memory utilization looks, starting from tens of MB and 
ending with 1.5GB (with spikes of 2GB) of heap being used as the test executes.

!Screenshot 2023-12-27 at 14.22.21.png!

  was:
Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 

To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:


{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}


> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png
>
>
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git 

[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Attachment: Screenshot 2023-12-27 at 14.22.21.png

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png
>
>
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


divijvaidya commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1437034332


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -2704,28 +2710,30 @@ class ReplicaManagerTest {
   time = time,
   scheduler = time.scheduler,
   logManager = logManager,
-  quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""),
+  quotaManagers = quotaManager,
   metadataCache = MetadataCache.zkMetadataCache(config.brokerId, 
config.interBrokerProtocolVersion),
   logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size),
   alterPartitionManager = alterPartitionManager,
   threadNamePrefix = Option(this.getClass.getName))
 
-logManager.startup(Set.empty[String])
-
-// Create a hosted topic, a hosted topic that will become stray, and a 
stray topic
-val validLogs = createHostedLogs("hosted-topic", numLogs = 2, 
replicaManager).toSet
-createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
-createStrayLogs(10, logManager)
+try {
+  logManager.startup(Set.empty[String])
 
-val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new 
TopicPartition("hosted-topic", 1))
+  // Create a hosted topic, a hosted topic that will become stray, and a 
stray topic
+  val validLogs = createHostedLogs("hosted-topic", numLogs = 2, 
replicaManager).toSet
+  createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
+  createStrayLogs(10, logManager)
 
-
replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
+  val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new 
TopicPartition("hosted-topic", 1))
 
-assertEquals(validLogs, logManager.allLogs.toSet)
-assertEquals(validLogs.size, replicaManager.partitionCount.value)
+  
replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
 
-replicaManager.shutdown()
-logManager.shutdown()
+  assertEquals(validLogs, logManager.allLogs.toSet)
+  assertEquals(validLogs.size, replicaManager.partitionCount.value)
+} finally {
+  replicaManager.shutdown()
+  logManager.shutdown()

Review Comment:
   That is fair. In that case, we should perhaps use `public static void 
closeQuietly(AutoCloseable closeable, String name, AtomicReference 
firstException)` and end the finally with something similar to
   
   ```
   if (firstException.get() != null) {
       throw new KafkaException("Failed to close", firstException.get());
   }
   ```
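   
   For illustration only (helper and names made up, not Kafka test code), a 
self-contained sketch of that pattern built on the existing 
`Utils.closeQuietly(AutoCloseable, String, AtomicReference)` overload:
   
   ```java
   import java.util.concurrent.atomic.AtomicReference;

   import org.apache.kafka.common.KafkaException;
   import org.apache.kafka.common.utils.Utils;

   public class CloseAllSketch {
       // Close every resource, remember only the first failure, then rethrow it,
       // so later resources still get closed but a broken close does not go unnoticed.
       public static void closeAll(AutoCloseable... resources) {
           AtomicReference<Throwable> firstException = new AtomicReference<>();
           int i = 0;
           for (AutoCloseable resource : resources) {
               Utils.closeQuietly(resource, "resource-" + i++, firstException);
           }
           if (firstException.get() != null) {
               throw new KafkaException("Failed to close", firstException.get());
           }
       }
   }
   ```
   
   The finally block could then pass the replica manager and log manager 
shutdowns as lambdas, which keeps the fail-on-broken-close behaviour while 
still closing everything.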



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800764#comment-17800764
 ] 

Divij Vaidya commented on KAFKA-16052:
--

The tests are still running, but as of now the root cause looks like a leak 
caused by Mockito, since it occupies ~950MB of heap.



!Screenshot 2023-12-27 at 14.04.52.png!

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png
>
>
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16052:
-
Attachment: Screenshot 2023-12-27 at 14.04.52.png

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png
>
>
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] KAFKA-16053: Fix leaks of KDC server in tests [kafka]

2023-12-27 Thread via GitHub


divijvaidya commented on PR #15079:
URL: https://github.com/apache/kafka/pull/15079#issuecomment-1870288475

   @dajac @ableegoldman please review. This is not the root cause of our recent 
OOM errors but is a contributor to it.
   
   (cc: @stanislavkozlovski as a potential backport to 3.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (KAFKA-16053) Fix leaked Default DirectoryService

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16053:
-
Attachment: Screenshot 2023-12-27 at 13.18.33.png

> Fix leaked Default DirectoryService
> ---
>
> Key: KAFKA-16053
> URL: https://issues.apache.org/jira/browse/KAFKA-16053
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Divij Vaidya
>Assignee: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 13.18.33.png
>
>
> Heap dump hinted towards a leaked DefaultDirectoryService while running 
> :core:test. It used 123MB of retained memory.
> This Jira fixes the leak.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16053) Fix leaked Default DirectoryService

2023-12-27 Thread Divij Vaidya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya updated KAFKA-16053:
-
Description: 
Heap dump hinted towards a leaked DefaultDirectoryService while running 
:core:test. It used 123MB of retained memory.

This Jira fixes the leak.

  was:
Heap dump hinted towards a leaked DefaultDirectoryService while running 
:core:test. It used 240MB of retained memory.

This Jira fixes the leak.


> Fix leaked Default DirectoryService
> ---
>
> Key: KAFKA-16053
> URL: https://issues.apache.org/jira/browse/KAFKA-16053
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Divij Vaidya
>Assignee: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 13.18.33.png
>
>
> Heap dump hinted towards a leaked DefaultDirectoryService while running 
> :core:test. It used 123MB of retained memory.
> This Jira fixes the leak.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


showuon commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1437026129


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -2704,28 +2710,30 @@ class ReplicaManagerTest {
   time = time,
   scheduler = time.scheduler,
   logManager = logManager,
-  quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""),
+  quotaManagers = quotaManager,
   metadataCache = MetadataCache.zkMetadataCache(config.brokerId, 
config.interBrokerProtocolVersion),
   logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size),
   alterPartitionManager = alterPartitionManager,
   threadNamePrefix = Option(this.getClass.getName))
 
-logManager.startup(Set.empty[String])
-
-// Create a hosted topic, a hosted topic that will become stray, and a 
stray topic
-val validLogs = createHostedLogs("hosted-topic", numLogs = 2, 
replicaManager).toSet
-createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
-createStrayLogs(10, logManager)
+try {
+  logManager.startup(Set.empty[String])
 
-val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new 
TopicPartition("hosted-topic", 1))
+  // Create a hosted topic, a hosted topic that will become stray, and a 
stray topic
+  val validLogs = createHostedLogs("hosted-topic", numLogs = 2, 
replicaManager).toSet
+  createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
+  createStrayLogs(10, logManager)
 
-
replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
+  val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new 
TopicPartition("hosted-topic", 1))
 
-assertEquals(validLogs, logManager.allLogs.toSet)
-assertEquals(validLogs.size, replicaManager.partitionCount.value)
+  
replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
 
-replicaManager.shutdown()
-logManager.shutdown()
+  assertEquals(validLogs, logManager.allLogs.toSet)
+  assertEquals(validLogs.size, replicaManager.partitionCount.value)
+} finally {
+  replicaManager.shutdown()
+  logManager.shutdown()

Review Comment:
   So, I think I should only swallow exceptions when there is more than one 
instance to close in the finally block. Is that what you expect @satishd?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


showuon commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1437024347


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -2704,28 +2710,30 @@ class ReplicaManagerTest {
   time = time,
   scheduler = time.scheduler,
   logManager = logManager,
-  quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""),
+  quotaManagers = quotaManager,
   metadataCache = MetadataCache.zkMetadataCache(config.brokerId, 
config.interBrokerProtocolVersion),
   logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size),
   alterPartitionManager = alterPartitionManager,
   threadNamePrefix = Option(this.getClass.getName))
 
-logManager.startup(Set.empty[String])
-
-// Create a hosted topic, a hosted topic that will become stray, and a 
stray topic
-val validLogs = createHostedLogs("hosted-topic", numLogs = 2, 
replicaManager).toSet
-createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
-createStrayLogs(10, logManager)
+try {
+  logManager.startup(Set.empty[String])
 
-val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new 
TopicPartition("hosted-topic", 1))
+  // Create a hosted topic, a hosted topic that will become stray, and a 
stray topic
+  val validLogs = createHostedLogs("hosted-topic", numLogs = 2, 
replicaManager).toSet
+  createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
+  createStrayLogs(10, logManager)
 
-
replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
+  val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new 
TopicPartition("hosted-topic", 1))
 
-assertEquals(validLogs, logManager.allLogs.toSet)
-assertEquals(validLogs.size, replicaManager.partitionCount.value)
+  
replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
 
-replicaManager.shutdown()
-logManager.shutdown()
+  assertEquals(validLogs, logManager.allLogs.toSet)
+  assertEquals(validLogs.size, replicaManager.partitionCount.value)
+} finally {
+  replicaManager.shutdown()
+  logManager.shutdown()

Review Comment:
   I think what Satish wants is for us to close both of them without throwing 
any exception, so that a failure in the first close does not prevent closing 
the second instance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


divijvaidya commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1437022630


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -2704,28 +2710,30 @@ class ReplicaManagerTest {
   time = time,
   scheduler = time.scheduler,
   logManager = logManager,
-  quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""),
+  quotaManagers = quotaManager,
   metadataCache = MetadataCache.zkMetadataCache(config.brokerId, 
config.interBrokerProtocolVersion),
   logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size),
   alterPartitionManager = alterPartitionManager,
   threadNamePrefix = Option(this.getClass.getName))
 
-logManager.startup(Set.empty[String])
-
-// Create a hosted topic, a hosted topic that will become stray, and a 
stray topic
-val validLogs = createHostedLogs("hosted-topic", numLogs = 2, 
replicaManager).toSet
-createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
-createStrayLogs(10, logManager)
+try {
+  logManager.startup(Set.empty[String])
 
-val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new 
TopicPartition("hosted-topic", 1))
+  // Create a hosted topic, a hosted topic that will become stray, and a 
stray topic
+  val validLogs = createHostedLogs("hosted-topic", numLogs = 2, 
replicaManager).toSet
+  createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
+  createStrayLogs(10, logManager)
 
-
replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
+  val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new 
TopicPartition("hosted-topic", 1))
 
-assertEquals(validLogs, logManager.allLogs.toSet)
-assertEquals(validLogs.size, replicaManager.partitionCount.value)
+  
replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
 
-replicaManager.shutdown()
-logManager.shutdown()
+  assertEquals(validLogs, logManager.allLogs.toSet)
+  assertEquals(validLogs.size, replicaManager.partitionCount.value)
+} finally {
+  replicaManager.shutdown()
+  logManager.shutdown()

Review Comment:
   Do we want to close them quietly? That would hide memory leaks, since we 
won't get to know if the close is not happening correctly. It's better to fail 
when close has an error so that we find out that something is wrong with the 
tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] KAFKA-16053: Fix leaks of KDC server in tests [kafka]

2023-12-27 Thread via GitHub


divijvaidya opened a new pull request, #15079:
URL: https://github.com/apache/kafka/pull/15079

   # Problem
   We are facing OOM while running the test suite for Apache Kafka, as 
discussed in https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4 
   
   # Changes
   
   This change fixes ~240MB of heap leaked through `DefaultDirectoryService` 
objects. The `DefaultDirectoryService` is used by `MiniKdc.scala`, and the fix 
covers a couple of places where the KDC service is not closed at the end of a 
test run.
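   
   For illustration, a minimal JUnit-style sketch of the close-at-teardown 
pattern this change applies. The `Kdc` interface below is a stand-in used only 
for the sketch, not the real `MiniKdc` API:
   
   ```java
   import org.junit.jupiter.api.AfterEach;
   import org.junit.jupiter.api.BeforeEach;
   import org.junit.jupiter.api.Test;

   public class KdcLifecycleSketchTest {

       // Stand-in for an embedded KDC that owns threads, sockets and a directory service.
       interface Kdc {
           void start();
           void stop();
       }

       private Kdc kdc;

       @BeforeEach
       void setUp() {
           kdc = new Kdc() {          // trivial stand-in implementation for the sketch
               @Override public void start() { }
               @Override public void stop() { }
           };
           kdc.start();
       }

       @AfterEach
       void tearDown() {
           // Stop the KDC even when the test body fails, so whatever it holds
           // (e.g. a directory service instance) cannot accumulate across tests.
           if (kdc != null) {
               kdc.stop();
           }
       }

       @Test
       void kerberosBackedTest() {
           // the test body would use the running KDC here
       }
   }
   ```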


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (KAFKA-16053) Fix leaked Default DirectoryService

2023-12-27 Thread Divij Vaidya (Jira)
Divij Vaidya created KAFKA-16053:


 Summary: Fix leaked Default DirectoryService
 Key: KAFKA-16053
 URL: https://issues.apache.org/jira/browse/KAFKA-16053
 Project: Kafka
  Issue Type: Sub-task
Reporter: Divij Vaidya
Assignee: Divij Vaidya


Heap dump hinted towards a leaked DefaultDirectoryService while running 
:core:test. It used 240MB of retained memory.

This Jira fixes the leak.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Divij Vaidya (Jira)
Divij Vaidya created KAFKA-16052:


 Summary: OOM in Kafka test suite
 Key: KAFKA-16052
 URL: https://issues.apache.org/jira/browse/KAFKA-16052
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.7.0
Reporter: Divij Vaidya


Our test suite is failing with frequent OOM. Discussion in the mailing list is 
here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 

To find the source of leaks, I ran the :core:test build target with a single 
thread (see below on how to do it) and attached a profiler to it. This Jira 
tracks the list of action items identified from the analysis.

How to run tests using a single thread:


{code:java}
diff --git a/build.gradle b/build.gradle
index f7abbf4f0b..81df03f1ee 100644
--- a/build.gradle
+++ b/build.gradle
@@ -74,9 +74,8 @@ ext {
       "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
     )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
-  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
maxScalacThreads.toInteger() :
-      Math.min(Runtime.runtime.availableProcessors(), 8)
+  maxTestForks = 1
+  maxScalacThreads = 1
   userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures 
: false   userMaxTestRetries = project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0
diff --git a/gradle.properties b/gradle.properties
index 4880248cac..ee4b6e3bc1 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -30,4 +30,4 @@ scalaVersion=2.13.12
 swaggerVersion=2.2.8
 task=build
 org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
-org.gradle.parallel=true
+org.gradle.parallel=false {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


showuon commented on PR #15077:
URL: https://github.com/apache/kafka/pull/15077#issuecomment-1870263212

   @satishd , I've closed all the instances quietly in the finally block in 
this commit: 
https://github.com/apache/kafka/pull/15077/commits/dd913a8668cf773a51403e482cc704b44cb0e8a1
 . Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: New year code cleanup - include final keyword [kafka]

2023-12-27 Thread via GitHub


vamossagar12 commented on PR #15072:
URL: https://github.com/apache/kafka/pull/15072#issuecomment-1870244078

   Thanks @divijvaidya , I think Ismael's point is correct. From what I recall, 
the streams final rule also expects the method arguments etc. to be final 
(apart from local variables within methods). Enabling that rule would be a big 
change, I guess. Also, I see only variables and parameters 
[here](https://checkstyle.sourceforge.io/checks/coding/finallocalvariable.html#FinalLocalVariable)
 as tokens to check. I think we can leave it at this, but the only problem is 
that there won't be any such enforcement of using final variables, and 
cleanups might be needed in the future. That should be ok IMO.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


satishd commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1436993064


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -4020,6 +4031,7 @@ class ReplicaManagerTest {
   doneLatch.countDown()
 } finally {
   replicaManager.shutdown(checkpointHW = false)
+  remoteLogManager.close()

Review Comment:
   Can you close/shut down safely by swallowing any intermediate exceptions that occur, 
as shown below?
   ```
 CoreUtils.swallow(replicaManager.shutdown(checkpointHW = false), this)
 CoreUtils.swallow(remoteLogManager.close(), this)
   ```
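   For context, the idea behind a swallow-style helper is that it runs the cleanup action and logs any Throwable instead of rethrowing it, so the second shutdown call still executes even if the first one fails. Below is a minimal, self-contained Scala sketch of that pattern; it only illustrates the idea and is not the actual kafka.utils.CoreUtils implementation, and the resource class is made up for the example.

   ```scala
   object SwallowSketch {
     // Run `action` and log (rather than rethrow) any failure, so callers can chain
     // several cleanup steps and still execute all of them.
     def swallow(action: => Unit): Unit =
       try action
       catch { case t: Throwable => System.err.println(s"Ignoring error during cleanup: $t") }

     // A stand-in resource whose close() may fail, used only to demonstrate the pattern.
     final class Noisy(name: String, failOnClose: Boolean) extends AutoCloseable {
       override def close(): Unit = {
         if (failOnClose) throw new IllegalStateException(s"failure while closing $name")
         println(s"$name closed")
       }
     }

     def main(args: Array[String]): Unit = {
       val first = new Noisy("first", failOnClose = true)
       val second = new Noisy("second", failOnClose = false)
       // Even though closing `first` throws, `second` is still closed.
       swallow(first.close())
       swallow(second.close())
     }
   }
   ```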



##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -2704,28 +2710,30 @@ class ReplicaManagerTest {
   time = time,
   scheduler = time.scheduler,
   logManager = logManager,
-  quotaManagers = QuotaFactory.instantiate(config, metrics, time, ""),
+  quotaManagers = quotaManager,
   metadataCache = MetadataCache.zkMetadataCache(config.brokerId, config.interBrokerProtocolVersion),
   logDirFailureChannel = new LogDirFailureChannel(config.logDirs.size),
   alterPartitionManager = alterPartitionManager,
   threadNamePrefix = Option(this.getClass.getName))

-logManager.startup(Set.empty[String])
-
-// Create a hosted topic, a hosted topic that will become stray, and a stray topic
-val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet
-createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
-createStrayLogs(10, logManager)
+try {
+  logManager.startup(Set.empty[String])

-val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1))
+  // Create a hosted topic, a hosted topic that will become stray, and a stray topic
+  val validLogs = createHostedLogs("hosted-topic", numLogs = 2, replicaManager).toSet
+  createHostedLogs("hosted-stray", numLogs = 10, replicaManager).toSet
+  createStrayLogs(10, logManager)

-replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))
+  val allReplicasFromLISR = Set(new TopicPartition("hosted-topic", 0), new TopicPartition("hosted-topic", 1))

-assertEquals(validLogs, logManager.allLogs.toSet)
-assertEquals(validLogs.size, replicaManager.partitionCount.value)
+  replicaManager.updateStrayLogs(replicaManager.findStrayPartitionsFromLeaderAndIsr(allReplicasFromLISR))

-replicaManager.shutdown()
-logManager.shutdown()
+  assertEquals(validLogs, logManager.allLogs.toSet)
+  assertEquals(validLogs.size, replicaManager.partitionCount.value)
+} finally {
+  replicaManager.shutdown()
+  logManager.shutdown()

Review Comment:
   Can you close both of them safely, as below, ignoring any intermediate failures?
   
   ```
 CoreUtils.swallow(replicaManager.shutdown(checkpointHW = false), this)
 CoreUtils.swallow(remoteLogManager.close(), this)
   
   ```



##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -4116,6 +4128,7 @@ class ReplicaManagerTest {
   latch.countDown()
 } finally {
   replicaManager.shutdown(checkpointHW = false)
+  remoteLogManager.close()

Review Comment:
   Please close safely as mentioned in the earlier comment.






Re: [PR] MINOR: close leaking threads in replica manager tests [kafka]

2023-12-27 Thread via GitHub


showuon commented on code in PR #15077:
URL: https://github.com/apache/kafka/pull/15077#discussion_r1436972102


##
core/src/test/scala/unit/kafka/server/ReplicaManagerTest.scala:
##
@@ -2693,6 +2693,9 @@ class ReplicaManagerTest {
 } else {
   assertTrue(stray0.isInstanceOf[HostedPartition.Online])
 }
+
+replicaManager.shutdown()
+logManager.shutdown()

Review Comment:
   You're right! Updated. 






Re: [PR] MINOR: New year code cleanup - include final keyword [kafka]

2023-12-27 Thread via GitHub


divijvaidya commented on PR #15072:
URL: https://github.com/apache/kafka/pull/15072#issuecomment-1870156605

   Checkstyle doesn't have a rule [1] available to enforce that fields which are never 
reassigned are marked as final. Hence, I am not changing anything in the checkstyle 
configuration in this PR.
   
   [1] https://checkstyle.sourceforge.io/checks.html




