[jira] [Resolved] (KAFKA-17924) Remove `bufferpool-wait-time-total`, `io-waittime-total`, and `iotime-total`
[ https://issues.apache.org/jira/browse/KAFKA-17924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya resolved KAFKA-17924. -- Resolution: Fixed > Remove `bufferpool-wait-time-total`, `io-waittime-total`, and `iotime-total` > > > Key: KAFKA-17924 > URL: https://issues.apache.org/jira/browse/KAFKA-17924 > Project: Kafka > Issue Type: Improvement >Reporter: Chia-Ping Tsai >Assignee: Jhen-Yung Hsu >Priority: Major > Labels: breaking > Fix For: 4.0.0 > > > They have been deprecated for over 3 years -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-17924) Remove `bufferpool-wait-time-total`, `io-waittime-total`, and `iotime-total`
[ https://issues.apache.org/jira/browse/KAFKA-17924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-17924: - Labels: breaking (was: ) > Remove `bufferpool-wait-time-total`, `io-waittime-total`, and `iotime-total` > > > Key: KAFKA-17924 > URL: https://issues.apache.org/jira/browse/KAFKA-17924 > Project: Kafka > Issue Type: Improvement >Reporter: Chia-Ping Tsai >Assignee: Jhen-Yung Hsu >Priority: Major > Labels: breaking > Fix For: 4.0.0 > > > They have been deprecated for over 3 years -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-17924) Remove `bufferpool-wait-time-total`, `io-waittime-total`, and `iotime-total`
[ https://issues.apache.org/jira/browse/KAFKA-17924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-17924: - Fix Version/s: 4.0.0 > Remove `bufferpool-wait-time-total`, `io-waittime-total`, and `iotime-total` > > > Key: KAFKA-17924 > URL: https://issues.apache.org/jira/browse/KAFKA-17924 > Project: Kafka > Issue Type: Improvement >Reporter: Chia-Ping Tsai >Assignee: Jhen-Yung Hsu >Priority: Major > Fix For: 4.0.0 > > > They have been deprecated for over 3 years -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-17062) RemoteLogManager - RemoteStorageException causes data loss
[ https://issues.apache.org/jira/browse/KAFKA-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862439#comment-17862439 ] Divij Vaidya commented on KAFKA-17062: -- I haven't looked at it in great detail but the bug sounds legitimate. For a fix, we need to filter the segment based on their state at [https://github.com/apache/kafka/blob/d0dfefbe6394276eb329b6ca998842a984add506/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L1184] and only delete segments which are in copy_finished and/or delete_started state. Note that inclusion of copy_started segments was intentional as per [https://github.com/apache/kafka/pull/13561#discussion_r1181527119] but we probably missed that the calculated retention size will include all previous copy_started copies of the same segment. cc: [~showuon] [~Kamal C] [~satishd] > RemoteLogManager - RemoteStorageException causes data loss > -- > > Key: KAFKA-17062 > URL: https://issues.apache.org/jira/browse/KAFKA-17062 > Project: Kafka > Issue Type: Bug > Components: Tiered-Storage >Reporter: Guillaume Mallet >Priority: Major > Fix For: 3.7.0, 3.8.0, 3.7.1, 3.9.0 > > > When Tiered Storage is configured, retention.bytes defines the limit for the > amount of data stored in the filesystem and in remote storage. However a > failure while offloading to remote storage can cause segments to be dropped > before the retention limit is met. > What happens > Assuming a topic configured with {{retention.bytes=4294967296}} (4GB) and a > {{local.retention.bytes=1073741824}} (1GB, equal to segment.bytes) we would > expect Kafka to keep up to 3 segments (3GB) in the remote store and 1 segment > locally (the local segment) and possibly more if the remote storage is > offline. i.e. segments in the following RemoteLogSegmentStates in the > RemoteLogMetadataManager (RLMM) : > * Segment 3 ({{{}COPY_SEGMENT_FINISHED{}}}) > * Segment 2 ({{{}COPY_SEGMENT_FINISHED{}}}) > * Segment 1 ({{{}COPY_SEGMENT_FINISHED{}}}) > Let's assume the RLMM starts failing when segment 4 rolls. At the first > iteration of an RLMTask we will have - > * > [{{copyLogSegmentsToRemote}}|https://github.com/apache/kafka/blob/d0dfefbe6394276eb329b6ca998842a984add506/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L773] > : is called first > ** RLMM becomes aware of Segment 4 and adds it to the metadata: > *** Segment 4 ({{{}COPY_SEGMENT_STARTED{}}}), > *** Segment 3 ({{{}COPY_SEGMENT_FINISHED{}}}), > *** Segment 2 ({{{}COPY_SEGMENT_FINISHED{}}}), > *** Segment 1 ({{{}COPY_SEGMENT_FINISHED{}}}) > ** An exception is raised during the copy operation > ([{{copyLogSegmentData}}|https://github.com/apache/kafka/blob/d0dfefbe6394276eb329b6ca998842a984add506/storage/api/src/main/java/org/apache/kafka/server/log/remote/storage/RemoteStorageManager.java#L93] > in RemoteStorageManager) which is caught with the error message “{{Error > occurred while copying log segments of partition}}” and no further copy will > be attempted for the duration of this RLMTask. > ** At that point the Segment will never move to {{COPY_SEGMENT_FINISHED}} > but will transition to {{DELETE_SEGMENT_STARTED}} eventually before being > cleaned up when the associated segment is deleted. 
> * > [{{cleanupExpiredRemoteLogSegments}}|https://github.com/apache/kafka/blob/d0dfefbe6394276eb329b6ca998842a984add506/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L1122] > is then called > ** Retention size is computed in > [{{buildRetentionSizeData}}|https://github.com/apache/kafka/blob/d0dfefbe6394276eb329b6ca998842a984add506/core/src/main/java/kafka/log/remote/RemoteLogManager.java#L1296] > as the sum of all the segments size regardless of their state so computed > size of the topic is 1 (local) + 4 (remote) > ** Segment 1 as being the oldest will be dropped. > At the second iteration after > [{{remote.log.manager.task.interval.ms}}|https://github.com/apache/kafka/blob/d0dfefbe6394276eb329b6ca998842a984add506/storage/src/main/java/org/apache/kafka/server/log/remote/storage/RemoteLogManagerConfig.java#L395] > (default: 30s), the same will happen. The RLMM will now have 2 x Segment 4 > in a {{COPY_SEGMENT_STARTED}} state each with a different > {{RemoteLogSegmentId}} and Segment 2 will be dropped. The same will happen to > Segment 3 after another iteration. > At that point, we now have the RLMM composed of 4 copies of Segment 4 in > {{COPY_SEGMENT_STARTED}} state. Segment 4 is marked for deletion increasing > the LSO at the same time and causing the UnifiedLog to delete the local and > remote data for Segment 4 including its metadata. > Under those circumstances Kafka can quickly delete segments that were not > meant for deletion causing a data loss. > Steps to
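To make the fix direction in the comment above concrete, here is a minimal, self-contained Java sketch of computing the retention-size total only over segments whose copy has completed (or whose deletion has already started). The SegmentState and SegmentMetadata types below are simplified stand-ins, not the actual RemoteLogSegmentState/RemoteLogSegmentMetadata classes used by RemoteLogManager.
{code:java}
import java.util.List;

// Hypothetical stand-ins for the real remote-log segment metadata types.
enum SegmentState { COPY_SEGMENT_STARTED, COPY_SEGMENT_FINISHED, DELETE_SEGMENT_STARTED, DELETE_SEGMENT_FINISHED }

record SegmentMetadata(long baseOffset, long sizeInBytes, SegmentState state) { }

public class RetentionSizeSketch {
    // Only segments whose copy completed (or that are already being deleted) count
    // towards the retention.bytes budget; half-copied retries are excluded, otherwise
    // each failed COPY_SEGMENT_STARTED attempt inflates the computed size and pushes
    // valid old segments over the retention limit.
    static long retentionEligibleBytes(List<SegmentMetadata> remoteSegments, long localLogSizeBytes) {
        long remoteBytes = remoteSegments.stream()
                .filter(s -> s.state() == SegmentState.COPY_SEGMENT_FINISHED
                          || s.state() == SegmentState.DELETE_SEGMENT_STARTED)
                .mapToLong(SegmentMetadata::sizeInBytes)
                .sum();
        return remoteBytes + localLogSizeBytes;
    }

    public static void main(String[] args) {
        List<SegmentMetadata> segments = List.of(
                new SegmentMetadata(0, 1L << 30, SegmentState.COPY_SEGMENT_FINISHED),
                new SegmentMetadata(1_000_000, 1L << 30, SegmentState.COPY_SEGMENT_FINISHED),
                // two failed upload attempts of the same segment: ignored by the filter
                new SegmentMetadata(2_000_000, 1L << 30, SegmentState.COPY_SEGMENT_STARTED),
                new SegmentMetadata(2_000_000, 1L << 30, SegmentState.COPY_SEGMENT_STARTED));
        System.out.println(retentionEligibleBytes(segments, 1L << 30)); // 3 GiB, not 5 GiB
    }
}
{code}
With this filter, the repeated COPY_SEGMENT_STARTED attempts produced by a failing RemoteStorageManager no longer inflate the computed size, so older COPY_SEGMENT_FINISHED segments are not deleted prematurely.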
[jira] [Assigned] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya reassigned KAFKA-16052: Assignee: Divij Vaidya Resolution: Fixed > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Fix For: 3.8.0 > > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, Screenshot > 2023-12-28 at 18.44.19.png, Screenshot 2024-01-10 at 14.59.47.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16385) Segment is rolled before segment.ms or segment.bytes breached
[ https://issues.apache.org/jira/browse/KAFKA-16385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828247#comment-17828247 ] Divij Vaidya commented on KAFKA-16385: -- [~showuon] I must be missing something here but the current behaviour looks correct to me. Let's consider a use case from an Apache Kafka user: I have set the max segment size to 1 GB and I have a topic with low ingress traffic. I want to expire data in my log every 1 day due to a compliance requirement. But the partition doesn't receive 1GB of data in one day and hence, my active segment will never become eligible for expiration. Now, the user can set segment.ms = 1 day to force a rotation even when the segment is not full. This should satisfy the use case. But how do we define the behaviour when the expiration configuration is less than the roll configuration? We have two options: Option 1: Ignore the expiration config if it is less than the rotation config Option 2: The expiration config overrides the rotation config Option 1 prioritizes an internal configuration (ideally a user shouldn't know about segments etc. in a log) over a functional config (the user wants to expire data). This requires users to know about inner details of logs such as the presence of a segment or index etc. At Apache Kafka, we have chosen option 2, i.e. prioritize a user-facing functionality config (expiration config) over an internal config (rotation config). Thoughts? > Segment is rolled before segment.ms or segment.bytes breached > - > > Key: KAFKA-16385 > URL: https://issues.apache.org/jira/browse/KAFKA-16385 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.5.1, 3.7.0 >Reporter: Luke Chen >Assignee: Kuan Po Tseng >Priority: Major > > Steps to reproduce: > 0. Startup a broker with `log.retention.check.interval.ms=1000` to speed up > the test. > 1. Create a topic with the config: segment.ms=7days , retention.ms=1sec . > 2. Send a record "aaa" to the topic > 3. Wait for 1 second > Will this segment be rolled? I thought no. > But what I have tested shows it will roll: > {code:java} > [2024-03-19 15:23:13,924] INFO [LocalLog partition=t2-1, > dir=/tmp/kafka-logs_jbod] Rolled new log segment at offset 1 in 3 ms. > (kafka.log.LocalLog) > [2024-03-19 15:23:13,925] INFO [ProducerStateManager partition=t2-1] Wrote > producer snapshot at offset 1 with 1 producer ids in 1 ms. > (org.apache.kafka.storage.internals.log.ProducerStateManager) > [2024-03-19 15:23:13,925] INFO [UnifiedLog partition=t2-1, > dir=/tmp/kafka-logs_jbod] Deleting segment LogSegment(baseOffset=0, size=71, > lastModifiedTime=1710832993131, largestRecordTimestamp=1710832992125) due to > log retention time 1000ms breach based on the largest record timestamp in the > segment (kafka.log.UnifiedLog) > {code} > The segment is rolled due to the log retention time 1000ms being breached, which is > unexpected. > Tested in v3.5.1, it has the same issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
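For readers who want to reproduce the behaviour discussed above, a minimal sketch of the topic setup from the report (segment.ms = 7 days, retention.ms = 1 second) using the Java Admin client; the bootstrap address and topic name are placeholders.
{code:java}
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SegmentRollRepro {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // segment.ms is much larger than retention.ms, matching the report:
            // the retention check (every log.retention.check.interval.ms) still
            // rolls and deletes the active segment once retention.ms is breached.
            NewTopic topic = new NewTopic("t2", 2, (short) 1)
                    .configs(Map.of(
                            TopicConfig.SEGMENT_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000),
                            TopicConfig.RETENTION_MS_CONFIG, "1000"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
{code}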
[jira] [Updated] (KAFKA-16073) Kafka Tiered Storage: Consumer Fetch Error Due to Delayed localLogStartOffset Update During Segment Deletion
[ https://issues.apache.org/jira/browse/KAFKA-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16073: - Fix Version/s: 3.6.2 (was: 3.6.1) > Kafka Tiered Storage: Consumer Fetch Error Due to Delayed localLogStartOffset > Update During Segment Deletion > > > Key: KAFKA-16073 > URL: https://issues.apache.org/jira/browse/KAFKA-16073 > Project: Kafka > Issue Type: Bug > Components: core, Tiered-Storage >Affects Versions: 3.6.1 >Reporter: hzh0425 >Assignee: hzh0425 >Priority: Major > Labels: KIP-405, kip-405, tiered-storage > Fix For: 3.6.2, 3.8.0 > > > The identified bug in Apache Kafka's tiered storage feature involves a > delayed update of {{localLogStartOffset}} in the > {{UnifiedLog.deleteSegments}} method, impacting consumer fetch operations. > When segments are deleted from the log's memory state, the > {{localLogStartOffset}} isn't promptly updated. Concurrently, > {{ReplicaManager.handleOffsetOutOfRangeError}} checks if a consumer's fetch > offset is less than the {{{}localLogStartOffset{}}}. If it's greater, Kafka > erroneously sends an {{OffsetOutOfRangeException}} to the consumer. > In a specific concurrent scenario, imagine sequential offsets: {{{}offset1 < > offset2 < offset3{}}}. A client requests data at {{{}offset2{}}}. While a > background deletion process removes segments from memory, it hasn't yet > updated the {{LocalLogStartOffset}} from {{offset1}} to {{{}offset3{}}}. > Consequently, when the fetch offset ({{{}offset2{}}}) is evaluated against > the stale {{offset1}} in {{{}ReplicaManager.handleOffsetOutOfRangeError{}}}, > it incorrectly triggers an {{{}OffsetOutOfRangeException{}}}. This issue > arises from the out-of-sync update of {{{}localLogStartOffset{}}}, leading to > incorrect handling of consumer fetch requests and potential data access > errors. -- This message was sent by Atlassian Jira (v8.20.10#820010)
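A simplified, self-contained sketch of the routing decision described in this ticket — not the actual UnifiedLog/ReplicaManager code — showing how a stale localLogStartOffset turns a fetch that should be served from tiered storage into an OffsetOutOfRangeException:
{code:java}
// Simplified model of the decision described above; names are illustrative.
public class StaleOffsetSketch {
    // Updated by the asynchronous segment-deletion path.
    private volatile long localLogStartOffset = 100;       // "offset1"
    private volatile long firstRemainingLocalOffset = 300; // "offset3" after local segments are deleted

    String handleFetch(long fetchOffset) {
        if (fetchOffset < localLogStartOffset) {
            return "serve from tiered storage";
        }
        // Deemed "local", but the segments holding fetchOffset may already be gone.
        if (fetchOffset < firstRemainingLocalOffset) {
            return "OffsetOutOfRangeException"; // the erroneous outcome in the report
        }
        return "serve from local log";
    }

    public static void main(String[] args) {
        StaleOffsetSketch log = new StaleOffsetSketch();
        System.out.println(log.handleFetch(200)); // "offset2": wrongly rejected
        // Fix direction: advance localLogStartOffset before (or atomically with)
        // dropping the segments, so the same fetch is routed to tiered storage.
        log.localLogStartOffset = 300;
        System.out.println(log.handleFetch(200)); // now served from remote
    }
}
{code}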
[jira] [Created] (KAFKA-16368) Change constraints and default values for various configurations
Divij Vaidya created KAFKA-16368: Summary: Change constraints and default values for various configurations Key: KAFKA-16368 URL: https://issues.apache.org/jira/browse/KAFKA-16368 Project: Kafka Issue Type: Improvement Reporter: Divij Vaidya Fix For: 4.0.0 This Jira is a parent item to track all the defaults and/or constraints that we would like to change with Kafka 4.0. This Jira will be associated with a KIP. Currently, we are gathering feedback from the community on the configurations that don't have sane defaults. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16114) Fix partiton not retention after cancel alter intra broker log dir task
[ https://issues.apache.org/jira/browse/KAFKA-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825363#comment-17825363 ] Divij Vaidya commented on KAFKA-16114: -- Sorry [~albedooo] , I won't have bandwidth any time soon to look into this. > Fix partiton not retention after cancel alter intra broker log dir task > > > Key: KAFKA-16114 > URL: https://issues.apache.org/jira/browse/KAFKA-16114 > Project: Kafka > Issue Type: Bug > Components: log >Affects Versions: 3.3.2, 3.6.1 >Reporter: wangliucheng >Priority: Major > > The deletion thread will not work on partition after cancel alter intra > broker log dir task > The steps to reproduce are as follows: > 1、Create reassignment.json file > test01-1 on the /data01/kafka/log01 directory of the broker 1003,then move to > /data01/kafka/log02 > {code:java} > { > "version": 1, > "partitions": [ > { > "topic": "test01", > "partition": 1, > "replicas": [1001,1003], > "log_dirs": ["any","/data01/kafka/log02"] > } > ] > }{code} > 2、Kick off the reassignment > {code:java} > bin/kafka-reassign-partitions.sh -bootstrap-server localhost:9092 > --reassignment-json-file reassignment.json -execute {code} > 3、Cancel the reassignment > {code:java} > bin/kafka-reassign-partitions.sh -bootstrap-server localhost:9092 > --reassignment-json-file reassignment.json -cancel {code} > 4、Result, The partition test01-1 on 1003 will not be deleted > The reason for this problem is the partition has been filtered: > {code:java} > val deletableLogs = logs.filter { > case (_, log) => !log.config.compact // pick non-compacted logs > }.filterNot { > case (topicPartition, _) => inProgress.contains(topicPartition) // skip any > logs already in-progress > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16126) Kcontroller dynamic configurations may fail to apply at startup
[ https://issues.apache.org/jira/browse/KAFKA-16126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16126: - Component/s: kraft > Kcontroller dynamic configurations may fail to apply at startup > --- > > Key: KAFKA-16126 > URL: https://issues.apache.org/jira/browse/KAFKA-16126 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.7.0 >Reporter: Colin McCabe >Assignee: Colin McCabe >Priority: Blocker > Fix For: 3.7.0, 3.6.2 > > > Some kcontroller dynamic configurations may fail to apply at startup. This > happens because there is a race between registering the reconfigurables to > the DynamicBrokerConfig class, and receiving the first update from the > metadata publisher. We can fix this by registering the reconfigurables first. > This seems to have been introduced by the "MINOR: Install ControllerServer > metadata publishers sooner" change. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16126) Kcontroller dynamic configurations may fail to apply at startup
[ https://issues.apache.org/jira/browse/KAFKA-16126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya resolved KAFKA-16126. -- Resolution: Fixed > Kcontroller dynamic configurations may fail to apply at startup > --- > > Key: KAFKA-16126 > URL: https://issues.apache.org/jira/browse/KAFKA-16126 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Colin McCabe >Assignee: Colin McCabe >Priority: Blocker > Fix For: 3.6.2, 3.7.0 > > > Some kcontroller dynamic configurations may fail to apply at startup. This > happens because there is a race between registering the reconfigurables to > the DynamicBrokerConfig class, and receiving the first update from the > metadata publisher. We can fix this by registering the reconfigurables first. > This seems to have been introduced by the "MINOR: Install ControllerServer > metadata publishers sooner" change. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16126) Kcontroller dynamic configurations may fail to apply at startup
[ https://issues.apache.org/jira/browse/KAFKA-16126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16126: - Fix Version/s: (was: 3.8.0) > Kcontroller dynamic configurations may fail to apply at startup > --- > > Key: KAFKA-16126 > URL: https://issues.apache.org/jira/browse/KAFKA-16126 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Colin McCabe >Assignee: Colin McCabe >Priority: Blocker > Fix For: 3.7.0, 3.6.2 > > > Some kcontroller dynamic configurations may fail to apply at startup. This > happens because there is a race between registering the reconfigurables to > the DynamicBrokerConfig class, and receiving the first update from the > metadata publisher. We can fix this by registering the reconfigurables first. > This seems to have been introduced by the "MINOR: Install ControllerServer > metadata publishers sooner" change. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16126) Kcontroller dynamic configurations may fail to apply at startup
[ https://issues.apache.org/jira/browse/KAFKA-16126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824828#comment-17824828 ] Divij Vaidya commented on KAFKA-16126: -- Seems like the associated commit in the PR was merged to 3.6, 3.7 and trunk. 3.6 -[https://github.com/apache/kafka/commit/b743f6fd884132c7a5c4e9d96ed62e3aec29007f] 3.7 - [https://github.com/apache/kafka/commit/b40368330814888d7f7f2fda3f5b7ecfa1eabeb2] trunk - [https://github.com/apache/kafka/commit/0015d0f01b130992acc37da85da6ee2088186a1f] I am correcting the fix version here and closing this ticket. > Kcontroller dynamic configurations may fail to apply at startup > --- > > Key: KAFKA-16126 > URL: https://issues.apache.org/jira/browse/KAFKA-16126 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Colin McCabe >Assignee: Colin McCabe >Priority: Blocker > > Some kcontroller dynamic configurations may fail to apply at startup. This > happens because there is a race between registering the reconfigurables to > the DynamicBrokerConfig class, and receiving the first update from the > metadata publisher. We can fix this by registering the reconfigurables first. > This seems to have been introduced by the "MINOR: Install ControllerServer > metadata publishers sooner" change. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16126) Kcontroller dynamic configurations may fail to apply at startup
[ https://issues.apache.org/jira/browse/KAFKA-16126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16126: - Fix Version/s: 3.6.2 3.8.0 3.7.0 > Kcontroller dynamic configurations may fail to apply at startup > --- > > Key: KAFKA-16126 > URL: https://issues.apache.org/jira/browse/KAFKA-16126 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Colin McCabe >Assignee: Colin McCabe >Priority: Blocker > Fix For: 3.7.0, 3.6.2, 3.8.0 > > > Some kcontroller dynamic configurations may fail to apply at startup. This > happens because there is a race between registering the reconfigurables to > the DynamicBrokerConfig class, and receiving the first update from the > metadata publisher. We can fix this by registering the reconfigurables first. > This seems to have been introduced by the "MINOR: Install ControllerServer > metadata publishers sooner" change. -- This message was sent by Atlassian Jira (v8.20.10#820010)
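A minimal illustration of the startup race described in this ticket, with hypothetical stand-in types rather than the real DynamicBrokerConfig and metadata-publisher classes: if the first dynamic-config publish arrives before the reconfigurables are registered, the update is silently lost, which is why the fix registers the reconfigurables first.
{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-ins for the controller's dynamic-config plumbing.
interface Reconfigurable {
    void reconfigure(Map<String, String> newConfig);
}

class DynamicConfigPublisher {
    private final List<Reconfigurable> reconfigurables = new CopyOnWriteArrayList<>();

    void register(Reconfigurable r) {
        reconfigurables.add(r);
    }

    // Called when a metadata update carrying dynamic configs arrives. Anything not
    // yet registered at this point simply misses the update: if the first publish
    // races ahead of registration, those configs are never applied at startup.
    void publish(Map<String, String> dynamicConfigs) {
        reconfigurables.forEach(r -> r.reconfigure(dynamicConfigs));
    }
}

public class StartupRaceSketch {
    public static void main(String[] args) {
        DynamicConfigPublisher publisher = new DynamicConfigPublisher();
        publisher.publish(Map.of("max.connections", "512"));              // first update arrives early...
        publisher.register(cfg -> System.out.println("applied " + cfg));  // ...listener registered too late
        // Nothing was printed: the early update was lost.
    }
}
{code}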
[jira] [Assigned] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya reassigned KAFKA-15490: Assignee: Divij Vaidya > Invalid path provided to the log failure channel upon I/O error when writing > broker metadata checkpoint > --- > > Key: KAFKA-15490 > URL: https://issues.apache.org/jira/browse/KAFKA-15490 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 3.4.0, 3.4.1, 3.5.1, 3.6.1 >Reporter: Alexandre Dupriez >Assignee: Divij Vaidya >Priority: Minor > > There is a small bug/typo in the handling of I/O error when writing broker > metadata checkpoint in {{{}KafkaServer{}}}. The path provided to the log dir > failure channel is the full path of the checkpoint file whereas only the log > directory is expected > ([source|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/server/KafkaServer.scala#L958C8-L961C8]). > {code:java} > case e: IOException => >val dirPath = checkpoint.file.getAbsolutePath >logDirFailureChannel.maybeAddOfflineLogDir(dirPath, s"Error while writing > meta.properties to $dirPath", e){code} > As a result, after an {{IOException}} is captured and enqueued in the log dir > failure channel ({{{}{}}} is to be replaced with the actual path of > the log directory): > {code:java} > [2023-09-22 17:07:32,052] ERROR Error while writing meta.properties to > /meta.properties (kafka.server.LogDirFailureChannel) > java.io.IOException{code} > The log dir failure handler cannot lookup the log directory: > {code:java} > [2023-09-22 17:07:32,053] ERROR [LogDirFailureHandler]: Error due to > (kafka.server.ReplicaManager$LogDirFailureHandler) > org.apache.kafka.common.errors.LogDirNotFoundException: Log dir > /meta.properties is not found in the config.{code} > An immediate fix for this is to use the {{logDir}} provided from to the > checkpointing method instead of the path of the metadata file. > For brokers with only one log directory, this bug will result in preventing > the broker from shutting down as expected. > The L{{{}ogDirNotFoundException{}}} then kills the log dir failure handler > thread, and subsequent {{IOException}} are not handled, and the broker never > stops. > {code:java} > [2024-02-27 02:13:13,564] INFO [LogDirFailureHandler]: Stopped > (kafka.server.ReplicaManager$LogDirFailureHandler){code} > Another consideration here is whether the {{LogDirNotFoundException}} should > terminate the log dir failure handler thread. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-15490) Invalid path provided to the log failure channel upon I/O error when writing broker metadata checkpoint
[ https://issues.apache.org/jira/browse/KAFKA-15490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-15490: - Affects Version/s: 3.6.1 > Invalid path provided to the log failure channel upon I/O error when writing > broker metadata checkpoint > --- > > Key: KAFKA-15490 > URL: https://issues.apache.org/jira/browse/KAFKA-15490 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 3.4.0, 3.4.1, 3.5.1, 3.6.1 >Reporter: Alexandre Dupriez >Priority: Minor > > There is a small bug/typo in the handling of I/O error when writing broker > metadata checkpoint in {{{}KafkaServer{}}}. The path provided to the log dir > failure channel is the full path of the checkpoint file whereas only the log > directory is expected > ([source|https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/server/KafkaServer.scala#L958C8-L961C8]). > {code:java} > case e: IOException => >val dirPath = checkpoint.file.getAbsolutePath >logDirFailureChannel.maybeAddOfflineLogDir(dirPath, s"Error while writing > meta.properties to $dirPath", e){code} > As a result, after an {{IOException}} is captured and enqueued in the log dir > failure channel ({{{}{}}} is to be replaced with the actual path of > the log directory): > {code:java} > [2023-09-22 17:07:32,052] ERROR Error while writing meta.properties to > /meta.properties (kafka.server.LogDirFailureChannel) > java.io.IOException{code} > The log dir failure handler cannot lookup the log directory: > {code:java} > [2023-09-22 17:07:32,053] ERROR [LogDirFailureHandler]: Error due to > (kafka.server.ReplicaManager$LogDirFailureHandler) > org.apache.kafka.common.errors.LogDirNotFoundException: Log dir > /meta.properties is not found in the config.{code} > An immediate fix for this is to use the {{logDir}} provided from to the > checkpointing method instead of the path of the metadata file. > For brokers with only one log directory, this bug will result in preventing > the broker from shutting down as expected. > The L{{{}ogDirNotFoundException{}}} then kills the log dir failure handler > thread, and subsequent {{IOException}} are not handled, and the broker never > stops. > {code:java} > [2024-02-27 02:13:13,564] INFO [LogDirFailureHandler]: Stopped > (kafka.server.ReplicaManager$LogDirFailureHandler){code} > Another consideration here is whether the {{LogDirNotFoundException}} should > terminate the log dir failure handler thread. -- This message was sent by Atlassian Jira (v8.20.10#820010)
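A self-contained Java sketch (the code in question is Scala, in KafkaServer and LogDirFailureChannel) of why reporting the checkpoint file's absolute path instead of the log directory breaks the offline-dir lookup; the paths are illustrative.
{code:java}
import java.util.Set;

public class OfflineDirSketch {
    // The failure handler later looks the reported path up against the configured log.dirs.
    static void markOffline(Set<String> configuredLogDirs, String reportedPath) {
        if (!configuredLogDirs.contains(reportedPath)) {
            // This mirrors the LogDirNotFoundException path from the ticket, which also
            // kills the handler thread so later IOExceptions are never acted on.
            throw new IllegalArgumentException("Log dir " + reportedPath + " is not found in the config.");
        }
        System.out.println("taking " + reportedPath + " offline");
    }

    public static void main(String[] args) {
        Set<String> logDirs = Set.of("/var/kafka-logs");              // illustrative log.dirs value
        markOffline(logDirs, "/var/kafka-logs");                      // fix: report the directory
        markOffline(logDirs, "/var/kafka-logs/meta.properties");      // bug: reports the checkpoint file, throws
    }
}
{code}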
[jira] [Created] (KAFKA-16325) Add missing producer metrics to documentation
Divij Vaidya created KAFKA-16325: Summary: Add missing producer metrics to documentation Key: KAFKA-16325 URL: https://issues.apache.org/jira/browse/KAFKA-16325 Project: Kafka Issue Type: Improvement Components: documentation, website Reporter: Divij Vaidya Some producer metrics such as buffer-exhausted-rate [1]are missing from the documentation at [https://kafka.apache.org/documentation.html#producer_monitoring] Hence, users of Kafka sometimes don't know about these metrics at all. This task will add these (and possibly any other missing) metrics to the documentation. An example of a similar PR where metrics were added to the documentation is at [https://github.com/apache/kafka/pull/12934] [1] [https://github.com/apache/kafka/blob/c254b22a4877e70617b2710b95ef44b8cc55ce97/clients/src/main/java/org/apache/kafka/clients/producer/internals/BufferPool.java#L91] -- This message was sent by Atlassian Jira (v8.20.10#820010)
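Until the documentation is updated, users can enumerate the producer's registered metrics at runtime, which surfaces undocumented metrics such as buffer-exhausted-rate; a minimal sketch, assuming a reachable broker at a placeholder address:
{code:java}
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ListProducerMetrics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every registered metric is visible here, whether or not it is documented.
            producer.metrics().forEach((name, metric) -> {
                if (name.group().equals("producer-metrics")) {
                    System.out.printf("%s = %s%n", name.name(), metric.metricValue());
                }
            });
        }
    }
}
{code}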
[jira] [Comment Edited] (KAFKA-16278) Missing license for scala related dependencies
[ https://issues.apache.org/jira/browse/KAFKA-16278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818562#comment-17818562 ] Divij Vaidya edited comment on KAFKA-16278 at 2/19/24 7:48 PM: --- Sure [~anton.liauchuk]. You should be able to assign this ticket to yourself now. P.S. - In the future, feel free to assign any "unassigned" Jira ticket to yourself (by changing the Assignee) and start working on it. You don't have to ask for permission. was (Author: divijvaidya): Sure [~anton.liauchuk]. You should be able to assign this ticket to yourself now. > Missing license for scala related dependencies > --- > > Key: KAFKA-16278 > URL: https://issues.apache.org/jira/browse/KAFKA-16278 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0, 3.6.1 >Reporter: Divij Vaidya >Assignee: Anton Liauchuk >Priority: Blocker > Labels: newbie > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > We are missing the license for following dependency in > [https://github.com/apache/kafka/blob/b71999be95325f6ea54e925cbe5b426425781014/LICENSE-binary#L261] > > scala-collection-compat_2.12-2.10.0 is missing in license file > scala-java8-compat_2.12-1.0.2 is missing in license file > scala-library-2.12.18 is missing in license file > scala-logging_2.12-3.9.4 is missing in license file > scala-reflect-2.12.18 is missing in license file > The objective of this task is to add these dependencies in the LICENSE-binary > file. > (please backport to 3.6 and 3.7 branches) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16278) Missing license for scala related dependencies
[ https://issues.apache.org/jira/browse/KAFKA-16278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818562#comment-17818562 ] Divij Vaidya commented on KAFKA-16278: -- Sure [~anton.liauchuk]. You should be able to assign this ticket to yourself now. > Missing license for scala related dependencies > --- > > Key: KAFKA-16278 > URL: https://issues.apache.org/jira/browse/KAFKA-16278 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0, 3.6.1 >Reporter: Divij Vaidya >Priority: Blocker > Labels: newbie > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > We are missing the license for following dependency in > [https://github.com/apache/kafka/blob/b71999be95325f6ea54e925cbe5b426425781014/LICENSE-binary#L261] > > scala-collection-compat_2.12-2.10.0 is missing in license file > scala-java8-compat_2.12-1.0.2 is missing in license file > scala-library-2.12.18 is missing in license file > scala-logging_2.12-3.9.4 is missing in license file > scala-reflect-2.12.18 is missing in license file > The objective of this task is to add these dependencies in the LICENSE-binary > file. > (please backport to 3.6 and 3.7 branches) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16278) Missing license for scala related dependencies
[ https://issues.apache.org/jira/browse/KAFKA-16278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16278: - Fix Version/s: 3.8.0 > Missing license for scala related dependencies > --- > > Key: KAFKA-16278 > URL: https://issues.apache.org/jira/browse/KAFKA-16278 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0, 3.6.1 >Reporter: Divij Vaidya >Priority: Blocker > Labels: newbie > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > We are missing the license for following dependency in > [https://github.com/apache/kafka/blob/b71999be95325f6ea54e925cbe5b426425781014/LICENSE-binary#L261] > > scala-collection-compat_2.12-2.10.0 is missing in license file > scala-java8-compat_2.12-1.0.2 is missing in license file > scala-library-2.12.18 is missing in license file > scala-logging_2.12-3.9.4 is missing in license file > scala-reflect-2.12.18 is missing in license file > The objective of this task is to add these dependencies in the LICENSE-binary > file. > (please backport to 3.6 and 3.7 branches) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16278) Missing license for scala related dependencies
[ https://issues.apache.org/jira/browse/KAFKA-16278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16278: - Fix Version/s: 3.6.2 3.7.1 > Missing license for scala related dependencies > --- > > Key: KAFKA-16278 > URL: https://issues.apache.org/jira/browse/KAFKA-16278 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0, 3.6.1 >Reporter: Divij Vaidya >Priority: Blocker > Labels: newbie > Fix For: 3.6.2, 3.7.1 > > > We are missing the license for following dependency in > [https://github.com/apache/kafka/blob/b71999be95325f6ea54e925cbe5b426425781014/LICENSE-binary#L261] > > scala-collection-compat_2.12-2.10.0 is missing in license file > scala-java8-compat_2.12-1.0.2 is missing in license file > scala-library-2.12.18 is missing in license file > scala-logging_2.12-3.9.4 is missing in license file > scala-reflect-2.12.18 is missing in license file > The objective of this task is to add these dependencies in the LICENSE-binary > file. > (please backport to 3.6 and 3.7 branches) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16278) Missing license for scala related dependencies
[ https://issues.apache.org/jira/browse/KAFKA-16278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16278: - Priority: Blocker (was: Major) > Missing license for scala related dependencies > --- > > Key: KAFKA-16278 > URL: https://issues.apache.org/jira/browse/KAFKA-16278 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0, 3.6.1 >Reporter: Divij Vaidya >Priority: Blocker > Labels: newbie > > We are missing the license for following dependency in > [https://github.com/apache/kafka/blob/b71999be95325f6ea54e925cbe5b426425781014/LICENSE-binary#L261] > > scala-collection-compat_2.12-2.10.0 is missing in license file > scala-java8-compat_2.12-1.0.2 is missing in license file > scala-library-2.12.18 is missing in license file > scala-logging_2.12-3.9.4 is missing in license file > scala-reflect-2.12.18 is missing in license file > The objective of this task is to add these dependencies in the LICENSE-binary > file. > (please backport to 3.6 and 3.7 branches) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16278) Missing license for scala related dependencies
[ https://issues.apache.org/jira/browse/KAFKA-16278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16278: - Description: We are missing the license for following dependency in [https://github.com/apache/kafka/blob/b71999be95325f6ea54e925cbe5b426425781014/LICENSE-binary#L261] scala-collection-compat_2.12-2.10.0 is missing in license file scala-java8-compat_2.12-1.0.2 is missing in license file scala-library-2.12.18 is missing in license file scala-logging_2.12-3.9.4 is missing in license file scala-reflect-2.12.18 is missing in license file The objective of this task is to add these dependencies in the LICENSE-binary file. (please backport to 3.6 and 3.7 branches) was: We are missing the license for following dependency in [https://github.com/apache/kafka/blob/b71999be95325f6ea54e925cbe5b426425781014/LICENSE-binary#L261] scala-collection-compat_2.12-2.10.0 is missing in license file scala-java8-compat_2.12-1.0.2 is missing in license file scala-library-2.12.18 is missing in license file scala-logging_2.12-3.9.4 is missing in license file scala-reflect-2.12.18 is missing in license file The objective of this task is to add these dependencies in the LICENSE-binary file. > Missing license for scala related dependencies > --- > > Key: KAFKA-16278 > URL: https://issues.apache.org/jira/browse/KAFKA-16278 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0, 3.6.1 >Reporter: Divij Vaidya >Priority: Major > Labels: newbie > > We are missing the license for following dependency in > [https://github.com/apache/kafka/blob/b71999be95325f6ea54e925cbe5b426425781014/LICENSE-binary#L261] > > scala-collection-compat_2.12-2.10.0 is missing in license file > scala-java8-compat_2.12-1.0.2 is missing in license file > scala-library-2.12.18 is missing in license file > scala-logging_2.12-3.9.4 is missing in license file > scala-reflect-2.12.18 is missing in license file > The objective of this task is to add these dependencies in the LICENSE-binary > file. > (please backport to 3.6 and 3.7 branches) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-12622) Automate LICENSE file validation
[ https://issues.apache.org/jira/browse/KAFKA-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17818559#comment-17818559 ] Divij Vaidya commented on KAFKA-12622: -- *Update for release managers* Please check for correct licenses in both binaries (kafka_2.13 and kafka_2.12). > Automate LICENSE file validation > > > Key: KAFKA-12622 > URL: https://issues.apache.org/jira/browse/KAFKA-12622 > Project: Kafka > Issue Type: Task >Reporter: John Roesler >Priority: Major > Fix For: 3.8.0 > > > In https://issues.apache.org/jira/browse/KAFKA-12602, we manually constructed > a correct license file for 2.8.0. This file will certainly become wrong again > in later releases, so we need to write some kind of script to automate a > check. > It crossed my mind to automate the generation of the file, but it seems to be > an intractable problem, considering that each dependency may change licenses, > may package license files, link to them from their poms, link to them from > their repos, etc. I've also found multiple URLs listed with various > delimiters, broken links that I have to chase down, etc. > Therefore, it seems like the solution to aim for is simply: list all the jars > that we package, and print out a report of each jar that's extra or missing > vs. the ones in our `LICENSE-binary` file. > The check should be part of the release script at least, if not part of the > regular build (so we keep it up to date as dependencies change). > > Here's how I do this manually right now: > {code:java} > // build the binary artifacts > $ ./gradlewAll releaseTarGz > // unpack the binary artifact > $ tar xf core/build/distributions/kafka_2.13-X.Y.Z.tgz > $ cd kafka_2.13-X.Y.Z > // list the packaged jars > // (you can ignore the jars for our own modules, like kafka, kafka-clients, > etc.) > $ ls libs/ > // cross check the jars with the packaged LICENSE > // make sure all dependencies are listed with the right versions > $ cat LICENSE > // also double check all the mentioned license files are present > $ ls licenses {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
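A rough sketch of automating the manual cross-check quoted above — list the packaged jars and report the ones not mentioned in the packaged LICENSE — assuming the release tarball has already been unpacked as in those steps; this illustrates the idea and is not the script the ticket ultimately asks for.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class LicenseCheckSketch {
    public static void main(String[] args) throws IOException {
        // Directory produced by unpacking the release tarball (placeholder name).
        Path unpacked = Path.of(args.length > 0 ? args[0] : "kafka_2.13-X.Y.Z");
        String license = Files.readString(unpacked.resolve("LICENSE"));

        try (Stream<Path> jars = Files.list(unpacked.resolve("libs"))) {
            jars.map(p -> p.getFileName().toString())
                .filter(name -> name.endsWith(".jar"))
                .filter(name -> !name.startsWith("kafka"))          // skip our own modules, as noted above
                .map(name -> name.replaceAll("\\.jar$", ""))        // keep artifact-version for the lookup
                .filter(dep -> !license.contains(dep))
                .forEach(dep -> System.out.println("missing from LICENSE: " + dep));
        }
    }
}
{code}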
[jira] [Created] (KAFKA-16278) Missing license for scala related dependencies
Divij Vaidya created KAFKA-16278: Summary: Missing license for scala related dependencies Key: KAFKA-16278 URL: https://issues.apache.org/jira/browse/KAFKA-16278 Project: Kafka Issue Type: Bug Affects Versions: 3.6.1, 3.7.0 Reporter: Divij Vaidya We are missing the license for following dependency in [https://github.com/apache/kafka/blob/b71999be95325f6ea54e925cbe5b426425781014/LICENSE-binary#L261] scala-collection-compat_2.12-2.10.0 is missing in license file scala-java8-compat_2.12-1.0.2 is missing in license file scala-library-2.12.18 is missing in license file scala-logging_2.12-3.9.4 is missing in license file scala-reflect-2.12.18 is missing in license file The objective of this task is to add these dependencies in the LICENSE-binary file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16239) Clean up references to non-existent IntegrationTestHelper
[ https://issues.apache.org/jira/browse/KAFKA-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya resolved KAFKA-16239. -- Resolution: Fixed > Clean up references to non-existent IntegrationTestHelper > - > > Key: KAFKA-16239 > URL: https://issues.apache.org/jira/browse/KAFKA-16239 > Project: Kafka > Issue Type: Improvement >Reporter: Divij Vaidya >Priority: Minor > Labels: newbie > Fix For: 3.8.0 > > > A bunch of places in the code javadocs and README docs refer to a class called > IntegrationTestHelper. Such a class does not exist. > This task will clean up all references to IntegrationTestHelper from the Kafka > code base. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16239) Clean up references to non-existent IntegrationTestHelper
[ https://issues.apache.org/jira/browse/KAFKA-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16239: - Component/s: unit tests > Clean up references to non-existent IntegrationTestHelper > - > > Key: KAFKA-16239 > URL: https://issues.apache.org/jira/browse/KAFKA-16239 > Project: Kafka > Issue Type: Improvement > Components: unit tests >Reporter: Divij Vaidya >Priority: Minor > Labels: newbie > Fix For: 3.8.0 > > > A bunch of places in the code javadocs and README docs refer to a class called > IntegrationTestHelper. Such a class does not exist. > This task will clean up all references to IntegrationTestHelper from the Kafka > code base. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16239) Clean up references to non-existent IntegrationTestHelper
[ https://issues.apache.org/jira/browse/KAFKA-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16239: - Fix Version/s: 3.8.0 > Clean up references to non-existent IntegrationTestHelper > - > > Key: KAFKA-16239 > URL: https://issues.apache.org/jira/browse/KAFKA-16239 > Project: Kafka > Issue Type: Improvement >Reporter: Divij Vaidya >Priority: Minor > Labels: newbie > Fix For: 3.8.0 > > > A bunch of places in the code javadocs and README docs refer to a class called > IntegrationTestHelper. Such a class does not exist. > This task will clean up all references to IntegrationTestHelper from the Kafka > code base. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14041) Avoid the keyword var for a variable declaration in ConfigTransformer
[ https://issues.apache.org/jira/browse/KAFKA-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya resolved KAFKA-14041. -- Resolution: Fixed > Avoid the keyword var for a variable declaration in ConfigTransformer > - > > Key: KAFKA-14041 > URL: https://issues.apache.org/jira/browse/KAFKA-14041 > Project: Kafka > Issue Type: Improvement > Components: clients >Reporter: QualiteSys QualiteSys >Assignee: Andrew Schofield >Priority: Major > Fix For: 3.8.0 > > > In the file > clients\src\main\java\org\apache\kafka\common\config\ConfigTransformer.java a > variable named var is declared : > line 84 : for (ConfigVariable var : vars) { > Since it is a java keyword, could the variable name be changed ? > Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14041) Avoid the keyword var for a variable declaration in ConfigTransformer
[ https://issues.apache.org/jira/browse/KAFKA-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-14041: - Priority: Minor (was: Major) > Avoid the keyword var for a variable declaration in ConfigTransformer > - > > Key: KAFKA-14041 > URL: https://issues.apache.org/jira/browse/KAFKA-14041 > Project: Kafka > Issue Type: Improvement > Components: clients >Reporter: QualiteSys QualiteSys >Assignee: Andrew Schofield >Priority: Minor > Fix For: 3.8.0 > > > In the file > clients\src\main\java\org\apache\kafka\common\config\ConfigTransformer.java a > variable named var is declared : > line 84 : for (ConfigVariable var : vars) { > Since it is a java keyword, could the variable name be changed ? > Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14041) Avoid the keyword var for a variable declaration in ConfigTransformer
[ https://issues.apache.org/jira/browse/KAFKA-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-14041: - Fix Version/s: 3.8.0 > Avoid the keyword var for a variable declaration in ConfigTransformer > - > > Key: KAFKA-14041 > URL: https://issues.apache.org/jira/browse/KAFKA-14041 > Project: Kafka > Issue Type: Improvement > Components: clients >Reporter: QualiteSys QualiteSys >Assignee: Andrew Schofield >Priority: Major > Fix For: 3.8.0 > > > In the file > clients\src\main\java\org\apache\kafka\common\config\ConfigTransformer.java a > variable named var is declared : > line 84 : for (ConfigVariable var : vars) { > Since it is a java keyword, could the variable name be changed ? > Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-13835) Fix two bugs related to dynamic broker configs in KRaft
[ https://issues.apache.org/jira/browse/KAFKA-13835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816049#comment-17816049 ] Divij Vaidya commented on KAFKA-13835: -- [~cmccabe] the associated PR is merged. Can you please add the correct fix version and resolve this if you feel that this Jira is complete. > Fix two bugs related to dynamic broker configs in KRaft > --- > > Key: KAFKA-13835 > URL: https://issues.apache.org/jira/browse/KAFKA-13835 > Project: Kafka > Issue Type: Bug >Reporter: Colin McCabe >Assignee: Colin McCabe >Priority: Critical > Labels: 4.0-blocker > > The first bug is that we were calling reloadUpdatedFilesWithoutConfigChange > when a topic configuration was changed, but not when a broker configuration > was changed. This was backwards -- this function must be called only for > BROKER configs, and never for TOPIC configs. (Also, this function is called > only for specific broker configs, not for cluster configs.) > The second bug is that there were several configurations such as > `max.connections` which were related to broker listeners, but which did not > involve creating or removing new listeners. We can and should support these > configurations in KRaft, since no additional work is needed to support them. > Only adding or removing listeners is unsupported. This PR adds support for > these by fixing the configuration change validation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-12670) KRaft support for unclean.leader.election.enable
[ https://issues.apache.org/jira/browse/KAFKA-12670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-12670: - Component/s: kraft > KRaft support for unclean.leader.election.enable > > > Key: KAFKA-12670 > URL: https://issues.apache.org/jira/browse/KAFKA-12670 > Project: Kafka > Issue Type: Task > Components: kraft >Reporter: Colin McCabe >Assignee: Ryan Dielhenn >Priority: Major > > Implement KRaft support for the unclean.leader.election.enable > configurations. These configurations can be set at the topic, broker, or > cluster level. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-14349) Support dynamically resizing the KRaft controller's thread pools
[ https://issues.apache.org/jira/browse/KAFKA-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816045#comment-17816045 ] Divij Vaidya edited comment on KAFKA-14349 at 2/9/24 12:52 PM: --- [~cmccabe] might have forgotten to close this since it's still open. As per [https://github.com/apache/kafka/blob/092dc7fc467ed7d354ec504d6939b3fcd7b80632/core/src/main/scala/kafka/server/DynamicBrokerConfig.scala#L298], we are only reconfiguring the IO thread pool and not the network thread pool (I might be wrong). Let's wait for [~cmccabe] to clarify this. was (Author: divijvaidya): [~cmccabe] might have forgotten to close this since it's still open. I am closing this. I have verified that controller thread pools can be dynamically resized as per https://github.com/apache/kafka/blob/092dc7fc467ed7d354ec504d6939b3fcd7b80632/core/src/main/scala/kafka/server/DynamicBrokerConfig.scala#L298 > Support dynamically resizing the KRaft controller's thread pools > > > Key: KAFKA-14349 > URL: https://issues.apache.org/jira/browse/KAFKA-14349 > Project: Kafka > Issue Type: Improvement >Reporter: Colin McCabe >Priority: Major > Labels: 4.0-blocker, kip-500 > > Support dynamically resizing the KRaft controller's request handler and > network handler thread pools. See {{DynamicBrokerConfig.scala}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14349) Support dynamically resizing the KRaft controller's thread pools
[ https://issues.apache.org/jira/browse/KAFKA-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816045#comment-17816045 ] Divij Vaidya commented on KAFKA-14349: -- [~cmccabe] might have forgotten to close this since it's still open. I am closing this. I have verified that controller thread pools can be dynamically resized as per https://github.com/apache/kafka/blob/092dc7fc467ed7d354ec504d6939b3fcd7b80632/core/src/main/scala/kafka/server/DynamicBrokerConfig.scala#L298 > Support dynamically resizing the KRaft controller's thread pools > > > Key: KAFKA-14349 > URL: https://issues.apache.org/jira/browse/KAFKA-14349 > Project: Kafka > Issue Type: Improvement >Reporter: Colin McCabe >Priority: Major > Labels: 4.0-blocker, kip-500 > > Support dynamically resizing the KRaft controller's request handler and > network handler thread pools. See {{DynamicBrokerConfig.scala}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14349) Support dynamically resizing the KRaft controller's thread pools
[ https://issues.apache.org/jira/browse/KAFKA-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816043#comment-17816043 ] Divij Vaidya commented on KAFKA-14349: -- [~cmccabe] can we remove "Modifying certain dynamic configurations on the standalone KRaft controller" from [https://kafka.apache.org/37/documentation.html#kraft_missing] after this JIRA? > Support dynamically resizing the KRaft controller's thread pools > > > Key: KAFKA-14349 > URL: https://issues.apache.org/jira/browse/KAFKA-14349 > Project: Kafka > Issue Type: Improvement >Reporter: Colin McCabe >Priority: Major > Labels: 4.0-blocker, kip-500 > > Support dynamically resizing the KRaft controller's request handler and > network handler thread pools. See {{DynamicBrokerConfig.scala}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
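For context, this is the kind of dynamic resize being discussed, expressed with the Java Admin client against a cluster-wide broker config; the bootstrap address is a placeholder, and whether the KRaft controller's own pools honour such an update is exactly the open question in the comments above.
{code:java}
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ResizeIoThreads {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Empty resource name = cluster-wide default; on brokers this is applied
            // by DynamicBrokerConfig without a restart.
            ConfigResource cluster = new ConfigResource(ConfigResource.Type.BROKER, "");
            AlterConfigOp resize = new AlterConfigOp(
                    new ConfigEntry("num.io.threads", "16"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(cluster, List.of(resize));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
{code}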
[jira] [Commented] (KAFKA-14127) KIP-858: Handle JBOD broker disk failure in KRaft
[ https://issues.apache.org/jira/browse/KAFKA-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17816038#comment-17816038 ] Divij Vaidya commented on KAFKA-14127: -- Hey folks 3.7's documentation still says that JBOD is a missing feature [1] in KRaft. Could we please fix that? [1] https://kafka.apache.org/37/documentation.html#kraft_missing > KIP-858: Handle JBOD broker disk failure in KRaft > - > > Key: KAFKA-14127 > URL: https://issues.apache.org/jira/browse/KAFKA-14127 > Project: Kafka > Issue Type: Improvement > Components: jbod, kraft >Reporter: Igor Soarez >Assignee: Igor Soarez >Priority: Major > Labels: 4.0-blocker, kip-500, kraft > Fix For: 3.7.0 > > > Supporting configurations with multiple storage directories in KRaft mode -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16239) Clean up references to non-existent IntegrationTestHelper
Divij Vaidya created KAFKA-16239: Summary: Clean up references to non-existent IntegrationTestHelper Key: KAFKA-16239 URL: https://issues.apache.org/jira/browse/KAFKA-16239 Project: Kafka Issue Type: Improvement Reporter: Divij Vaidya A bunch of places in the code javadocs and README docs refer to a class called IntegrationTestHelper. Such a class does not exist. This task will clean up all references to IntegrationTestHelper from the Kafka code base. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll
[ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-9693: Fix Version/s: 3.7.0 (was: 3.8.0) > Kafka latency spikes caused by log segment flush on roll > > > Key: KAFKA-9693 > URL: https://issues.apache.org/jira/browse/KAFKA-9693 > Project: Kafka > Issue Type: Improvement > Components: core > Environment: OS: Amazon Linux 2 > Kafka version: 2.2.1 >Reporter: Paolo Moriello >Assignee: Paolo Moriello >Priority: Major > Labels: Performance, latency, performance > Fix For: 3.7.0 > > Attachments: image-2020-03-10-13-17-34-618.png, > image-2020-03-10-14-36-21-807.png, image-2020-03-10-15-00-23-020.png, > image-2020-03-10-15-00-54-204.png, image-2020-06-23-12-24-46-548.png, > image-2020-06-23-12-24-58-788.png, image-2020-06-26-13-43-21-723.png, > image-2020-06-26-13-46-52-861.png, image-2020-06-26-14-06-01-505.png, > latency_plot2.png > > > h1. Summary > When a log segment fills up, Kafka rolls over onto a new active segment and > forces the flush of the old segment to disk. When this happens, the log segment > _append_ duration increases, causing significant latency spikes on producer(s) > and replica(s). This ticket aims to highlight the problem and propose a > simple mitigation: add a new configuration to enable/disable rolled segment > flush. > h1. 1. Phenomenon > Response time of produce requests (99th ~ 99.9th %ile) repeatedly spikes to > ~50x-200x more than usual. For instance, normally the 99th %ile is lower than > 5ms, but when this issue occurs, it reaches 100ms to 200ms. The 99.9th and 99.99th > %iles even jump to 500-700ms. > Latency spikes happen at a constant frequency (depending on the input > throughput), for short periods of time. All the producers experience a > latency increase at the same time. > h1. !image-2020-03-10-13-17-34-618.png|width=942,height=314! > {{Example of the response time plot observed on a single producer.}} > URPs occasionally appear in correspondence with the latency spikes too. This is > harder to reproduce, but from time to time it is possible to see a few > partitions going out of sync in correspondence with a spike. > h1. 2. Experiment > h2. 2.1 Setup > Kafka cluster hosted on AWS EC2 instances. > h4. Cluster > * 15 Kafka brokers: (EC2 m5.4xlarge) > ** Disk: 1100 GB EBS volumes (4750Mbps) > ** Network: 10 Gbps > ** CPU: 16 Intel Xeon Platinum 8000 > ** Memory: 64 GB > * 3 Zookeeper nodes: m5.large > * 6 producers on 6 EC2 instances in the same region > * 1 topic, 90 partitions - replication factor=3 > h4. Broker config > Relevant configurations: > {quote}num.io.threads=8 > num.replica.fetchers=2 > offsets.topic.replication.factor=3 > num.network.threads=5 > num.recovery.threads.per.data.dir=2 > min.insync.replicas=2 > num.partitions=1 > {quote} > h4. Perf Test > * Throughput ~6000-8000 (~40-70Mb/s input + replication = ~120-210Mb/s per > broker) > * record size = 2 > * Acks = 1, linger.ms = 1, compression.type = none > * Test duration: ~20/30min > h2. 2.2 Analysis > Our analysis showed a high +correlation between log segment flush count/rate > and the latency spikes+. This indicates that the spikes in max latency are > related to Kafka's behavior when rolling over new segments. > The other metrics did not show any relevant impact on any hardware component > of the cluster, e.g. CPU, memory, network traffic, disk throughput... > > !latency_plot2.png|width=924,height=308! > {{Correlation between latency spikes and log segment flush count. p50, p95, > p99, p999 and p latencies (left axis, ns) and the flush #count (right > axis, stepping blue line in plot).}} > Kafka schedules log flushing (this includes flushing the file record > containing log entries, the offset index, the timestamp index and the > transaction index) during _roll_ operations. A log is rolled over onto a new > empty log when: > * the log segment is full > * the max time has elapsed since the timestamp of the first message in the > segment (or, in its absence, since the create time) > * the index is full > In this case, the increase in latency happens on _append_ of a new message > set to the active segment of the log. This is a synchronous operation which > therefore blocks producer requests, causing the latency increase. > To confirm this, I instrumented Kafka to measure the duration of the > FileRecords.append(MemoryRecords) method, which is responsible for writing > memory records to file. As a result, I observed the same spiky pattern as in > the producer latency, with a one-to-one correspondence with the append > duration. > !image-2020-03-10-14-36-21-807.png|width=780,height=415! > {{FileRecords.append(MemoryRecords) dur
[jira] [Resolved] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll
[ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya resolved KAFKA-9693. - Resolution: Fixed This performance regression, where producer snapshot fsync leads to high P99 latencies, is fixed by https://issues.apache.org/jira/browse/KAFKA-15046 > Kafka latency spikes caused by log segment flush on roll > > > Key: KAFKA-9693 > URL: https://issues.apache.org/jira/browse/KAFKA-9693 > Project: Kafka > Issue Type: Improvement > Components: core > Environment: OS: Amazon Linux 2 > Kafka version: 2.2.1 >Reporter: Paolo Moriello >Assignee: Paolo Moriello >Priority: Major > Labels: Performance, latency, performance > Fix For: 3.8.0 > > Attachments: image-2020-03-10-13-17-34-618.png, > image-2020-03-10-14-36-21-807.png, image-2020-03-10-15-00-23-020.png, > image-2020-03-10-15-00-54-204.png, image-2020-06-23-12-24-46-548.png, > image-2020-06-23-12-24-58-788.png, image-2020-06-26-13-43-21-723.png, > image-2020-06-26-13-46-52-861.png, image-2020-06-26-14-06-01-505.png, > latency_plot2.png > > > h1. Summary > When a log segment fills up, Kafka rolls over onto a new active segment and > forces the flush of the old segment to disk. When this happens, the log segment > _append_ duration increases, causing significant latency spikes on producer(s) > and replica(s). This ticket aims to highlight the problem and propose a > simple mitigation: add a new configuration to enable/disable rolled segment > flush. > h1. 1. Phenomenon > Response time of produce requests (99th ~ 99.9th %ile) repeatedly spikes to > ~50x-200x more than usual. For instance, normally the 99th %ile is lower than > 5ms, but when this issue occurs, it reaches 100ms to 200ms. The 99.9th and 99.99th > %iles even jump to 500-700ms. > Latency spikes happen at a constant frequency (depending on the input > throughput), for short periods of time. All the producers experience a > latency increase at the same time. > h1. !image-2020-03-10-13-17-34-618.png|width=942,height=314! > {{Example of the response time plot observed on a single producer.}} > URPs occasionally appear in correspondence with the latency spikes too. This is > harder to reproduce, but from time to time it is possible to see a few > partitions going out of sync in correspondence with a spike. > h1. 2. Experiment > h2. 2.1 Setup > Kafka cluster hosted on AWS EC2 instances. > h4. Cluster > * 15 Kafka brokers: (EC2 m5.4xlarge) > ** Disk: 1100 GB EBS volumes (4750Mbps) > ** Network: 10 Gbps > ** CPU: 16 Intel Xeon Platinum 8000 > ** Memory: 64 GB > * 3 Zookeeper nodes: m5.large > * 6 producers on 6 EC2 instances in the same region > * 1 topic, 90 partitions - replication factor=3 > h4. Broker config > Relevant configurations: > {quote}num.io.threads=8 > num.replica.fetchers=2 > offsets.topic.replication.factor=3 > num.network.threads=5 > num.recovery.threads.per.data.dir=2 > min.insync.replicas=2 > num.partitions=1 > {quote} > h4. Perf Test > * Throughput ~6000-8000 (~40-70Mb/s input + replication = ~120-210Mb/s per > broker) > * record size = 2 > * Acks = 1, linger.ms = 1, compression.type = none > * Test duration: ~20/30min > h2. 2.2 Analysis > Our analysis showed a high +correlation between log segment flush count/rate > and the latency spikes+. This indicates that the spikes in max latency are > related to Kafka's behavior when rolling over new segments. > The other metrics did not show any relevant impact on any hardware component > of the cluster, e.g. CPU, memory, network traffic, disk throughput... > > !latency_plot2.png|width=924,height=308! > {{Correlation between latency spikes and log segment flush count. p50, p95, > p99, p999 and p latencies (left axis, ns) and the flush #count (right > axis, stepping blue line in plot).}} > Kafka schedules log flushing (this includes flushing the file record > containing log entries, the offset index, the timestamp index and the > transaction index) during _roll_ operations. A log is rolled over onto a new > empty log when: > * the log segment is full > * the max time has elapsed since the timestamp of the first message in the > segment (or, in its absence, since the create time) > * the index is full > In this case, the increase in latency happens on _append_ of a new message > set to the active segment of the log. This is a synchronous operation which > therefore blocks producer requests, causing the latency increase. > To confirm this, I instrumented Kafka to measure the duration of the > FileRecords.append(MemoryRecords) method, which is responsible for writing > memory records to file. As a result, I observed the same spiky pattern as in > the producer latency, with a one-to-one correspondence with the append
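Editor's note: the KAFKA-9693 report above describes instrumenting the broker to time FileRecords.append(MemoryRecords). The sketch below only illustrates that kind of measurement; the wrapper class, method names and the way the FileRecords and MemoryRecords instances are obtained are assumptions, not code from the ticket or from the Kafka code base.
{code:java}
import org.apache.kafka.common.record.FileRecords;
import org.apache.kafka.common.record.MemoryRecords;

import java.io.IOException;
import java.util.concurrent.TimeUnit;

public final class AppendTimer {

    // Times a single FileRecords.append call and prints the elapsed time in microseconds.
    public static int timedAppend(FileRecords fileRecords, MemoryRecords records) throws IOException {
        long start = System.nanoTime();
        int appendedBytes = fileRecords.append(records);
        long elapsedMicros = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
        System.out.printf("append of %d bytes took %d us%n", appendedBytes, elapsedMicros);
        return appendedBytes;
    }
}
{code}
A spiky distribution of these timings that lines up with segment rolls is exactly the correlation the reporter describes.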
[jira] [Resolved] (KAFKA-16210) Upgrade jose4j to 0.9.4
[ https://issues.apache.org/jira/browse/KAFKA-16210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya resolved KAFKA-16210. -- Resolution: Fixed > Upgrade jose4j to 0.9.4 > --- > > Key: KAFKA-16210 > URL: https://issues.apache.org/jira/browse/KAFKA-16210 > Project: Kafka > Issue Type: Improvement >Reporter: Divij Vaidya >Priority: Major > Fix For: 3.7.0, 3.8.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16210) Upgrade jose4j to 0.9.4
[ https://issues.apache.org/jira/browse/KAFKA-16210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16210: - Fix Version/s: 3.8.0 > Upgrade jose4j to 0.9.4 > --- > > Key: KAFKA-16210 > URL: https://issues.apache.org/jira/browse/KAFKA-16210 > Project: Kafka > Issue Type: Improvement >Reporter: Divij Vaidya >Priority: Major > Fix For: 3.7.0, 3.8.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16210) Upgrade jose4j to 0.9.4
Divij Vaidya created KAFKA-16210: Summary: Upgrade jose4j to 0.9.4 Key: KAFKA-16210 URL: https://issues.apache.org/jira/browse/KAFKA-16210 Project: Kafka Issue Type: Improvement Reporter: Divij Vaidya Fix For: 3.7.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16066) Upgrade apacheds to 2.0.0.AM27
[ https://issues.apache.org/jira/browse/KAFKA-16066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809875#comment-17809875 ] Divij Vaidya commented on KAFKA-16066: -- [~high.lee] please feel free to pick this one up. There has been no activity from previous requester on this Jira for more than 20 days now > Upgrade apacheds to 2.0.0.AM27 > -- > > Key: KAFKA-16066 > URL: https://issues.apache.org/jira/browse/KAFKA-16066 > Project: Kafka > Issue Type: Improvement >Reporter: Divij Vaidya >Priority: Major > Labels: newbie, newbie++ > > We are currently using a very old dependency. Notably, apacheds is only used > for testing when we use MiniKdc, hence, there is nothing stopping us from > upgrading it. > Notably, apacheds has removed the component > org.apache.directory.server:apacheds-protocol-kerberos in favour of using > Apache Kerby, hence, we need to make changes in MiniKdc.scala for this > upgrade to work correctly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
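Editor's note: the KAFKA-16066 description points at Apache Kerby as the replacement for apacheds-protocol-kerberos in MiniKdc.scala. The following is a rough sketch of what a Kerby-based mini KDC could look like; the SimpleKdcServer method names are taken from Kerby's kerb-simplekdc module as best understood and should be treated as assumptions, not a drop-in patch for MiniKdc.
{code:java}
import org.apache.kerby.kerberos.kerb.server.SimpleKdcServer;

import java.io.File;

public class KerbyMiniKdcSketch {
    public static void main(String[] args) throws Exception {
        SimpleKdcServer kdc = new SimpleKdcServer();
        kdc.setKdcRealm("EXAMPLE.COM");   // realm and host are illustrative values
        kdc.setKdcHost("localhost");
        kdc.setAllowUdp(false);
        kdc.init();
        kdc.start();

        // Create a test principal and export its key to a keytab, as MiniKdc users expect.
        File keytab = new File("client.keytab");
        kdc.createAndExportPrincipals(keytab, "client/localhost");

        kdc.stop();
    }
}
{code}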
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Fix Version/s: 3.8.0 > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Fix For: 3.8.0 > > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, Screenshot > 2023-12-28 at 18.44.19.png, Screenshot 2024-01-10 at 14.59.47.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
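Editor's note: since the stock build.gradle shown in the diff above already honours the maxParallelForks and maxScalacThreads project properties, a comparable single-threaded run can likely be obtained without patching the build at all, for example with ./gradlew :core:test -PmaxParallelForks=1 -PmaxScalacThreads=1 --no-parallel (exact flags depend on the Gradle version in use; the patch in the description additionally pins the values in gradle.properties).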
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805163#comment-17805163 ] Divij Vaidya commented on KAFKA-16052: -- On current trunk, the heap "at max" goes to 1200MB compared to 1800MB provided in the description. I have also not seen any OOM for CI for quite a while now based on [https://ge.apache.org/scans/failures?failures.failureClassification=all_failures&failures.failureMessage=Execution%20failed%20for%20task%20%27:tools:test%27.%0A%3E%20Process%20%27Gradle%20Test%20Executor%2096%27%20finished%20with%20non-zero%20exit%20value%201%0A%20%20This%20problem%20might%20be%20caused%20by%20incorrect%20test%20process%20configuration.%0A%20%20For%20more%20on%20test%20execution%2C%20please%20refer%20to%20https:%2F%2Fdocs.gradle.org%2F8.5%2Fuserguide%2Fjava_testing.html%23sec:test_execution%20in%20the%20Gradle%20documentation.&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=Europe%2FBerlin] Also, notice the drastic decrease in number of threads in the test (right graph) due to fixes made here. At this stage, I am resolving this Jira based on the above. We have some future looking tasks at [https://github.com/apache/kafka/pull/15101] to fix this permanently. !Screenshot 2024-01-10 at 14.59.47.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, Screenshot > 2023-12-28 at 18.44.19.png, Screenshot 2024-01-10 at 14.59.47.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? 
maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2024-01-10 at 14.59.47.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, Screenshot > 2023-12-28 at 18.44.19.png, Screenshot 2024-01-10 at 14.59.47.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16088) Not reading active segments when RemoteFetch return Empty Records.
[ https://issues.apache.org/jira/browse/KAFKA-16088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16088: - Labels: tiered-storage (was: ) > Not reading active segments when RemoteFetch return Empty Records. > > > Key: KAFKA-16088 > URL: https://issues.apache.org/jira/browse/KAFKA-16088 > Project: Kafka > Issue Type: Bug > Components: Tiered-Storage >Reporter: Arpit Goyal >Priority: Critical > Labels: tiered-storage > > Please refer this comment for details > https://github.com/apache/kafka/pull/15060#issuecomment-1879657273 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16088) Not reading active segments when RemoteFetch return Empty Records.
[ https://issues.apache.org/jira/browse/KAFKA-16088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16088: - Component/s: Tiered-Storage > Not reading active segments when RemoteFetch return Empty Records. > > > Key: KAFKA-16088 > URL: https://issues.apache.org/jira/browse/KAFKA-16088 > Project: Kafka > Issue Type: Bug > Components: Tiered-Storage >Reporter: Arpit Goyal >Priority: Critical > > Please refer this comment for details > https://github.com/apache/kafka/pull/15060#issuecomment-1879657273 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16059) Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test
[ https://issues.apache.org/jira/browse/KAFKA-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16059: - Fix Version/s: 3.7.0 3.8.0 > Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test > -- > > Key: KAFKA-16059 > URL: https://issues.apache.org/jira/browse/KAFKA-16059 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Fix For: 3.7.0, 3.8.0 > > Attachments: Screenshot 2023-12-29 at 11.13.01.png > > > We are leaking hundreds of ExpirationReaper-1-AlterAcls threads in one of the > tests in :core:test > {code:java} > "ExpirationReaper-1-AlterAcls" prio=0 tid=0x0 nid=0x0 waiting on condition > java.lang.Thread.State: TIMED_WAITING > on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3688fc67 > at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252) > at > java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672) > at > java.base@17.0.9/java.util.concurrent.DelayQueue.poll(DelayQueue.java:265) > at > app//org.apache.kafka.server.util.timer.SystemTimer.advanceClock(SystemTimer.java:87) > at > app//kafka.server.DelayedOperationPurgatory.advanceClock(DelayedOperation.scala:418) > at > app//kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper.doWork(DelayedOperation.scala:444) > at > app//org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) > {code} > The objective of this Jira is to identify the test and fix this leak -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16074) Fix thread leaks in ReplicaManagerTest
[ https://issues.apache.org/jira/browse/KAFKA-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801744#comment-17801744 ] Divij Vaidya commented on KAFKA-16074: -- https://github.com/apache/kafka/pull/15077 > Fix thread leaks in ReplicaManagerTest > -- > > Key: KAFKA-16074 > URL: https://issues.apache.org/jira/browse/KAFKA-16074 > Project: Kafka > Issue Type: Sub-task >Reporter: Luke Chen >Assignee: Luke Chen >Priority: Major > > Following [@dajac|https://github.com/dajac] 's finding in > [#15063|https://github.com/apache/kafka/pull/15063], I found we also create > new RemoteLogManager in ReplicaManagerTest, but didn't close them. > While investigating ReplicaManagerTest, I also found there are other threads > leaking: > # remote fetch reaper thread. It's because we create a reaper thread in > test, which is not expected. We should create a mocked one like other > purgatory instance. > # Throttle threads. We created a {{quotaManager}} to feed into the > replicaManager, but didn't close it. Actually, we have created a global > {{quotaManager}} instance and will close it on {{{}AfterEach{}}}. We should > re-use it. > # replicaManager and logManager didn't invoke {{close}} after test. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16074) Fix thread leaks in ReplicaManagerTest
Divij Vaidya created KAFKA-16074: Summary: Fix thread leaks in ReplicaManagerTest Key: KAFKA-16074 URL: https://issues.apache.org/jira/browse/KAFKA-16074 Project: Kafka Issue Type: Sub-task Reporter: Luke Chen Assignee: Luke Chen Following [@dajac|https://github.com/dajac] 's finding in [#15063|https://github.com/apache/kafka/pull/15063], I found we also create new RemoteLogManager in ReplicaManagerTest, but didn't close them. While investigating ReplicaManagerTest, I also found there are other threads leaking: # remote fetch reaper thread. It's because we create a reaper thread in test, which is not expected. We should create a mocked one like other purgatory instance. # Throttle threads. We created a {{quotaManager}} to feed into the replicaManager, but didn't close it. Actually, we have created a global {{quotaManager}} instance and will close it on {{{}AfterEach{}}}. We should re-use it. # replicaManager and logManager didn't invoke {{close}} after test. -- This message was sent by Atlassian Jira (v8.20.10#820010)
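Editor's note: the clean-up KAFKA-16074 asks for boils down to closing everything the test creates that spawns threads. A minimal Java-style sketch of that pattern is below; the real ReplicaManagerTest is Scala and its fields are wired differently, so the names here are placeholders.
{code:java}
import org.apache.kafka.common.utils.Utils;
import org.junit.jupiter.api.AfterEach;

class ReplicaManagerStyleTest {
    private AutoCloseable remoteLogManager;   // created per test in some cases
    private AutoCloseable quotaManager;       // better: reuse one shared instance, as the ticket notes
    private AutoCloseable replicaManager;
    private AutoCloseable logManager;

    @AfterEach
    void tearDown() {
        // Close in roughly reverse creation order so no component outlives its dependencies.
        Utils.closeQuietly(replicaManager, "replicaManager");
        Utils.closeQuietly(logManager, "logManager");
        Utils.closeQuietly(remoteLogManager, "remoteLogManager");
        Utils.closeQuietly(quotaManager, "quotaManager");
    }
}
{code}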
[jira] [Updated] (KAFKA-16072) Create Junit 5 extension to detect thread leak
[ https://issues.apache.org/jira/browse/KAFKA-16072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16072: - Fix Version/s: 3.8.0 > Create Junit 5 extension to detect thread leak > -- > > Key: KAFKA-16072 > URL: https://issues.apache.org/jira/browse/KAFKA-16072 > Project: Kafka > Issue Type: Improvement > Components: unit tests >Reporter: Divij Vaidya >Assignee: Dmitry Werner >Priority: Major > Labels: newbie++ > Fix For: 3.8.0 > > > The objective of this task is to create a Junit extension that will execute > after every test and verify that there are no lingering threads left over. > An example of how to create an extension can be found here: > [https://github.com/apache/kafka/pull/14783/files#diff-812cfc2780b6fc0e7a1648ff37912ff13aeda4189ea6b0d4d847b831f66e56d1] > An example on how to find unexpected threads is at > [https://github.com/apache/kafka/blob/d5aa341a185f4df23bf587e55bcda4f16fc511f1/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L2427] > and also at > https://issues.apache.org/jira/browse/KAFKA-16052?focusedCommentId=17800978&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17800978 > -- This message was sent by Atlassian Jira (v8.20.10#820010)
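Editor's note: a minimal sketch of the extension idea described in KAFKA-16072 follows. It snapshots live thread names before each test and fails the test if new, still-alive threads remain afterwards. This is illustrative only, not the extension that was eventually contributed, and the ignored thread-name prefixes are placeholders.
{code:java}
import org.junit.jupiter.api.extension.AfterEachCallback;
import org.junit.jupiter.api.extension.BeforeEachCallback;
import org.junit.jupiter.api.extension.ExtensionContext;

import java.util.Set;
import java.util.stream.Collectors;

public class ThreadLeakExtension implements BeforeEachCallback, AfterEachCallback {
    private Set<String> before;

    private static Set<String> liveThreadNames() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(Thread::isAlive)
                .map(Thread::getName)
                .collect(Collectors.toSet());
    }

    @Override
    public void beforeEach(ExtensionContext context) {
        before = liveThreadNames();
    }

    @Override
    public void afterEach(ExtensionContext context) {
        Set<String> leaked = liveThreadNames();
        leaked.removeAll(before);
        // Placeholder allow-list for threads owned by the JVM or the test framework.
        leaked.removeIf(name -> name.startsWith("junit-") || name.startsWith("Reference Handler"));
        if (!leaked.isEmpty()) {
            throw new AssertionError("Potentially leaked threads after "
                    + context.getDisplayName() + ": " + leaked);
        }
    }
}
{code}
Such an extension would typically be registered per test class with @ExtendWith(ThreadLeakExtension.class).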
[jira] [Commented] (KAFKA-16063) Fix leaked ApplicationShutdownHooks in EndToEndAuthorizationTests
[ https://issues.apache.org/jira/browse/KAFKA-16063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801577#comment-17801577 ] Divij Vaidya commented on KAFKA-16063: -- Nice find and yes, disabling shutdown hook sounds like a plan. I am curious though, if the map is not cleared during stop function, then who clears it? As an alternative solution, should we instead clear the map in stop function? This would ensure that even if we forget to call stop, on process exit, kdc will stop definitely due to the hooks. > Fix leaked ApplicationShutdownHooks in EndToEndAuthorizationTests > - > > Key: KAFKA-16063 > URL: https://issues.apache.org/jira/browse/KAFKA-16063 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Arpit Goyal >Priority: Major > Attachments: Screenshot 2023-12-29 at 12.38.29.png, Screenshot > 2023-12-31 at 7.19.25 AM.png, Screenshot 2024-01-01 at 10.51.03 PM-1.png, > Screenshot 2024-01-01 at 10.51.03 PM.png > > > All test extending `EndToEndAuthorizationTest` are leaking > DefaultDirectoryService objects. > This can be observed using the heap dump at > [https://www.dropbox.com/scl/fi/4jaq8rowkmijaoj7ec1nm/GradleWorkerMain_10311_27_12_2023_13_37_08_Leak_Suspects.zip?rlkey=minkbvopb0c65m5wryqw234xb&dl=0] > (unzip this and you will find a hprof which can be opened with your > favourite heap analyzer) > The stack trace looks like this: > !Screenshot 2023-12-29 at 12.38.29.png! > > I suspect that the reason is because DefaultDirectoryService#startup() > registers a shutdownhook which is somehow messed up by > QuorumTestHarness#teardown(). > We need to investigate why this is leaking and fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
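Editor's note: the alternative discussed in the comment above (clear the hook on stop rather than disable it) looks roughly like the following. DirectoryServiceLike is a stand-in class, not apacheds API; only the Runtime shutdown-hook calls are real JDK methods.
{code:java}
public class DirectoryServiceLike {
    private Thread shutdownHook;

    public synchronized void startup() {
        shutdownHook = new Thread(this::doStop, "directory-service-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(shutdownHook);
    }

    public synchronized void stop() {
        if (shutdownHook != null) {
            // Deregister the hook so the JVM's ApplicationShutdownHooks map does not keep
            // one entry per test; removal during JVM shutdown throws IllegalStateException.
            try {
                Runtime.getRuntime().removeShutdownHook(shutdownHook);
            } catch (IllegalStateException ignored) {
            }
            shutdownHook = null;
        }
        doStop();
    }

    private void doStop() {
        // Release sockets, caches, working directories, etc.
    }
}
{code}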
[jira] [Updated] (KAFKA-16072) Create Junit 5 extension to detect thread leak
[ https://issues.apache.org/jira/browse/KAFKA-16072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16072: - Summary: Create Junit 5 extension to detect thread leak (was: Create Mockito extension to detect thread leak) > Create Junit 5 extension to detect thread leak > -- > > Key: KAFKA-16072 > URL: https://issues.apache.org/jira/browse/KAFKA-16072 > Project: Kafka > Issue Type: Improvement > Components: unit tests >Reporter: Divij Vaidya >Priority: Major > Labels: newbie++ > > The objective of this task is to create a Mockito extension that will execute > after every test and verify that there are no lingering threads left over. > An example of how to create a Mockito extension can be found here: > [https://github.com/apache/kafka/pull/14783/files#diff-812cfc2780b6fc0e7a1648ff37912ff13aeda4189ea6b0d4d847b831f66e56d1] > An example on how to find unexpected threads is at > [https://github.com/apache/kafka/blob/d5aa341a185f4df23bf587e55bcda4f16fc511f1/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L2427] > and also at > https://issues.apache.org/jira/browse/KAFKA-16052?focusedCommentId=17800978&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17800978 > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16072) Create Junit 5 extension to detect thread leak
[ https://issues.apache.org/jira/browse/KAFKA-16072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801465#comment-17801465 ] Divij Vaidya commented on KAFKA-16072: -- Ah yes! Corrected it in the description. > Create Junit 5 extension to detect thread leak > -- > > Key: KAFKA-16072 > URL: https://issues.apache.org/jira/browse/KAFKA-16072 > Project: Kafka > Issue Type: Improvement > Components: unit tests >Reporter: Divij Vaidya >Priority: Major > Labels: newbie++ > > The objective of this task is to create a Junit extension that will execute > after every test and verify that there are no lingering threads left over. > An example of how to create an extension can be found here: > [https://github.com/apache/kafka/pull/14783/files#diff-812cfc2780b6fc0e7a1648ff37912ff13aeda4189ea6b0d4d847b831f66e56d1] > An example on how to find unexpected threads is at > [https://github.com/apache/kafka/blob/d5aa341a185f4df23bf587e55bcda4f16fc511f1/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L2427] > and also at > https://issues.apache.org/jira/browse/KAFKA-16052?focusedCommentId=17800978&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17800978 > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16072) Create Junit 5 extension to detect thread leak
[ https://issues.apache.org/jira/browse/KAFKA-16072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16072: - Description: The objective of this task is to create a Junit extension that will execute after every test and verify that there are no lingering threads left over. An example of how to create an extension can be found here: [https://github.com/apache/kafka/pull/14783/files#diff-812cfc2780b6fc0e7a1648ff37912ff13aeda4189ea6b0d4d847b831f66e56d1] An example on how to find unexpected threads is at [https://github.com/apache/kafka/blob/d5aa341a185f4df23bf587e55bcda4f16fc511f1/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L2427] and also at https://issues.apache.org/jira/browse/KAFKA-16052?focusedCommentId=17800978&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17800978 was: The objective of this task is to create a Mockito extension that will execute after every test and verify that there are no lingering threads left over. An example of how to create a Mockito extension can be found here: [https://github.com/apache/kafka/pull/14783/files#diff-812cfc2780b6fc0e7a1648ff37912ff13aeda4189ea6b0d4d847b831f66e56d1] An example on how to find unexpected threads is at [https://github.com/apache/kafka/blob/d5aa341a185f4df23bf587e55bcda4f16fc511f1/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L2427] and also at https://issues.apache.org/jira/browse/KAFKA-16052?focusedCommentId=17800978&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17800978 > Create Junit 5 extension to detect thread leak > -- > > Key: KAFKA-16072 > URL: https://issues.apache.org/jira/browse/KAFKA-16072 > Project: Kafka > Issue Type: Improvement > Components: unit tests >Reporter: Divij Vaidya >Priority: Major > Labels: newbie++ > > The objective of this task is to create a Junit extension that will execute > after every test and verify that there are no lingering threads left over. > An example of how to create an extension can be found here: > [https://github.com/apache/kafka/pull/14783/files#diff-812cfc2780b6fc0e7a1648ff37912ff13aeda4189ea6b0d4d847b831f66e56d1] > An example on how to find unexpected threads is at > [https://github.com/apache/kafka/blob/d5aa341a185f4df23bf587e55bcda4f16fc511f1/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L2427] > and also at > https://issues.apache.org/jira/browse/KAFKA-16052?focusedCommentId=17800978&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17800978 > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16072) Create Mockito extension to detect thread leak
Divij Vaidya created KAFKA-16072: Summary: Create Mockito extension to detect thread leak Key: KAFKA-16072 URL: https://issues.apache.org/jira/browse/KAFKA-16072 Project: Kafka Issue Type: Improvement Components: unit tests Reporter: Divij Vaidya The objective of this task is to create a Mockito extension that will execute after every test and verify that there are no lingering threads left over. An example of how to create a Mockito extension can be found here: [https://github.com/apache/kafka/pull/14783/files#diff-812cfc2780b6fc0e7a1648ff37912ff13aeda4189ea6b0d4d847b831f66e56d1] An example on how to find unexpected threads is at [https://github.com/apache/kafka/blob/d5aa341a185f4df23bf587e55bcda4f16fc511f1/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L2427] and also at https://issues.apache.org/jira/browse/KAFKA-16052?focusedCommentId=17800978&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17800978 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16063) Fix leaked ApplicationShutdownHooks in EndToEndAuthorizationTests
[ https://issues.apache.org/jira/browse/KAFKA-16063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801440#comment-17801440 ] Divij Vaidya commented on KAFKA-16063: -- Are you connecting the profiler to the right process? The low CPU and heap is both fishy in your profile. The process should be named GradleWorkerMain. Note that this process starts only "after" :core:test begins to run tests. Before that it performs compilation etc. > Fix leaked ApplicationShutdownHooks in EndToEndAuthorizationTests > - > > Key: KAFKA-16063 > URL: https://issues.apache.org/jira/browse/KAFKA-16063 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Arpit Goyal >Priority: Major > Attachments: Screenshot 2023-12-29 at 12.38.29.png, Screenshot > 2023-12-31 at 7.19.25 AM.png > > > All test extending `EndToEndAuthorizationTest` are leaking > DefaultDirectoryService objects. > This can be observed using the heap dump at > [https://www.dropbox.com/scl/fi/4jaq8rowkmijaoj7ec1nm/GradleWorkerMain_10311_27_12_2023_13_37_08_Leak_Suspects.zip?rlkey=minkbvopb0c65m5wryqw234xb&dl=0] > (unzip this and you will find a hprof which can be opened with your > favourite heap analyzer) > The stack trace looks like this: > !Screenshot 2023-12-29 at 12.38.29.png! > > I suspect that the reason is because DefaultDirectoryService#startup() > registers a shutdownhook which is somehow messed up by > QuorumTestHarness#teardown(). > We need to investigate why this is leaking and fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14928) Metrics collection contends on lock with log cleaning
[ https://issues.apache.org/jira/browse/KAFKA-14928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-14928: - Fix Version/s: (was: 3.8.0) > Metrics collection contends on lock with log cleaning > - > > Key: KAFKA-14928 > URL: https://issues.apache.org/jira/browse/KAFKA-14928 > Project: Kafka > Issue Type: Bug >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > > In LogCleanerManager.scala, calculation of a metric requires a lock [1]. This > same lock is required by core log cleaner functionality such as > "grabFilthiestCompactedLog". This might lead to a situation where metric > calculation holding the lock for an extended period of time may affect the > core functionality of log cleaning. > This outcome of this task is to prevent expensive metric calculation from > blocking log cleaning/compaction activity. > [1] > https://github.com/apache/kafka/blob/dd63d88ac3ea7a9a55a6dacf9c5473e939322a55/core/src/main/scala/kafka/log/LogCleanerManager.scala#L102 -- This message was sent by Atlassian Jira (v8.20.10#820010)
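Editor's note: one way to achieve what KAFKA-14928 asks for is to have the metric read a cached value that the cleaner thread refreshes, so a metrics poll never has to take the lock that grabFilthiestCompactedLog needs. The sketch below is illustrative only; the real LogCleanerManager is Scala and the names here are hypothetical.
{code:java}
import java.util.concurrent.atomic.AtomicLong;

public class CleanerMetricsSketch {
    private final Object cleanerLock = new Object();
    private final AtomicLong cachedUncleanableBytes = new AtomicLong(0L);

    // Called from the cleaner thread, which already takes the lock for its own work.
    public void refreshMetrics() {
        synchronized (cleanerLock) {
            cachedUncleanableBytes.set(computeUncleanableBytesLocked());
        }
    }

    // Called by the metrics reporter; never blocks on the cleaner lock.
    public long uncleanableBytesGauge() {
        return cachedUncleanableBytes.get();
    }

    private long computeUncleanableBytesLocked() {
        return 0L; // placeholder for the expensive per-partition walk
    }
}
{code}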
[jira] [Commented] (KAFKA-16066) Upgrade apacheds to 2.0.0.AM27
[ https://issues.apache.org/jira/browse/KAFKA-16066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801394#comment-17801394 ] Divij Vaidya commented on KAFKA-16066: -- Sure [~anishlukk123] please go ahead. You should be able to assign it to yourself. > Upgrade apacheds to 2.0.0.AM27 > -- > > Key: KAFKA-16066 > URL: https://issues.apache.org/jira/browse/KAFKA-16066 > Project: Kafka > Issue Type: Improvement >Reporter: Divij Vaidya >Priority: Major > Labels: newbie, newbie++ > > We are currently using a very old dependency. Notably, apacheds is only used > for testing when we use MiniKdc, hence, there is nothing stopping us from > upgrading it. > Notably, apacheds has removed the component > org.apache.directory.server:apacheds-protocol-kerberos in favour of using > Apache Kerby, hence, we need to make changes in MiniKdc.scala for this > upgrade to work correctly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16064) Improve ControllerApisTest
[ https://issues.apache.org/jira/browse/KAFKA-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16064: - Fix Version/s: 3.8.0 > Improve ControllerApisTest > -- > > Key: KAFKA-16064 > URL: https://issues.apache.org/jira/browse/KAFKA-16064 > Project: Kafka > Issue Type: Test >Reporter: Luke Chen >Assignee: Dmitry Werner >Priority: Major > Labels: newbie, newbie++ > Fix For: 3.8.0 > > > It's usually more robust to automatically handle clean-up during tearDown by > instrumenting the create method so that it keeps track of all creations. > > context: > https://github.com/apache/kafka/pull/15084#issuecomment-1871302733 -- This message was sent by Atlassian Jira (v8.20.10#820010)
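Editor's note: the "instrument the create method so that it keeps track of all creations" idea from KAFKA-16064 can be sketched as below. This is a generic illustration; the actual ControllerApisTest objects and their close semantics are assumptions here.
{code:java}
import org.junit.jupiter.api.AfterEach;

import java.util.ArrayList;
import java.util.List;

class TrackedCreationTestBase {
    private final List<AutoCloseable> created = new ArrayList<>();

    // Tests obtain resources through this helper instead of calling constructors directly,
    // so every creation is automatically remembered for clean-up.
    protected <T extends AutoCloseable> T track(T resource) {
        created.add(resource);
        return resource;
    }

    @AfterEach
    void tearDown() throws Exception {
        // Close in reverse creation order, then forget everything for the next test.
        for (int i = created.size() - 1; i >= 0; i--) {
            created.get(i).close();
        }
        created.clear();
    }
}
{code}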
[jira] [Updated] (KAFKA-6527) Transient failure in DynamicBrokerReconfigurationTest.testDefaultTopicConfig
[ https://issues.apache.org/jira/browse/KAFKA-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-6527: Labels: flakey flaky-test (was: flakey) > Transient failure in DynamicBrokerReconfigurationTest.testDefaultTopicConfig > > > Key: KAFKA-6527 > URL: https://issues.apache.org/jira/browse/KAFKA-6527 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Priority: Blocker > Labels: flakey, flaky-test > Fix For: 3.8.0 > > > {code:java} > java.lang.AssertionError: Log segment size increase not applied > at kafka.utils.TestUtils$.fail(TestUtils.scala:355) > at kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:865) > at > kafka.server.DynamicBrokerReconfigurationTest.testDefaultTopicConfig(DynamicBrokerReconfigurationTest.scala:348) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16062) Upgrade mockito to 5.8.0
[ https://issues.apache.org/jira/browse/KAFKA-16062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16062: - Fix Version/s: 3.7.0 > Upgrade mockito to 5.8.0 > > > Key: KAFKA-16062 > URL: https://issues.apache.org/jira/browse/KAFKA-16062 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Fix For: 3.7.0, 3.8.0 > > > Upgrading to use the latest version of mockito. Updated from 5.5.0 to 5.8.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16065) Fix leak in DelayedOperationTest
[ https://issues.apache.org/jira/browse/KAFKA-16065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16065: - Fix Version/s: 3.6.2 > Fix leak in DelayedOperationTest > > > Key: KAFKA-16065 > URL: https://issues.apache.org/jira/browse/KAFKA-16065 > Project: Kafka > Issue Type: Sub-task >Reporter: Luke Chen >Assignee: Luke Chen >Priority: Major > Fix For: 3.7.0, 3.6.2, 3.8.0 > > > Fix leak in DelayedOperationTest. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16065) Fix leak in DelayedOperationTest
[ https://issues.apache.org/jira/browse/KAFKA-16065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16065: - Fix Version/s: 3.7.0 3.8.0 > Fix leak in DelayedOperationTest > > > Key: KAFKA-16065 > URL: https://issues.apache.org/jira/browse/KAFKA-16065 > Project: Kafka > Issue Type: Sub-task >Reporter: Luke Chen >Assignee: Luke Chen >Priority: Major > Fix For: 3.7.0, 3.8.0 > > > Fix leak in DelayedOperationTest. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16062) Upgrade mockito to 5.8.0
[ https://issues.apache.org/jira/browse/KAFKA-16062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16062: - Fix Version/s: 3.8.0 > Upgrade mockito to 5.8.0 > > > Key: KAFKA-16062 > URL: https://issues.apache.org/jira/browse/KAFKA-16062 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Fix For: 3.8.0 > > > Upgrading to use the latest version of mockito. Updated from 5.5.0 to 5.8.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16059) Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test
[ https://issues.apache.org/jira/browse/KAFKA-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801185#comment-17801185 ] Divij Vaidya commented on KAFKA-16059: -- I found that there are actually leaked threads in KafkaApisTest as well. Have started a PR associated with this Jira to fix that. > Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test > -- > > Key: KAFKA-16059 > URL: https://issues.apache.org/jira/browse/KAFKA-16059 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-29 at 11.13.01.png > > > We are leaking hundreds of ExpirationReaper-1-AlterAcls threads in one of the > tests in :core:test > {code:java} > "ExpirationReaper-1-AlterAcls" prio=0 tid=0x0 nid=0x0 waiting on condition > java.lang.Thread.State: TIMED_WAITING > on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3688fc67 > at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252) > at > java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672) > at > java.base@17.0.9/java.util.concurrent.DelayQueue.poll(DelayQueue.java:265) > at > app//org.apache.kafka.server.util.timer.SystemTimer.advanceClock(SystemTimer.java:87) > at > app//kafka.server.DelayedOperationPurgatory.advanceClock(DelayedOperation.scala:418) > at > app//kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper.doWork(DelayedOperation.scala:444) > at > app//org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) > {code} > The objective of this Jira is to identify the test and fix this leak -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-16059) Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test
[ https://issues.apache.org/jira/browse/KAFKA-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya reassigned KAFKA-16059: Assignee: Divij Vaidya (was: Arpit Goyal) > Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test > -- > > Key: KAFKA-16059 > URL: https://issues.apache.org/jira/browse/KAFKA-16059 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-29 at 11.13.01.png > > > We are leaking hundreds of ExpirationReaper-1-AlterAcls threads in one of the > tests in :core:test > {code:java} > "ExpirationReaper-1-AlterAcls" prio=0 tid=0x0 nid=0x0 waiting on condition > java.lang.Thread.State: TIMED_WAITING > on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3688fc67 > at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252) > at > java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672) > at > java.base@17.0.9/java.util.concurrent.DelayQueue.poll(DelayQueue.java:265) > at > app//org.apache.kafka.server.util.timer.SystemTimer.advanceClock(SystemTimer.java:87) > at > app//kafka.server.DelayedOperationPurgatory.advanceClock(DelayedOperation.scala:418) > at > app//kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper.doWork(DelayedOperation.scala:444) > at > app//org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) > {code} > The objective of this Jira is to identify the test and fix this leak -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-16064) improve ControllerApiTest
[ https://issues.apache.org/jira/browse/KAFKA-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya reassigned KAFKA-16064: Assignee: Dmitry (was: Dmitry) > improve ControllerApiTest > - > > Key: KAFKA-16064 > URL: https://issues.apache.org/jira/browse/KAFKA-16064 > Project: Kafka > Issue Type: Test >Reporter: Luke Chen >Assignee: Dmitry >Priority: Major > Labels: newbie, newbie++ > > It's usually more robust to automatically handle clean-up during tearDown by > instrumenting the create method so that it keeps track of all creations. > > context: > https://github.com/apache/kafka/pull/15084#issuecomment-1871302733 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16066) Upgrade apacheds to 2.0.0.AM27
[ https://issues.apache.org/jira/browse/KAFKA-16066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16066: - Labels: newbie newbie++ (was: newbie) > Upgrade apacheds to 2.0.0.AM27 > -- > > Key: KAFKA-16066 > URL: https://issues.apache.org/jira/browse/KAFKA-16066 > Project: Kafka > Issue Type: Improvement >Reporter: Divij Vaidya >Priority: Major > Labels: newbie, newbie++ > > We are currently using a very old dependency. Notably, apacheds is only used > for testing when we use MiniKdc, hence, there is nothing stopping us from > upgrading it. > Notably, apacheds has removed the component > org.apache.directory.server:apacheds-protocol-kerberos in favour of using > Apache Kerby, hence, we need to make changes in MiniKdc.scala for this > upgrade to work correctly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16064) improve ControllerApiTest
[ https://issues.apache.org/jira/browse/KAFKA-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801161#comment-17801161 ] Divij Vaidya commented on KAFKA-16064: -- Hey [~javakillah] You should be able to assign JIRAs to yourself now. I have meanwhile assigned this to you. > improve ControllerApiTest > - > > Key: KAFKA-16064 > URL: https://issues.apache.org/jira/browse/KAFKA-16064 > Project: Kafka > Issue Type: Test >Reporter: Luke Chen >Assignee: Dmitry >Priority: Major > Labels: newbie, newbie++ > > It's usually more robust to automatically handle clean-up during tearDown by > instrumenting the create method so that it keeps track of all creations. > > context: > https://github.com/apache/kafka/pull/15084#issuecomment-1871302733 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-16064) improve ControllerApiTest
[ https://issues.apache.org/jira/browse/KAFKA-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya reassigned KAFKA-16064: Assignee: Dmitry > improve ControllerApiTest > - > > Key: KAFKA-16064 > URL: https://issues.apache.org/jira/browse/KAFKA-16064 > Project: Kafka > Issue Type: Test >Reporter: Luke Chen >Assignee: Dmitry >Priority: Major > Labels: newbie, newbie++ > > It's usually more robust to automatically handle clean-up during tearDown by > instrumenting the create method so that it keeps track of all creations. > > context: > https://github.com/apache/kafka/pull/15084#issuecomment-1871302733 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16066) Upgrade apacheds to 2.0.0.AM27
Divij Vaidya created KAFKA-16066: Summary: Upgrade apacheds to 2.0.0.AM27 Key: KAFKA-16066 URL: https://issues.apache.org/jira/browse/KAFKA-16066 Project: Kafka Issue Type: Improvement Reporter: Divij Vaidya We are currently using a very old dependency. Notably, apacheds is only used for testing when we use MiniKdc, hence, there is nothing stopping us from upgrading it. Notably, apacheds has removed the component org.apache.directory.server:apacheds-protocol-kerberos in favour of using Apache Kerby, hence, we need to make changes in MiniKdc.scala for this upgrade to work correctly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801153#comment-17801153 ] Divij Vaidya commented on KAFKA-16052: -- [~jolshan] [~showuon] - I found another culprit - https://issues.apache.org/jira/browse/KAFKA-16063 > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, Screenshot > 2023-12-28 at 18.44.19.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16063) Fix leaked ApplicationShutdownHooks in EndToEndAuthorizationTests
[ https://issues.apache.org/jira/browse/KAFKA-16063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801152#comment-17801152 ] Divij Vaidya commented on KAFKA-16063: -- All tests that are using "SaslSetup" are leaking these objects "ApacheDS Shutdown Hook" > Fix leaked ApplicationShutdownHooks in EndToEndAuthorizationTests > - > > Key: KAFKA-16063 > URL: https://issues.apache.org/jira/browse/KAFKA-16063 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-29 at 12.38.29.png > > > All test extending `EndToEndAuthorizationTest` are leaking > DefaultDirectoryService objects. > This can be observed using the heap dump at > [https://www.dropbox.com/scl/fi/4jaq8rowkmijaoj7ec1nm/GradleWorkerMain_10311_27_12_2023_13_37_08_Leak_Suspects.zip?rlkey=minkbvopb0c65m5wryqw234xb&dl=0] > (unzip this and you will find a hprof which can be opened with your > favourite heap analyzer) > The stack trace looks like this: > !Screenshot 2023-12-29 at 12.38.29.png! > > I suspect that the reason is because DefaultDirectoryService#startup() > registers a shutdownhook which is somehow messed up by > QuorumTestHarness#teardown(). > We need to investigate why this is leaking and fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16063) Fix leaked ApplicationShutdownHooks in EndToEndAuthorizationTests
Divij Vaidya created KAFKA-16063: Summary: Fix leaked ApplicationShutdownHooks in EndToEndAuthorizationTests Key: KAFKA-16063 URL: https://issues.apache.org/jira/browse/KAFKA-16063 Project: Kafka Issue Type: Sub-task Reporter: Divij Vaidya Attachments: Screenshot 2023-12-29 at 12.38.29.png All test extending `EndToEndAuthorizationTest` are leaking DefaultDirectoryService objects. This can be observed using the heap dump at [https://www.dropbox.com/scl/fi/4jaq8rowkmijaoj7ec1nm/GradleWorkerMain_10311_27_12_2023_13_37_08_Leak_Suspects.zip?rlkey=minkbvopb0c65m5wryqw234xb&dl=0] (unzip this and you will find a hprof which can be opened with your favourite heap analyzer) The stack trace looks like this: !Screenshot 2023-12-29 at 12.38.29.png! I suspect that the reason is because DefaultDirectoryService#startup() registers a shutdownhook which is somehow messed up by QuorumTestHarness#teardown(). We need to investigate why this is leaking and fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16063) Fix leaked ApplicationShutdownHooks in EndToEndAuthorizationTests
[ https://issues.apache.org/jira/browse/KAFKA-16063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801147#comment-17801147 ] Divij Vaidya commented on KAFKA-16063: -- [~goyarpit] if you are interested, you can pick this one. > Fix leaked ApplicationShutdownHooks in EndToEndAuthorizationTests > - > > Key: KAFKA-16063 > URL: https://issues.apache.org/jira/browse/KAFKA-16063 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-29 at 12.38.29.png > > > All test extending `EndToEndAuthorizationTest` are leaking > DefaultDirectoryService objects. > This can be observed using the heap dump at > [https://www.dropbox.com/scl/fi/4jaq8rowkmijaoj7ec1nm/GradleWorkerMain_10311_27_12_2023_13_37_08_Leak_Suspects.zip?rlkey=minkbvopb0c65m5wryqw234xb&dl=0] > (unzip this and you will find a hprof which can be opened with your > favourite heap analyzer) > The stack trace looks like this: > !Screenshot 2023-12-29 at 12.38.29.png! > > I suspect that the reason is because DefaultDirectoryService#startup() > registers a shutdownhook which is somehow messed up by > QuorumTestHarness#teardown(). > We need to investigate why this is leaking and fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16062) Upgrade mockito to 5.8.0
Divij Vaidya created KAFKA-16062: Summary: Upgrade mockito to 5.8.0 Key: KAFKA-16062 URL: https://issues.apache.org/jira/browse/KAFKA-16062 Project: Kafka Issue Type: Sub-task Reporter: Divij Vaidya Assignee: Divij Vaidya Upgrade to the latest version of Mockito, from 5.5.0 to 5.8.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16059) Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test
[ https://issues.apache.org/jira/browse/KAFKA-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16059: - Attachment: Screenshot 2023-12-29 at 11.13.01.png > Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test > -- > > Key: KAFKA-16059 > URL: https://issues.apache.org/jira/browse/KAFKA-16059 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Arpit Goyal >Priority: Major > Attachments: Screenshot 2023-12-29 at 11.13.01.png > > > We are leaking hundreds of ExpirationReaper-1-AlterAcls threads in one of the > tests in :core:test > {code:java} > "ExpirationReaper-1-AlterAcls" prio=0 tid=0x0 nid=0x0 waiting on condition > java.lang.Thread.State: TIMED_WAITING > on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3688fc67 > at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252) > at > java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672) > at > java.base@17.0.9/java.util.concurrent.DelayQueue.poll(DelayQueue.java:265) > at > app//org.apache.kafka.server.util.timer.SystemTimer.advanceClock(SystemTimer.java:87) > at > app//kafka.server.DelayedOperationPurgatory.advanceClock(DelayedOperation.scala:418) > at > app//kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper.doWork(DelayedOperation.scala:444) > at > app//org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) > {code} > The objective of this Jira is to identify the test and fix this leak -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16060) Some questions about tiered storage capabilities
[ https://issues.apache.org/jira/browse/KAFKA-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya resolved KAFKA-16060. -- Resolution: Not A Problem > Some questions about tiered storage capabilities > > > Key: KAFKA-16060 > URL: https://issues.apache.org/jira/browse/KAFKA-16060 > Project: Kafka > Issue Type: Wish > Components: core >Affects Versions: 3.6.1 >Reporter: Jianbin Chen >Priority: Major > > # If a topic has 3 replicas, when the local expiration time is reached, will > all 3 replicas trigger the log transfer to the remote storage, or will only > the leader in the isr transfer the log to the remote storage (hdfs, s3) > # Topics that do not support compression, do you mean topics that > log.cleanup.policy=compact? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16059) Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test
[ https://issues.apache.org/jira/browse/KAFKA-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801128#comment-17801128 ] Divij Vaidya commented on KAFKA-16059: -- Having said that, please wait before picking up this JIRA. I think this is resolved by the commit I mentioned above. I am verifying it now by running the full suite. > Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test > -- > > Key: KAFKA-16059 > URL: https://issues.apache.org/jira/browse/KAFKA-16059 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Arpit Goyal >Priority: Major > Attachments: Screenshot 2023-12-29 at 11.13.01.png > > > We are leaking hundreds of ExpirationReaper-1-AlterAcls threads in one of the > tests in :core:test > {code:java} > "ExpirationReaper-1-AlterAcls" prio=0 tid=0x0 nid=0x0 waiting on condition > java.lang.Thread.State: TIMED_WAITING > on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3688fc67 > at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252) > at > java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672) > at > java.base@17.0.9/java.util.concurrent.DelayQueue.poll(DelayQueue.java:265) > at > app//org.apache.kafka.server.util.timer.SystemTimer.advanceClock(SystemTimer.java:87) > at > app//kafka.server.DelayedOperationPurgatory.advanceClock(DelayedOperation.scala:418) > at > app//kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper.doWork(DelayedOperation.scala:444) > at > app//org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) > {code} > The objective of this Jira is to identify the test and fix this leak -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16059) Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test
[ https://issues.apache.org/jira/browse/KAFKA-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801128#comment-17801128 ] Divij Vaidya edited comment on KAFKA-16059 at 12/29/23 10:33 AM: - Having said that, please wait before starting to work on this JIRA. I think this is resolved by the commit I mentioned above. I am verifying it now by running the full suite. was (Author: divijvaidya): Having said that, please wait before picking up this JIRA. I think this is resolved by the commit I mentioned above. I am verifying it now by running the full suite. > Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test > -- > > Key: KAFKA-16059 > URL: https://issues.apache.org/jira/browse/KAFKA-16059 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Arpit Goyal >Priority: Major > Attachments: Screenshot 2023-12-29 at 11.13.01.png > > > We are leaking hundreds of ExpirationReaper-1-AlterAcls threads in one of the > tests in :core:test > {code:java} > "ExpirationReaper-1-AlterAcls" prio=0 tid=0x0 nid=0x0 waiting on condition > java.lang.Thread.State: TIMED_WAITING > on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3688fc67 > at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252) > at > java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672) > at > java.base@17.0.9/java.util.concurrent.DelayQueue.poll(DelayQueue.java:265) > at > app//org.apache.kafka.server.util.timer.SystemTimer.advanceClock(SystemTimer.java:87) > at > app//kafka.server.DelayedOperationPurgatory.advanceClock(DelayedOperation.scala:418) > at > app//kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper.doWork(DelayedOperation.scala:444) > at > app//org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) > {code} > The objective of this Jira is to identify the test and fix this leak -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16059) Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test
[ https://issues.apache.org/jira/browse/KAFKA-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801127#comment-17801127 ] Divij Vaidya commented on KAFKA-16059: -- Hey [~goyarpit] I am using intellij profiler and it's quite simple to use, no additional setup required. First, we will ensure that tests are executed using a single thread. You can specify it using the build parameter maxParallelForks, i.e. you execute the command `./gradlew -PmaxParallelForks=1 -PmaxScalacThreads=1 :core:test` Now, since the tests are executing, you can attach your favourite profiler to it. I am using Intellij profiler, where you select the process you want to attach the profiler to, right click and then click on "CPU and Memory live charts". You can also take a heap dump and a thread dump using this interface. !Screenshot 2023-12-29 at 11.13.01.png! > Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test > -- > > Key: KAFKA-16059 > URL: https://issues.apache.org/jira/browse/KAFKA-16059 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Arpit Goyal >Priority: Major > Attachments: Screenshot 2023-12-29 at 11.13.01.png > > > We are leaking hundreds of ExpirationReaper-1-AlterAcls threads in one of the > tests in :core:test > {code:java} > "ExpirationReaper-1-AlterAcls" prio=0 tid=0x0 nid=0x0 waiting on condition > java.lang.Thread.State: TIMED_WAITING > on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3688fc67 > at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252) > at > java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672) > at > java.base@17.0.9/java.util.concurrent.DelayQueue.poll(DelayQueue.java:265) > at > app//org.apache.kafka.server.util.timer.SystemTimer.advanceClock(SystemTimer.java:87) > at > app//kafka.server.DelayedOperationPurgatory.advanceClock(DelayedOperation.scala:418) > at > app//kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper.doWork(DelayedOperation.scala:444) > at > app//org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) > {code} > The objective of this Jira is to identify the test and fix this leak -- This message was sent by Atlassian Jira (v8.20.10#820010)
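As an alternative to attaching an IDE profiler, a small helper like the hypothetical sketch below can be dropped into a test (for example from an @AfterAll hook) to count live threads by name pattern using the standard java.lang.Thread API. The class and method names are made up for illustration and are not part of the Kafka test utilities.
{code:java}
// Hypothetical probe that counts live threads matching a name pattern. Run after
// a suspect test suite, it points at the test that leaves ExpirationReaper-*-AlterAcls
// threads behind without needing an external profiler.
import java.util.Map;
import java.util.regex.Pattern;

public class ThreadLeakProbe {
    public static long countThreads(String nameRegex) {
        Pattern pattern = Pattern.compile(nameRegex);
        return Thread.getAllStackTraces().keySet().stream()
            .filter(t -> pattern.matcher(t.getName()).matches())
            .count();
    }

    public static void main(String[] args) {
        long reapers = countThreads("ExpirationReaper-\\d+-AlterAcls");
        System.out.println("Leaked AlterAcls expiration reaper threads: " + reapers);
        // Printing every live thread name is a quick substitute for a full thread
        // dump when narrowing down which test leaves threads running.
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            System.out.println(e.getKey().getName());
        }
    }
}
{code}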
[jira] [Commented] (KAFKA-16060) Some questions about tiered storage capabilities
[ https://issues.apache.org/jira/browse/KAFKA-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801126#comment-17801126 ] Divij Vaidya commented on KAFKA-16060: -- Hey [~jianbin] Questions are best asked by sending an email to the developer or user mailing list specified at [https://kafka.apache.org/contact] I will answer your questions as a one-time exception here, but in future please send an email. 1. It's only the leader that copies data to remote storage. A replica will not delete logs locally, even if the local expiration time is reached, until it knows that the leader has copied the specified log to remote storage, i.e. data is never removed from local storage until a copy is available in remote storage. You can find more information about this at [https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage] 2. Yes, "compression" sounds like a typo. Where did you find it? As a reference, the early access notes for Tiered Storage at [https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Tiered+Storage+Early+Access+Release+Notes] mention it more clearly. > Some questions about tiered storage capabilities > > > Key: KAFKA-16060 > URL: https://issues.apache.org/jira/browse/KAFKA-16060 > Project: Kafka > Issue Type: Wish > Components: core >Affects Versions: 3.6.1 >Reporter: Jianbin Chen >Priority: Major > > # If a topic has 3 replicas, when the local expiration time is reached, will > all 3 replicas trigger the log transfer to the remote storage, or will only > the leader in the isr transfer the log to the remote storage (hdfs, s3) > # Topics that do not support compression, do you mean topics that > log.cleanup.policy=compact? -- This message was sent by Atlassian Jira (v8.20.10#820010)
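To make the retention interplay above concrete, here is a hedged sketch (not taken from the Jira) that sets tiered-storage retention on a topic with the Java Admin client, assuming a cluster where remote storage is already enabled on the brokers. The bootstrap address, topic name, and byte values are placeholders.
{code:java}
// Sketch: enable tiered storage on one topic and split retention between local
// disks and remote storage. Values here are illustrative only.
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTieredRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-tiered-topic");
            Collection<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("remote.storage.enable", "true"), AlterConfigOp.OpType.SET),
                // Total retention across local + remote storage.
                new AlterConfigOp(new ConfigEntry("retention.bytes", "2147483648"), AlterConfigOp.OpType.SET),
                // Only this much stays on local disks; older segments are served from remote.
                new AlterConfigOp(new ConfigEntry("local.retention.bytes", "536870912"), AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
{code}
Deletion follows the same split: only the leader copies segments out, and local segments become eligible for local cleanup once the leader has confirmed the remote copy, as described in the comment above.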
[jira] [Resolved] (KAFKA-16015) kafka-leader-election timeout values always overwritten by default values
[ https://issues.apache.org/jira/browse/KAFKA-16015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya resolved KAFKA-16015. -- Resolution: Fixed > kafka-leader-election timeout values always overwritten by default values > -- > > Key: KAFKA-16015 > URL: https://issues.apache.org/jira/browse/KAFKA-16015 > Project: Kafka > Issue Type: Bug > Components: admin, tools >Affects Versions: 3.5.1, 3.6.1 >Reporter: Sergio Troiano >Assignee: Sergio Troiano >Priority: Minor > Fix For: 3.7.0, 3.8.0 > > > Using the *kafka-leader-election.sh* I was getting random timeouts like these: > {code:java} > Error completing leader election (PREFERRED) for partition: > sebatestemptytopic-4: org.apache.kafka.common.errors.TimeoutException: The > request timed out. > Error completing leader election (PREFERRED) for partition: > __CruiseControlMetrics-3: org.apache.kafka.common.errors.TimeoutException: > The request timed out. > Error completing leader election (PREFERRED) for partition: > __KafkaCruiseControlModelTrainingSamples-18: > org.apache.kafka.common.errors.TimeoutException: The request timed out. > Error completing leader election (PREFERRED) for partition: > __KafkaCruiseControlPartitionMetricSamples-8: > org.apache.kafka.common.errors.TimeoutException: The request timed out. {code} > These timeouts were raised from the client side as the controller always > finished with all the Kafka leader elections. > One pattern I detected was always the timeouts were raised after about 15 > seconds. > > So i checked this command has an option to pass configurations > {code:java} > Option Description > -- --- > --admin.config Configuration properties files to pass > to the admin client {code} > I created the file in order to increment the values of *request.timeout.ms* > and *default.api.timeout.ms.* So even after increasing these values I got > the same result, timeouts were happening, like the new values were not having > any effect. > So I checked the source code and I came across with a bug, no matter the > value we pass to the timeouts the default values were ALWAYS overwriting them. > > This is the[3.6 > branch|https://github.com/apache/kafka/blob/3.6/core/src/main/scala/kafka/admin/LeaderElectionCommand.scala#L42] > {code:java} > object LeaderElectionCommand extends Logging { > def main(args: Array[String]): Unit = { > run(args, 30.second) > } def run(args: Array[String], timeout: Duration): Unit = { > val commandOptions = new LeaderElectionCommandOptions(args) > CommandLineUtils.maybePrintHelpOrVersion( > commandOptions, > "This tool attempts to elect a new leader for a set of topic > partitions. The type of elections supported are preferred replicas and > unclean replicas." > ) validate(commandOptions) val electionType = > commandOptions.options.valueOf(commandOptions.electionType) val > jsonFileTopicPartitions = > Option(commandOptions.options.valueOf(commandOptions.pathToJsonFile)).map { > path => > parseReplicaElectionData(Utils.readFileAsString(path)) > } val singleTopicPartition = ( > Option(commandOptions.options.valueOf(commandOptions.topic)), > Option(commandOptions.options.valueOf(commandOptions.partition)) > ) match { > case (Some(topic), Some(partition)) => Some(Set(new > TopicPartition(topic, partition))) > case _ => None > } /* Note: No need to look at --all-topic-partitions as we want this > to be None if it is use. > * The validate function should be checking that this option is required > if the --topic and --path-to-json-file > * are not specified. 
> */ > val topicPartitions = > jsonFileTopicPartitions.orElse(singleTopicPartition) val adminClient = { > val props = > Option(commandOptions.options.valueOf(commandOptions.adminClientConfig)).map > { config => > Utils.loadProps(config) > }.getOrElse(new Properties()) props.setProperty( > AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, > commandOptions.options.valueOf(commandOptions.bootstrapServer) > ) > props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, > timeout.toMillis.toString) > props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, > (timeout.toMillis / 2).toString) Admin.create(props) > } {code} > As we can see the default timeout is 30 seconds, and the request timeout is > 30/2 which validates the 15 seconds timeout. > Also we can see in the code how the custom values passed by the config file > are overwritten by the defaults. > > >
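The description above boils down to the tool unconditionally setting default.api.timeout.ms and request.timeout.ms after loading the --admin.config properties, so the 30s/15s defaults always win. A minimal sketch of the general fix pattern follows; it is not the actual Kafka patch, only an illustration of applying defaults with putIfAbsent so that user-supplied values survive.
{code:java}
// Sketch of the fix pattern: apply the tool's default timeouts only when the
// --admin.config file did not already set them.
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class TimeoutDefaults {
    static Properties withDefaultTimeouts(Properties userProps, long timeoutMs) {
        Properties props = new Properties();
        props.putAll(userProps); // values loaded from --admin.config
        props.putIfAbsent(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs));
        props.putIfAbsent(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, Long.toString(timeoutMs / 2));
        return props;
    }
}
{code}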
[jira] [Updated] (KAFKA-16015) kafka-leader-election timeout values always overwritten by default values
[ https://issues.apache.org/jira/browse/KAFKA-16015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16015: - Fix Version/s: 3.7.0 3.8.0 > kafka-leader-election timeout values always overwritten by default values > -- > > Key: KAFKA-16015 > URL: https://issues.apache.org/jira/browse/KAFKA-16015 > Project: Kafka > Issue Type: Bug > Components: admin, tools >Affects Versions: 3.5.1, 3.6.1 >Reporter: Sergio Troiano >Assignee: Sergio Troiano >Priority: Minor > Fix For: 3.7.0, 3.8.0 > > > Using the *kafka-leader-election.sh* I was getting random timeouts like these: > {code:java} > Error completing leader election (PREFERRED) for partition: > sebatestemptytopic-4: org.apache.kafka.common.errors.TimeoutException: The > request timed out. > Error completing leader election (PREFERRED) for partition: > __CruiseControlMetrics-3: org.apache.kafka.common.errors.TimeoutException: > The request timed out. > Error completing leader election (PREFERRED) for partition: > __KafkaCruiseControlModelTrainingSamples-18: > org.apache.kafka.common.errors.TimeoutException: The request timed out. > Error completing leader election (PREFERRED) for partition: > __KafkaCruiseControlPartitionMetricSamples-8: > org.apache.kafka.common.errors.TimeoutException: The request timed out. {code} > These timeouts were raised from the client side as the controller always > finished with all the Kafka leader elections. > One pattern I detected was always the timeouts were raised after about 15 > seconds. > > So i checked this command has an option to pass configurations > {code:java} > Option Description > -- --- > --admin.config Configuration properties files to pass > to the admin client {code} > I created the file in order to increment the values of *request.timeout.ms* > and *default.api.timeout.ms.* So even after increasing these values I got > the same result, timeouts were happening, like the new values were not having > any effect. > So I checked the source code and I came across with a bug, no matter the > value we pass to the timeouts the default values were ALWAYS overwriting them. > > This is the[3.6 > branch|https://github.com/apache/kafka/blob/3.6/core/src/main/scala/kafka/admin/LeaderElectionCommand.scala#L42] > {code:java} > object LeaderElectionCommand extends Logging { > def main(args: Array[String]): Unit = { > run(args, 30.second) > } def run(args: Array[String], timeout: Duration): Unit = { > val commandOptions = new LeaderElectionCommandOptions(args) > CommandLineUtils.maybePrintHelpOrVersion( > commandOptions, > "This tool attempts to elect a new leader for a set of topic > partitions. The type of elections supported are preferred replicas and > unclean replicas." > ) validate(commandOptions) val electionType = > commandOptions.options.valueOf(commandOptions.electionType) val > jsonFileTopicPartitions = > Option(commandOptions.options.valueOf(commandOptions.pathToJsonFile)).map { > path => > parseReplicaElectionData(Utils.readFileAsString(path)) > } val singleTopicPartition = ( > Option(commandOptions.options.valueOf(commandOptions.topic)), > Option(commandOptions.options.valueOf(commandOptions.partition)) > ) match { > case (Some(topic), Some(partition)) => Some(Set(new > TopicPartition(topic, partition))) > case _ => None > } /* Note: No need to look at --all-topic-partitions as we want this > to be None if it is use. > * The validate function should be checking that this option is required > if the --topic and --path-to-json-file > * are not specified. 
> */ > val topicPartitions = > jsonFileTopicPartitions.orElse(singleTopicPartition) val adminClient = { > val props = > Option(commandOptions.options.valueOf(commandOptions.adminClientConfig)).map > { config => > Utils.loadProps(config) > }.getOrElse(new Properties()) props.setProperty( > AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, > commandOptions.options.valueOf(commandOptions.bootstrapServer) > ) > props.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, > timeout.toMillis.toString) > props.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, > (timeout.toMillis / 2).toString) Admin.create(props) > } {code} > As we can see the default timeout is 30 seconds, and the request timeout is > 30/2 which validates the 15 seconds timeout. > Also we can see in the code how the custom values passed by the config file > are overwritten b
[jira] [Commented] (KAFKA-16059) Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test
[ https://issues.apache.org/jira/browse/KAFKA-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801035#comment-17801035 ] Divij Vaidya commented on KAFKA-16059: -- This might have been fixed by [https://github.com/apache/kafka/commit/a465fb124f95d86e87238fe2f431df7bcb01e8ef] > Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test > -- > > Key: KAFKA-16059 > URL: https://issues.apache.org/jira/browse/KAFKA-16059 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Priority: Major > > We are leaking hundreds of ExpirationReaper-1-AlterAcls threads in one of the > tests in :core:test > {code:java} > "ExpirationReaper-1-AlterAcls" prio=0 tid=0x0 nid=0x0 waiting on condition > java.lang.Thread.State: TIMED_WAITING > on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3688fc67 > at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252) > at > java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672) > at > java.base@17.0.9/java.util.concurrent.DelayQueue.poll(DelayQueue.java:265) > at > app//org.apache.kafka.server.util.timer.SystemTimer.advanceClock(SystemTimer.java:87) > at > app//kafka.server.DelayedOperationPurgatory.advanceClock(DelayedOperation.scala:418) > at > app//kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper.doWork(DelayedOperation.scala:444) > at > app//org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) > {code} > The objective of this Jira is to identify the test and fix this leak -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801034#comment-17801034 ] Divij Vaidya commented on KAFKA-16052: -- The fix in the PR for these issues improved things but did not completely fix the OOM. Here's the status now. The heap dump shows Mockito invocations of different types, such as a place where we are mocking FileRecords, with each invocation consuming 5MB of heap. We will end up fixing many tests to fix this. But I am curious as to why Mockito is not cleaning up its invocations. Why is it a "leak" after the test has finished executing? Should we try to upgrade the Mockito version and see if that fixes things? A second source of leak is ApplicationShutdownHooks, which starts when running the EndToEndAuthorization tests. It has something to do with the KDC server, since we also have DefaultDirectoryService retained objects on the heap. I will start a child Jira to look into this. The other part is leaked threads. You will notice in the picture below that leaked threads suddenly spike (not correlated with heap memory increase) by hundreds. A thread dump suggests a large number of ExpirationReaper-AlterACL threads. I am tracking that here: https://issues.apache.org/jira/browse/KAFKA-16059 !Screenshot 2023-12-28 at 18.44.19.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, Screenshot > 2023-12-28 at 18.44.19.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ?
maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
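Regarding the question of why Mockito retains invocations: by default every call on a mock is recorded, together with its arguments, so that it can later be verified, and that recorded state lives as long as the mock itself is reachable. The sketch below shows two standard Mockito mitigations, assuming Mockito 5.x and a hypothetical test class; it is not the fix adopted in the Kafka PRs, only an illustration of how the accumulated invocation state can be avoided or released.
{code:java}
// Two standard ways to keep mock invocation state from piling up in tests that
// mock large objects such as FileRecords.
import static org.mockito.Mockito.clearInvocations;
import static org.mockito.Mockito.framework;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.withSettings;

import org.apache.kafka.common.record.FileRecords;
import org.junit.jupiter.api.AfterEach;

public class FileRecordsMockingExample {

    // Option 1: a stub-only mock never records invocations, so large arguments
    // passed to it cannot be retained for later verification.
    private final FileRecords stubOnlyRecords = mock(FileRecords.class, withSettings().stubOnly());

    // Option 2: a regular mock, with invocations cleared after every test so the
    // captured arguments become garbage-collectable.
    private final FileRecords verifiableRecords = mock(FileRecords.class);

    @AfterEach
    void cleanUpMockState() {
        clearInvocations(verifiableRecords);
        // Also releases state held by the inline mock maker for the whole class.
        framework().clearInlineMocks();
    }
}
{code}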
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2023-12-28 at 18.44.19.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, Screenshot > 2023-12-28 at 18.44.19.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16059) Fix trhead l
Divij Vaidya created KAFKA-16059: Summary: Fix trhead l Key: KAFKA-16059 URL: https://issues.apache.org/jira/browse/KAFKA-16059 Project: Kafka Issue Type: Sub-task Reporter: Divij Vaidya We are leaking hundreds of ExpirationReaper-1-AlterAcls threads in one of the tests in :core:test {code:java} "ExpirationReaper-1-AlterAcls" prio=0 tid=0x0 nid=0x0 waiting on condition java.lang.Thread.State: TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3688fc67 at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method) at java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252) at java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672) at java.base@17.0.9/java.util.concurrent.DelayQueue.poll(DelayQueue.java:265) at app//org.apache.kafka.server.util.timer.SystemTimer.advanceClock(SystemTimer.java:87) at app//kafka.server.DelayedOperationPurgatory.advanceClock(DelayedOperation.scala:418) at app//kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper.doWork(DelayedOperation.scala:444) at app//org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) {code} The objective of this Jira is to identify the test and fix this leak -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16059) Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test
[ https://issues.apache.org/jira/browse/KAFKA-16059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16059: - Summary: Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test (was: Fix trhead l) > Fix leak of ExpirationReaper-1-AlterAcls threads in :core:test > -- > > Key: KAFKA-16059 > URL: https://issues.apache.org/jira/browse/KAFKA-16059 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Priority: Major > > We are leaking hundreds of ExpirationReaper-1-AlterAcls threads in one of the > tests in :core:test > {code:java} > "ExpirationReaper-1-AlterAcls" prio=0 tid=0x0 nid=0x0 waiting on condition > java.lang.Thread.State: TIMED_WAITING > on > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@3688fc67 > at java.base@17.0.9/jdk.internal.misc.Unsafe.park(Native Method) > at > java.base@17.0.9/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252) > at > java.base@17.0.9/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1672) > at > java.base@17.0.9/java.util.concurrent.DelayQueue.poll(DelayQueue.java:265) > at > app//org.apache.kafka.server.util.timer.SystemTimer.advanceClock(SystemTimer.java:87) > at > app//kafka.server.DelayedOperationPurgatory.advanceClock(DelayedOperation.scala:418) > at > app//kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper.doWork(DelayedOperation.scala:444) > at > app//org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:131) > {code} > The objective of this Jira is to identify the test and fix this leak -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16058) Fix leaked in ControllerApiTest
[ https://issues.apache.org/jira/browse/KAFKA-16058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16058: - Fix Version/s: 3.7.0 3.8.0 > Fix leaked in ControllerApiTest > --- > > Key: KAFKA-16058 > URL: https://issues.apache.org/jira/browse/KAFKA-16058 > Project: Kafka > Issue Type: Sub-task >Reporter: Luke Chen >Assignee: Luke Chen >Priority: Major > Fix For: 3.7.0, 3.8.0 > > > PR: https://github.com/apache/kafka/pull/15084 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800960#comment-17800960 ] Divij Vaidya edited comment on KAFKA-16052 at 12/28/23 10:27 AM: - I have verified [~showuon] that using actual ReplicaManager as in your PR fixes the heap utilization of this test. Here's a before/after picture of heap utilization with this test. I am now running a full test suite to validate the overall impact. Before: !Screenshot 2023-12-28 at 11.26.03.png! After: !Screenshot 2023-12-28 at 11.26.09.png! was (Author: divijvaidya): I can verify [~showuon] that using actual ReplicaManager as in your PR fixes it. Before: !Screenshot 2023-12-28 at 11.26.03.png! After: !Screenshot 2023-12-28 at 11.26.09.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2023-12-28 at 11.26.09.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16052: - Attachment: Screenshot 2023-12-28 at 11.26.03.png > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800960#comment-17800960 ] Divij Vaidya commented on KAFKA-16052: -- I can verify [~showuon] that using actual ReplicaManager as in your PR fixes it. Before: !Screenshot 2023-12-28 at 11.26.03.png! After: !Screenshot 2023-12-28 at 11.26.09.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot > 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, newRM.patch > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16053) Fix leaked Default DirectoryService
[ https://issues.apache.org/jira/browse/KAFKA-16053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16053: - Fix Version/s: 3.8.0 > Fix leaked Default DirectoryService > --- > > Key: KAFKA-16053 > URL: https://issues.apache.org/jira/browse/KAFKA-16053 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Fix For: 3.7.0, 3.6.2, 3.8.0 > > Attachments: Screenshot 2023-12-27 at 13.18.33.png > > > Heap dump hinted towards a leaked DefaultDirectoryService while running > :core:test. It used 123MB of retained memory. > This Jira fixes the leak. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16053) Fix leaked Default DirectoryService
[ https://issues.apache.org/jira/browse/KAFKA-16053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Divij Vaidya updated KAFKA-16053: - Fix Version/s: 3.7.0 3.6.2 > Fix leaked Default DirectoryService > --- > > Key: KAFKA-16053 > URL: https://issues.apache.org/jira/browse/KAFKA-16053 > Project: Kafka > Issue Type: Sub-task >Reporter: Divij Vaidya >Assignee: Divij Vaidya >Priority: Major > Fix For: 3.7.0, 3.6.2 > > Attachments: Screenshot 2023-12-27 at 13.18.33.png > > > Heap dump hinted towards a leaked DefaultDirectoryService while running > :core:test. It used 123MB of retained memory. > This Jira fixes the leak. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17800893#comment-17800893 ] Divij Vaidya edited comment on KAFKA-16052 at 12/27/23 11:23 PM: - Also, when we look at what is inside the InterceptedInvocation objects, it hints at " groupCoordinator.handleTxnCommitOffsets(member.group.groupId, "dummy-txn-id", producerId, producerEpoch, JoinGroupRequest.UNKNOWN_MEMBER_ID, Option.empty, JoinGroupRequest.UNKNOWN_GENERATION_ID, offsets, callbackWithTxnCompletion)" line. Note that there are ~700L invocations in mockito for this. (You can also play with the heap dump I linked above to find this information) !Screenshot 2023-12-28 at 00.18.56.png! was (Author: divijvaidya): Also, when we look at what is inside the InterceptedInvocation objects, it hints at " groupCoordinator.handleTxnCommitOffsets(member.group.groupId, "dummy-txn-id", producerId, producerEpoch, JoinGroupRequest.UNKNOWN_MEMBER_ID, Option.empty, JoinGroupRequest.UNKNOWN_GENERATION_ID, offsets, callbackWithTxnCompletion)" line. Note that there are ~700L invocations in mockito for this. !Screenshot 2023-12-28 at 00.18.56.png! > OOM in Kafka test suite > --- > > Key: KAFKA-16052 > URL: https://issues.apache.org/jira/browse/KAFKA-16052 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.7.0 >Reporter: Divij Vaidya >Priority: Major > Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot > 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot > 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot > 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png > > > *Problem* > Our test suite is failing with frequent OOM. Discussion in the mailing list > is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] > *Setup* > To find the source of leaks, I ran the :core:test build target with a single > thread (see below on how to do it) and attached a profiler to it. This Jira > tracks the list of action items identified from the analysis. > How to run tests using a single thread: > {code:java} > diff --git a/build.gradle b/build.gradle > index f7abbf4f0b..81df03f1ee 100644 > --- a/build.gradle > +++ b/build.gradle > @@ -74,9 +74,8 @@ ext { > "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" > )- maxTestForks = project.hasProperty('maxParallelForks') ? > maxParallelForks.toInteger() : Runtime.runtime.availableProcessors() > - maxScalacThreads = project.hasProperty('maxScalacThreads') ? > maxScalacThreads.toInteger() : > - Math.min(Runtime.runtime.availableProcessors(), 8) > + maxTestForks = 1 > + maxScalacThreads = 1 > userIgnoreFailures = project.hasProperty('ignoreFailures') ? > ignoreFailures : false userMaxTestRetries = > project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0 > diff --git a/gradle.properties b/gradle.properties > index 4880248cac..ee4b6e3bc1 100644 > --- a/gradle.properties > +++ b/gradle.properties > @@ -30,4 +30,4 @@ scalaVersion=2.13.12 > swaggerVersion=2.2.8 > task=build > org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC > -org.gradle.parallel=true > +org.gradle.parallel=false {code} > *Result of experiment* > This is how the heap memory utilized looks like, starting from tens of MB to > ending with 1.5GB (with spikes of 2GB) of heap being used as the test > executes. Note that the total number of threads also increases but it does > not correlate with sharp increase in heap memory usage. 
The heap dump is > available at > [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln&dl=0] > > !Screenshot 2023-12-27 at 14.22.21.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)