[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikumar updated KAFKA-16226:
------------------------------
Fix Version/s: 3.6.2 (was: 3.6.3)

> Java client: Performance regression in Trogdor benchmark with high partition counts
> ------------------------------------------------------------------------------------
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 3.7.0, 3.6.1
> Reporter: Mayank Shekhar Narula
> Assignee: Mayank Shekhar Narula
> Priority: Major
> Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
> Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in the Java client to skip the backoff period for a retried produce batch when the client knows of a newer leader for its partition.
> h1. What changed
> The implementation introduced a regression, noticed on a Trogdor benchmark running with high partition counts (36,000!). With the regression, the following metrics changed on the produce side:
> # record-queue-time-avg: increased from 20ms to 30ms.
> # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384], RecordAccumulator.partitionReady() and drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between the KafkaProducer's application thread, which calls send(), and the background thread, which actively sends producer batches to leaders.
> Lock profiles clearly show the increased synchronisation in the KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. the baseline (see below). Note that the synchronisation is much worse for partitionReady() in this benchmark, as it is called for each partition, and there are 36k partitions!
> h3. Lock Profile: KAFKA-15415
> !kafka_15415_lock_profile.png!
> h3. Lock Profile: Baseline
> !baseline_lock_profile.png!
> h1. Fix
> Synchronisation between the two threads has to be reduced in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it: it avoids Metadata.currentLeader() and instead relies on Cluster.leaderFor(). With the fix, the lock profile and metrics are similar to the baseline.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
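To make the contention described in this ticket concrete, here is a minimal sketch, not the actual client code: apart from Metadata.currentLeader() and Cluster.leaderFor(), every class and method name in it is hypothetical. It contrasts a synchronised leader lookup, where the sender's per-partition loop and the application thread calling send() compete for one monitor, with a lookup against an immutable cluster snapshot, which is the direction taken by the fix in https://github.com/apache/kafka/pull/15323.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; not Kafka client code.
class LeaderLookupSketch {

    // Style 1: a synchronised accessor, analogous to Metadata.currentLeader().
    // With 36k partitions, the sender thread takes this monitor tens of
    // thousands of times per drain pass, while the application thread needs
    // the same monitor when send() touches metadata.
    static class SynchronisedMetadata {
        private final Map<String, Integer> leaderByPartition = new ConcurrentHashMap<>();

        synchronized Integer currentLeader(String topicPartition) {
            return leaderByPartition.get(topicPartition);
        }

        synchronized void update(String topicPartition, int leaderId) {
            leaderByPartition.put(topicPartition, leaderId);
        }
    }

    // Style 2: read from an immutable snapshot, analogous to what
    // Cluster.leaderFor() offers. Metadata updates publish a new snapshot
    // instead of mutating shared state under a lock.
    static class ClusterSnapshot {
        private final Map<String, Integer> leaderByPartition;

        ClusterSnapshot(Map<String, Integer> leaders) {
            this.leaderByPartition = Map.copyOf(leaders);
        }

        Integer leaderFor(String topicPartition) {
            return leaderByPartition.get(topicPartition); // no monitor involved
        }
    }
}
{code}

The second style is why moving partitionReady() and drainBatchesForOneNode() onto Cluster.leaderFor() removes the contention: the Cluster object is an immutable metadata snapshot, so the per-partition lookups in the sender loop do not need to take a lock.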
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikumar updated KAFKA-16226:
------------------------------
Fix Version/s: 3.6.3 (was: 3.6.2)

> Java client: Performance regression in Trogdor benchmark with high partition counts
> ------------------------------------------------------------------------------------
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 3.7.0, 3.6.1
> Reporter: Mayank Shekhar Narula
> Assignee: Mayank Shekhar Narula
> Priority: Major
> Labels: kip-951
> Fix For: 3.8.0, 3.7.1, 3.6.3
> Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in the Java client to skip the backoff period for a retried produce batch when the client knows of a newer leader for its partition.
> h1. What changed
> The implementation introduced a regression, noticed on a Trogdor benchmark running with high partition counts (36,000!). With the regression, the following metrics changed on the produce side:
> # record-queue-time-avg: increased from 20ms to 30ms.
> # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384], RecordAccumulator.partitionReady() and drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between the KafkaProducer's application thread, which calls send(), and the background thread, which actively sends producer batches to leaders.
> Lock profiles clearly show the increased synchronisation in the KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. the baseline (see below). Note that the synchronisation is much worse for partitionReady() in this benchmark, as it is called for each partition, and there are 36k partitions!
> h3. Lock Profile: KAFKA-15415
> !kafka_15415_lock_profile.png!
> h3. Lock Profile: Baseline
> !baseline_lock_profile.png!
> h1. Fix
> Synchronisation between the two threads has to be reduced in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it: it avoids Metadata.currentLeader() and instead relies on Cluster.leaderFor(). With the fix, the lock profile and metrics are similar to the baseline.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayank Shekhar Narula updated KAFKA-16226:
------------------------------------------
Description:

h1. Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in the Java client to skip the backoff period for a retried produce batch when the client knows of a newer leader for its partition.

h1. What changed
The implementation introduced a regression, noticed on a Trogdor benchmark running with high partition counts (36,000!). With the regression, the following metrics changed on the produce side:
# record-queue-time-avg: increased from 20ms to 30ms.
# request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened
As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384], RecordAccumulator.partitionReady() and drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between the KafkaProducer's application thread, which calls send(), and the background thread, which actively sends producer batches to leaders.
Lock profiles clearly show the increased synchronisation in the KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. the baseline (see below). Note that the synchronisation is much worse for partitionReady() in this benchmark, as it is called for each partition, and there are 36k partitions!

h3. Lock Profile: KAFKA-15415
!kafka_15415_lock_profile.png!

h3. Lock Profile: Baseline
!baseline_lock_profile.png!

h1. Fix
Synchronisation between the two threads has to be reduced in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it: it avoids Metadata.currentLeader() and instead relies on Cluster.leaderFor(). With the fix, the lock profile and metrics are similar to the baseline.

was:

h1. Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in the Java client to skip the backoff period for a retried produce batch when the client knows of a newer leader for its partition.

h1. What changed
The implementation introduced a regression, noticed on a Trogdor benchmark running with high partition counts (36,000!). With the regression, the following metrics changed on the produce side:
# record-queue-time-avg: increased from 20ms to 30ms.
# request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened
As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384], RecordAccumulator.partitionReady() and drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between the KafkaProducer's application thread, which calls send(), and the background thread, which actively sends producer batches to leaders.
See the lock profiles, which clearly show the increased synchronisation in the KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. the baseline. Note that the synchronisation is much worse for partitionReady() in this benchmark, as it is called for each partition, and there are 36k partitions!

h3. Lock Profile: KAFKA-15415
!kafka_15415_lock_profile.png!

h3. Lock Profile: Baseline
!baseline_lock_profile.png!

h1. Fix
Synchronisation between the two threads has to be reduced in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it: it avoids Metadata.currentLeader() and instead relies on Cluster.leaderFor(). With the fix, the lock profile and metrics are similar to the baseline.


> Java client: Performance regression in Trogdor benchmark with high partition counts
> ------------------------------------------------------------------------------------
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 3.7.0, 3.6.1
> Reporter: Mayank Shekhar Narula
> Assignee: Mayank Shekhar Narula
> Priority: Major
> Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
> Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in the Java client to skip the backoff period for a retried produce batch when the client knows of a newer leader for its partition.
> h1. What changed
> The implementation introduced a regression, noticed on a Trogdor benchmark running with high partition counts (36,000!). With the regression, the following metrics changed on the produce side:
> # record-queue-time-avg: increased from 20ms to 30ms.
> # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384], RecordAccumulator.partitionReady() and drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between the KafkaProducer's application thread, which calls send(), and the background thread, which actively sends producer batches to leaders.
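For context on the optimisation that introduced the regression, the KIP-951 / KAFKA-15415 readiness check is roughly the following. This is a hedged sketch; the parameter names are illustrative rather than the real RecordAccumulator internals.

{code:java}
// Sketch of the KIP-951 idea: a retried produce batch may skip the remaining
// retry backoff if the client has since learned of a newer leader (epoch) for
// the batch's partition, because the earlier failure was likely caused by
// leadership moving. Names are illustrative, not the real client fields.
final class RetryBackoffSketch {

    static boolean batchReady(long nowMs,
                              long lastAttemptMs,
                              long retryBackoffMs,
                              int leaderEpochAtLastAttempt,
                              int currentLeaderEpoch) {
        boolean backoffElapsed = nowMs - lastAttemptMs >= retryBackoffMs;
        boolean newerLeaderKnown = currentLeaderEpoch > leaderEpochAtLastAttempt;
        return backoffElapsed || newerLeaderKnown;
    }

    private RetryBackoffSketch() { }
}
{code}

Checking the current leader per partition is what pulled the synchronised Metadata.currentLeader() call into the hot path described above.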
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Description: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h3. Lock Profile: Kafka-15415 !kafka_15415_lock_profile.png! h3. Lock Profile: Baseline !baseline_lock_profile.png! h1. Fix Synchronization has to be reduced between 2 threads in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids using Metadata.currentLeader() instead rely on Cluster.leaderFor(). With the fix, lock-profile & metrics are similar to baseline. was: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h3. Lock Profile: Kafka-15415 !kafka_15415_lock_profile.png! h3. Lock Profile: Baseline !baseline_lock_profile.png! h1. Fix Synchronization has to be reduced between 2 threads in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids using Metadata.currentLeader() instead rely on Cluster.leaderFor(). 
> Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png > > > h1. Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-bat
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Description: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h3. Lock Profile: Kafka-15415 !kafka_15415_lock_profile.png! h3. Lock Profile: Baseline !baseline_lock_profile.png! h1. Fix Synchronization has to be reduced between 2 threads in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids using Metadata.currentLeader() instead rely on Cluster.leaderFor(). was: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h2. Lock Profile: Kafka-15415 !kafka_15415_lock_profile.png! h2. Lock Profile: Baseline !baseline_lock_profile.png! h1. Fix Synchronization has to be reduced between 2 threads in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids using Metadata.currentLeader() instead rely on Cluster.leaderFor(). 
> Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png > > > h1. Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-batches to leaders. > See lock profiles that clearly show increase
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Attachment: baseline_lock_profile.png > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png > > > h1. Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the > synchronisation is much worse for paritionReady() in this benchmark as its > called for each partition, and it has 36k partitions! > h2. Lock Profile: Kafka-15415 > !Screenshot 2024-02-01 at 11.06.36.png! > h2. Lock Profile: Baseline > !image-20240201-105752.png! > h1. Fix > Synchronization has to be reduced between 2 threads in order to address this. > [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids > using Metadata.currentLeader() instead rely on Cluster.leaderFor(). > -- This message was sent by Atlassian Jira (v8.20.10#820010)
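The two figures quoted in this report, record-queue-time-avg and request-latency-avg, are standard producer metrics, so a regression of this kind can also be watched directly from the client. A small sketch, assuming a locally reachable broker and placeholder serializer settings:

{code:java}
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

// Prints the two producer metrics that regressed in this benchmark.
// Bootstrap servers and serializers below are placeholders.
public class ProducerMetricsProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
                String name = e.getKey().name();
                if (name.equals("record-queue-time-avg") || name.equals("request-latency-avg")) {
                    System.out.printf("%s (%s) = %s%n",
                            name, e.getKey().group(), e.getValue().metricValue());
                }
            }
        }
    }
}
{code}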
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Description: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h2. Lock Profile: Kafka-15415 !kafka_15415_lock_profile.png! h2. Lock Profile: Baseline !baseline_lock_profile.png! h1. Fix Synchronization has to be reduced between 2 threads in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids using Metadata.currentLeader() instead rely on Cluster.leaderFor(). was: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h2. Lock Profile: Kafka-15415 !Screenshot 2024-02-01 at 11.06.36.png! h2. Lock Profile: Baseline !image-20240201-105752.png! h1. Fix Synchronization has to be reduced between 2 threads in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids using Metadata.currentLeader() instead rely on Cluster.leaderFor(). 
> Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png > > > h1. Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-batches to leaders. > See lock profiles that clearly show
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Attachment: kafka_15415_lock_profile.png > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png > > > h1. Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the > synchronisation is much worse for paritionReady() in this benchmark as its > called for each partition, and it has 36k partitions! > h2. Lock Profile: Kafka-15415 > !Screenshot 2024-02-01 at 11.06.36.png! > h2. Lock Profile: Baseline > !image-20240201-105752.png! > h1. Fix > Synchronization has to be reduced between 2 threads in order to address this. > [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids > using Metadata.currentLeader() instead rely on Cluster.leaderFor(). > -- This message was sent by Atlassian Jira (v8.20.10#820010)
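The attached lock profiles come from the Trogdor run itself; the same contention shape can be approximated in isolation with a small JMH harness that pits a synchronised lookup against an immutable-snapshot lookup under several threads. This is a hypothetical reproduction sketch, not the benchmark used for this ticket:

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

// Hypothetical contention micro-benchmark across 36,000 simulated partitions.
@State(Scope.Benchmark)
public class LeaderLookupBench {

    private Map<Integer, Integer> leaders = buildLeaders();

    private static Map<Integer, Integer> buildLeaders() {
        Map<Integer, Integer> m = new HashMap<>();
        for (int partition = 0; partition < 36_000; partition++) {
            m.put(partition, partition % 6); // spread over 6 pretend brokers
        }
        return Map.copyOf(m);
    }

    private synchronized Integer synchronisedLookup(int partition) {
        return leaders.get(partition);
    }

    @Benchmark
    @Threads(8)
    public Integer viaSynchronisedAccessor() {
        return synchronisedLookup(ThreadLocalRandom.current().nextInt(36_000));
    }

    @Benchmark
    @Threads(8)
    public Integer viaImmutableSnapshot() {
        return leaders.get(ThreadLocalRandom.current().nextInt(36_000));
    }
}
{code}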
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Attachment: (was: image-20240201-105752.png) > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > h1. Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the > synchronisation is much worse for paritionReady() in this benchmark as its > called for each partition, and it has 36k partitions! > h2. Lock Profile: Kafka-15415 > !Screenshot 2024-02-01 at 11.06.36.png! > h2. Lock Profile: Baseline > !image-20240201-105752.png! > h1. Fix > Synchronization has to be reduced between 2 threads in order to address this. > [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids > using Metadata.currentLeader() instead rely on Cluster.leaderFor(). > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Attachment: (was: Screenshot 2024-02-01 at 11.06.36.png) > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > h1. Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the > synchronisation is much worse for paritionReady() in this benchmark as its > called for each partition, and it has 36k partitions! > h2. Lock Profile: Kafka-15415 > !Screenshot 2024-02-01 at 11.06.36.png! > h2. Lock Profile: Baseline > !image-20240201-105752.png! > h1. Fix > Synchronization has to be reduced between 2 threads in order to address this. > [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids > using Metadata.currentLeader() instead rely on Cluster.leaderFor(). > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Description: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h2. Lock Profile: Kafka-15415 !Screenshot 2024-02-01 at 11.06.36.png! h2. Lock Profile: Baseline !image-20240201-105752.png! h1. Fix Synchronization has to be reduced between 2 threads in order to address this. [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids using Metadata.currentLeader() instead rely on Cluster.leaderFor(). was: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h2. Lock Profile: Kafka-15415 !Screenshot 2024-02-01 at 11.06.36.png! h2. Lock Profile: Baseline !image-20240201-105752.png! h1. Fix > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: Screenshot 2024-02-01 at 11.06.36.png, > image-20240201-105752.png > > > h1. 
Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the > synchronisation is much worse for paritionReady() in this benchmark as its > called
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayank Shekhar Narula updated KAFKA-16226:
------------------------------------------
Description:

h1. Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in the Java client to skip the backoff period for a retried produce batch when the client knows of a newer leader for its partition.

h1. What changed
The implementation introduced a regression, noticed on a Trogdor benchmark running with high partition counts (36,000!). With the regression, the following metrics changed on the produce side:
# record-queue-time-avg: increased from 20ms to 30ms.
# request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened
As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384], RecordAccumulator.partitionReady() and drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between the KafkaProducer's application thread, which calls send(), and the background thread, which actively sends producer batches to leaders.
See the lock profiles, which clearly show the increased synchronisation in the KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. the baseline. Note that the synchronisation is much worse for partitionReady() in this benchmark, as it is called for each partition, and there are 36k partitions!

h2. Lock Profile: KAFKA-15415
!Screenshot 2024-02-01 at 11.06.36.png!

h2. Lock Profile: Baseline
!image-20240201-105752.png!

h1. Fix

was:

h1. Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in the Java client to skip the backoff period for a retried produce batch when the client knows of a newer leader for its partition.

h1. What changed
The implementation introduced a regression, noticed on a Trogdor benchmark running with high partition counts (36,000!). With the regression, the following metrics changed on the produce side:
# record-queue-time-avg: increased from 20ms to 30ms.
# request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened
As can be seen from the original [PR|[https://github.com/apache/kafka/pull/14384],] RecordAccumulator.partitionReady() and drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between the KafkaProducer's application thread, which calls send(), and the background thread, which actively sends producer batches to leaders.
See the lock profiles, which clearly show the increased synchronisation in the KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. the baseline. Note that the synchronisation is much worse for partitionReady() in this benchmark, as it is called for each partition, and there are 36k partitions!

h2. Lock Profile: KAFKA-15415
!Screenshot 2024-02-01 at 11.06.36.png!

h2. Lock Profile: Baseline
!image-20240201-105752.png!

h1. Fix


> Java client: Performance regression in Trogdor benchmark with high partition counts
> ------------------------------------------------------------------------------------
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 3.7.0, 3.6.1
> Reporter: Mayank Shekhar Narula
> Assignee: Mayank Shekhar Narula
> Priority: Major
> Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
> Attachments: Screenshot 2024-02-01 at 11.06.36.png, image-20240201-105752.png
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in the Java client to skip the backoff period for a retried produce batch when the client knows of a newer leader for its partition.
> h1. What changed
> The implementation introduced a regression, noticed on a Trogdor benchmark running with high partition counts (36,000!). With the regression, the following metrics changed on the produce side:
> # record-queue-time-avg: increased from 20ms to 30ms.
> # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384], RecordAccumulator.partitionReady() and drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between the KafkaProducer's application thread, which calls send(), and the background thread, which actively sends producer batches to leaders.
> See the lock profiles, which clearly show the increased synchronisation in the KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. the baseline. Note that the synchronisation is much worse for partitionReady() in this benchmark, as it is called for each partition, and there are 36k partitions!
> h2. Lock Profile: KAFKA-15415
> !Screenshot 2024-02-01 at 11.06.36.png!
> h2. Lock Profile: Baseline
> !image-20240201-105752.png!
> h1. Fix

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Description: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h2. Lock Profile: Kafka-15415 !Screenshot 2024-02-01 at 11.06.36.png! h2. Lock Profile: Baseline !image-20240201-105752.png! h1. Fix was: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|[https://github.com/apache/kafka/pull/14384]] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h2. Lock Profile: Kafka-15415 !Screenshot 2024-02-01 at 11.06.36.png! h2. Lock Profile: Baseline !image-20240201-105752.png! h1. Fix > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: Screenshot 2024-02-01 at 11.06.36.png, > image-20240201-105752.png > > > h1. Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. 
What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the > synchronisation is much worse for paritionReady() in this benchmark as its > called for each partition, and it has 36k partitions! > h2. Lock Profile: Kafka-15415 > !Screenshot 2024-02-01 at 11.06.36.png! > h2. Lock Profile: Baseline > !image-20240201-105752.png! > h1. Fix > > -- This message wa
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Description: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|[https://github.com/apache/kafka/pull/14384],] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h2. Lock Profile: Kafka-15415 !Screenshot 2024-02-01 at 11.06.36.png! h2. Lock Profile: Baseline !image-20240201-105752.png! h1. Fix was: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts(36000!). With regression, following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|[https://github.com/apache/kafka/pull/14384],] RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using synchronised method Metadata.currentLeader(). This has led to increased synchronization between KafkaProducer's application-thread that call send(), and background-thread that actively send producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for paritionReady() in this benchmark as its called for each partition, and it has 36k partitions! h2. Lock Profile: Kafka-15415 !Screenshot 2024-02-01 at 11.06.36.png! h2. Lock Profile: Baseline !image-20240201-105752.png! h1. Fix > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: Screenshot 2024-02-01 at 11.06.36.png, > image-20240201-105752.png > > > h1. Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. 
What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|[https://github.com/apache/kafka/pull/14384],] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the > synchronisation is much worse for paritionReady() in this benchmark as its > called for each partition, and it has 36k partitions! > h2. Lock Profile: Kafka-15415 > !Screenshot 2024-02-01 at 11.06.36.png! > h2. Lock Profile: Baseline > !image-20240201-105752.png! > h1. Fix > > -- This message
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Attachment: image-20240201-105752.png > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: Screenshot 2024-02-01 at 11.06.36.png, > image-20240201-105752.png > > > Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts(36000!). > With regression, following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > How it happened > As can be seen from the original > [PR|[https://github.com/apache/kafka/pull/14384],] > RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using > synchronised method Metadata.currentLeader(). This has led to increased > synchronization between KafkaProducer's application-thread that call send(), > and background-thread that actively send producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the > synchronisation is much worse for paritionReady() in this benchmark as its > called for each partition, and it has 36k partitions! > Fix -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Description: h1. Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. h1. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts (36000!). With the regression, the following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. h1. Why it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384], RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between KafkaProducer's application-thread that calls send(), and the background-thread that actively sends producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for partitionReady() in this benchmark as it's called for each partition, and it has 36k partitions! h2. Lock Profile: Kafka-15415 !Screenshot 2024-02-01 at 11.06.36.png! h2. Lock Profile: Baseline !image-20240201-105752.png! h1. Fix was: Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts (36000!). With the regression, the following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. How it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384], RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between KafkaProducer's application-thread that calls send(), and the background-thread that actively sends producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for partitionReady() in this benchmark as it's called for each partition, and it has 36k partitions! Fix > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: Screenshot 2024-02-01 at 11.06.36.png, > image-20240201-105752.png > > > h1. Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > h1. What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts (36000!). > With the regression, the following metrics changed on the produce side.
> # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > h1. Why it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384], > RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using > the synchronised method Metadata.currentLeader(). This has led to increased > synchronisation between KafkaProducer's application-thread that calls send(), > and the background-thread that actively sends producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR (highlighted in {color:#de350b}Red{color}) Vs baseline. Note the > synchronisation is much worse for partitionReady() in this benchmark as it's > called for each partition, and it has 36k partitions! > h2. Lock Profile: Kafka-15415 > !Screenshot 2024-02-01 at 11.06.36.png! > h2. Lock Profile: Baseline > !image-20240201-105752.png! > h1. Fix -- This message was sent by Atlassian Jira (v8.20.10#820010)
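To make the "Why it happened" section concrete, below is a minimal, self-contained Java sketch of the contention pattern described above. It is not the actual producer code: SimplifiedMetadata is a hypothetical stand-in for the client's synchronised Metadata.currentLeader(), and the per-partition loop stands in for partitionReady()/drainBatchesForOneNode() walking the benchmark's 36k partitions, so both threads funnel through a single monitor.
{code:java}
// Illustrative sketch only: a hypothetical stand-in for the synchronised
// metadata lookup, not the real RecordAccumulator/Metadata classes.
import java.util.concurrent.TimeUnit;

public class LockContentionSketch {

    // One monitor guards every leader lookup, like a synchronised Metadata method.
    static class SimplifiedMetadata {
        private long version = 0;

        synchronized long currentLeader(int partition) {
            return version + partition; // real code returns leader info; a long is enough here
        }

        synchronized void update() {
            version++; // the application-thread path touches the same monitor
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SimplifiedMetadata metadata = new SimplifiedMetadata();
        int partitions = 36_000; // the benchmark's partition count

        // "Background thread": every drain pass asks for the leader of each
        // partition, taking the shared lock 36k times per pass.
        Thread sender = new Thread(() -> {
            long blackhole = 0;
            for (int pass = 0; pass < 100; pass++) {
                for (int p = 0; p < partitions; p++) {
                    blackhole += metadata.currentLeader(p);
                }
            }
            System.out.println("sender done " + blackhole);
        });

        // "Application thread": repeatedly enters the same monitor, as the
        // send() path does when it resolves partition metadata.
        Thread application = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                metadata.update();
            }
            System.out.println("application done");
        });

        long start = System.nanoTime();
        sender.start();
        application.start();
        sender.join();
        application.join();
        System.out.printf("elapsed: %d ms%n",
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
    }
}
{code}
With 36,000 partitions every drain pass enters the monitor tens of thousands of times, so even a very short critical section adds up to the contention visible in the lock profiles.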
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Attachment: Screenshot 2024-02-01 at 11.06.36.png > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > Attachments: Screenshot 2024-02-01 at 11.06.36.png > > > Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts (36000!). > With the regression, the following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > How it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384], > RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using > the synchronised method Metadata.currentLeader(). This has led to increased > synchronisation between KafkaProducer's application-thread that calls send(), > and the background-thread that actively sends producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR (highlighted in {color:#de350b}Red{color}) Vs baseline. Note the > synchronisation is much worse for partitionReady() in this benchmark as it's > called for each partition, and it has 36k partitions! > Fix -- This message was sent by Atlassian Jira (v8.20.10#820010)
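The Background section quoted above is easier to follow with a small, purely illustrative sketch of the KAFKA-15415 idea: a retried produce-batch skips whatever backoff remains once the client knows of a newer leader for its partition. The class, field, and method names below are made up for the example and are not the real RecordAccumulator/ProducerBatch code.
{code:java}
// Illustrative sketch of "skip backoff if a newer leader is known";
// all names here are hypothetical, not the actual client internals.
public class RetryBackoffSketch {

    static final class BatchState {
        long lastAttemptMs;
        int leaderEpochAtLastAttempt;
    }

    static boolean readyToRetry(BatchState batch, int currentLeaderEpoch,
                                long retryBackoffMs, long nowMs) {
        boolean newerLeaderKnown = currentLeaderEpoch > batch.leaderEpochAtLastAttempt;
        boolean backoffElapsed = nowMs - batch.lastAttemptMs >= retryBackoffMs;
        // The optimisation: a newer leader short-circuits the remaining backoff wait.
        return newerLeaderKnown || backoffElapsed;
    }

    public static void main(String[] args) {
        BatchState batch = new BatchState();
        batch.lastAttemptMs = 1_000;
        batch.leaderEpochAtLastAttempt = 5;

        // Backoff not yet elapsed, but a newer leader (epoch 6) is known -> retry now.
        System.out.println(readyToRetry(batch, 6, 100, 1_050)); // true
        // Same timing, no newer leader -> keep waiting out the backoff.
        System.out.println(readyToRetry(batch, 5, 100, 1_050)); // false
    }
}
{code}
The regression discussed in this issue is not about this decision itself but about how the leader information backing it was looked up on the hot path.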
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Description: Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts (36000!). With the regression, the following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. How it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384], RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the synchronised method Metadata.currentLeader(). This has led to increased synchronisation between KafkaProducer's application-thread that calls send(), and the background-thread that actively sends producer-batches to leaders. See lock profiles that clearly show increased synchronisation in KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) Vs baseline. Note the synchronisation is much worse for partitionReady() in this benchmark as it's called for each partition, and it has 36k partitions! Fix was: Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts (36000!). With the regression, the following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. How it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] Fix > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts (36000!). > With the regression, the following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > How it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384], > RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using > the synchronised method Metadata.currentLeader(). This has led to increased > synchronisation between KafkaProducer's application-thread that calls send(), > and the background-thread that actively sends producer-batches to leaders. > See lock profiles that clearly show increased synchronisation in KAFKA-15415 > PR (highlighted in {color:#de350b}Red{color}) Vs baseline.
Note the > synchronisation is much worse for partitionReady() in this benchmark as it's > called for each partition, and it has 36k partitions! > Fix -- This message was sent by Atlassian Jira (v8.20.10#820010)
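For anyone reproducing the numbers above outside of Trogdor, the two produce-side metrics named in the description (record-queue-time-avg and request-latency-avg) can be read directly from the Java client. The sketch below is an assumption-laden example: it assumes a locally reachable broker at localhost:9092 and the standard producer metric registry, and it needs real produce load in place of the elided section to show meaningful values.
{code:java}
// Sketch: read the two metrics called out in the issue from a running producer.
// bootstrap.servers value and the surrounding load generation are assumptions.
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProduceLatencyMetrics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local test broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... drive produce load here (e.g. a benchmark workload) before sampling ...
            for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
                String name = e.getKey().name();
                if (name.equals("record-queue-time-avg") || name.equals("request-latency-avg")) {
                    System.out.printf("%s (%s) = %s%n",
                            name, e.getKey().group(), e.getValue().metricValue());
                }
            }
        }
    }
}
{code}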
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Description: Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts (36000!). With the regression, the following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. How it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] Fix was: Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts (36000!). With the regression, the following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. How it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] Fix > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts (36000!). > With the regression, the following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > How it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > Fix -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Description: Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts (36000!). With the regression, the following metrics changed on the produce side. # record-queue-time-avg: increased from 20ms to 30ms. # request-latency-avg: increased from 50ms to 100ms. How it happened As can be seen from the original [PR|https://github.com/apache/kafka/pull/14384] Fix was: Background https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in java-client to skip backoff period if client knows of a newer leader, for produce-batch being retried. What changed The implementation introduced a regression noticed on a trogdor-benchmark running with high partition counts (36000!). With the regression, the following metrics changed on the produce side. 1. record_queue_time_avg Regression Details Fix > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts (36000!). > With the regression, the following metrics changed on the produce side. > # record-queue-time-avg: increased from 20ms to 30ms. > # request-latency-avg: increased from 50ms to 100ms. > How it happened > As can be seen from the original > [PR|https://github.com/apache/kafka/pull/14384] > Fix -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Affects Version/s: 3.6.1 3.7.0 > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts (36000!). > With the regression, the following metrics changed on the produce side. > 1. record_queue_time_avg > Regression Details > Fix -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Labels: kip-951 (was: ) > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Labels: kip-951 > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts (36000!). > With the regression, the following metrics changed on the produce side. > 1. record_queue_time_avg > Regression Details > Fix -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts
[ https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Shekhar Narula updated KAFKA-16226: -- Summary: Java client: Performance regression in Trogdor benchmark with high partition counts (was: Performance regression in Trogdor benchmark with high partition counts) > Java client: Performance regression in Trogdor benchmark with high partition > counts > --- > > Key: KAFKA-16226 > URL: https://issues.apache.org/jira/browse/KAFKA-16226 > Project: Kafka > Issue Type: Bug > Components: clients >Reporter: Mayank Shekhar Narula >Assignee: Mayank Shekhar Narula >Priority: Major > Fix For: 3.6.2, 3.8.0, 3.7.1 > > > Background > https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in > java-client to skip backoff period if client knows of a newer leader, for > produce-batch being retried. > What changed > The implementation introduced a regression noticed on a trogdor-benchmark > running with high partition counts (36000!). > With the regression, the following metrics changed on the produce side. > 1. record_queue_time_avg > Regression Details > Fix -- This message was sent by Atlassian Jira (v8.20.10#820010)