[jira] [Updated] (KAFKA-16228) Add --remote-log-metadata-decoder to kafka-dump-log.sh

2024-02-06 Thread Federico Valeri (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Federico Valeri updated KAFKA-16228:

Labels: kip-required  (was: )

> Add --remote-log-metadata-decoder to kafka-dump-log.sh
> --
>
> Key: KAFKA-16228
> URL: https://issues.apache.org/jira/browse/KAFKA-16228
> Project: Kafka
>  Issue Type: New Feature
>  Components: Tiered-Storage
>Affects Versions: 3.6.1
>Reporter: Federico Valeri
>Priority: Major
>  Labels: kip-required
>
> It would be good to improve the kafka-dump-log.sh tool by adding a decoder
> flag for __remote_log_metadata records. Something like the following would be
> useful for debugging.
> {code}
> bin/kafka-dump-log.sh --remote-log-metadata-decoder --files 
> /opt/kafka/data/__remote_log_metadata-0/.log 
> {code}





[jira] [Updated] (KAFKA-16218) Partition reassignment can't complete if any target replica is out-of-sync

2024-02-06 Thread Drawxy (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drawxy updated KAFKA-16218:
---
Issue Type: Improvement  (was: Bug)

> Partition reassignment can't complete if any target replica is out-of-sync
> --
>
> Key: KAFKA-16218
> URL: https://issues.apache.org/jira/browse/KAFKA-16218
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.1.2
>Reporter: Drawxy
>Priority: Major
>
> Assume there are 4 brokers (1001,2001,3001,4001) and a topic partition 
> _foo-0_ (replicas [1001,2001,3001], isr [1001,3001]). Replica 2001 can't 
> catch up and becomes out-of-sync due to some issue.
> If we launch a partition reassignment for _foo-0_ (the target replica 
> list is [1001,2001,4001]), the reassignment can't complete even after the 
> added replica 4001 has caught up. At that point, the partition state 
> would be replicas [1001,2001,4001,3001], isr [1001,3001,4001].
>  
> The out-of-sync replica 2001 shouldn't make the partition reassignment stuck.
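
For context, such a reassignment is launched with a JSON file naming the target 
replicas; a minimal sketch (the bootstrap server and file name here are 
assumptions, not taken from the report):

{code}
$ cat reassign.json
{"version":1,"partitions":[{"topic":"foo","partition":0,"replicas":[1001,2001,4001]}]}
$ bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
    --reassignment-json-file reassign.json --execute
{code}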





Re: [PR] MINOR fix word spelling mistakes [kafka]

2024-02-06 Thread via GitHub


showuon merged PR #15331:
URL: https://github.com/apache/kafka/pull/15331





[jira] [Created] (KAFKA-16232) kafka hangs forever in the starting process if the authorizer future is not returned

2024-02-06 Thread Luke Chen (Jira)
Luke Chen created KAFKA-16232:
-

 Summary: kafka hangs forever in the starting process if the 
authorizer future is not returned
 Key: KAFKA-16232
 URL: https://issues.apache.org/jira/browse/KAFKA-16232
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 3.6.1
Reporter: Luke Chen


For security reasons, during broker startup we wait until all ACL entries are 
loaded before we start serving requests. But recently, we accidentally 
configured StandardAuthorizer on a ZK broker, and the broker never entered the 
RUNNING state because it was waiting for the StandardAuthorizer future to 
complete. Of course, setting the wrong configuration is a human error, but it 
would be better if we handled this case more gracefully. Suggestions:
1. Set a timeout for the authorizer future wait (how long is long enough?)
2. Add logs before and after the future wait, so the admin knows we're 
waiting for the authorizer future.

We can start with (2) and think about (1) later.
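
A minimal sketch of suggestion (2), with (1) as an optional bound on the wait; 
the logger, timeout value, and surrounding variables are assumptions:

{code:java}
// Authorizer#start returns one future per endpoint; log around each wait,
// and optionally bound it instead of blocking forever.
Map<Endpoint, ? extends CompletionStage<Void>> futures = authorizer.start(serverInfo);
for (Map.Entry<Endpoint, ? extends CompletionStage<Void>> entry : futures.entrySet()) {
    log.info("Waiting for authorizer start-up for endpoint {}", entry.getKey());
    entry.getValue().toCompletableFuture().get(timeoutMs, TimeUnit.MILLISECONDS); // suggestion (1)
    log.info("Authorizer start-up completed for endpoint {}", entry.getKey());
}
{code}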





[jira] [Commented] (KAFKA-12823) Remove Deprecated method KStream#through

2024-02-06 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815089#comment-17815089
 ] 

Matthias J. Sax commented on KAFKA-12823:
-

Look for tickets labeled "beginner" or "newbie": 
https://issues.apache.org/jira/browse/KAFKA-16209?jql=project%20%3D%20KAFKA%20AND%20labels%20in%20(Beginner%2C%20beginner%2C%20newbie%2C%20%22newbie%2B%2B%22)%20ORDER%20BY%20created%20DESC%2C%20priority%20DESC%2C%20updated%20DESC
 

> Remove Deprecated method KStream#through
> 
>
> Key: KAFKA-12823
> URL: https://issues.apache.org/jira/browse/KAFKA-12823
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams
>Reporter: Josep Prat
>Priority: Blocker
> Fix For: 4.0.0
>
>
> The method through in the Java and Scala KStream classes was deprecated in 
> version 2.6:
>  * org.apache.kafka.streams.scala.kstream.KStream#through
>  * org.apache.kafka.streams.kstream.KStream#through(java.lang.String)
>  * org.apache.kafka.streams.kstream.KStream#through(java.lang.String, 
> org.apache.kafka.streams.kstream.Produced)
>  
> See KAFKA-10003 and KIP-221
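
For anyone picking this up, KIP-221 replaced through() with repartition(); the 
migration looks roughly like this (a sketch with illustrative topic and stream 
names, not the removal diff itself):

{code:java}
// deprecated since 2.6, to be removed in 4.0:
KStream<String, String> out = stream.through("user-topic");

// KIP-221 replacement, using a managed repartition topic:
KStream<String, String> out2 = stream.repartition(Repartitioned.as("user"));

// or, to keep an explicitly named, user-managed topic:
stream.to("user-topic");
KStream<String, String> out3 = builder.stream("user-topic");
{code}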





[PR] MINOR fix word spelling mistakes [kafka]

2024-02-06 Thread via GitHub


eliasyaoyc opened a new pull request, #15331:
URL: https://github.com/apache/kafka/pull/15331

   - fix word spelling mistakes in KafkaRaftClient
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   





Re: [PR] KAFKA-14517: Implement regex subscriptions [kafka]

2024-02-06 Thread via GitHub


JimmyWang6 commented on PR #14327:
URL: https://github.com/apache/kafka/pull/14327#issuecomment-1931122011

   > Thanks for your reply!
   > I will remove this part of code
   
   @dajac
   I'm sorry, I replied with the wrong content earlier. Here is my email address: 
[www.wangzhiw...@qq.com](www.wangzhiw...@qq.com), and many thanks for the 
invitation.
   





[PR] KAFKA-16231: Update consumer_test.py to support KIP-848’s group protocol config [kafka]

2024-02-06 Thread via GitHub


kirktrue opened a new pull request, #15330:
URL: https://github.com/apache/kafka/pull/15330

   Added a new optional `group_protocol` parameter to the test methods, then 
passed that down to the `setup_consumer` method.
   
   Unfortunately, because the new consumer can only be used with the new 
coordinator, this required a new `@matrix` block (sketched below) instead of 
adding `group_protocol=["classic", "consumer"]` to the existing blocks.
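
   A sketch of the added block (mirroring the Jira description; the test shown 
is illustrative):

```python
@matrix(
    metadata_quorum=[quorum.isolated_kraft],
    use_new_coordinator=[True],
    group_protocol=["classic", "consumer"]
)
def test_the_consumer(self, metadata_quorum, use_new_coordinator, group_protocol):
    consumer = self.setup_consumer("my_topic", group_protocol=group_protocol)
```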
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   





[jira] [Updated] (KAFKA-14567) Kafka Streams crashes after ProducerFencedException

2024-02-06 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-14567:

Description: 
Running a Kafka Streams application with EOS-v2.

We first see a `ProducerFencedException`. After the fencing, the fenced thread 
crashes, resulting in a non-recoverable error:
{quote}[2022-12-22 13:49:13,423] ERROR [i-0c291188ec2ae17a0-StreamThread-3] stream-thread [i-0c291188ec2ae17a0-StreamThread-3] Failed to process stream task 1_2 due to the following error: (org.apache.kafka.streams.processor.internals.TaskExecutor)
org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=1_2, processor=KSTREAM-SOURCE-05, topic=node-name-repartition, partition=2, offset=539776276, stacktrace=java.lang.IllegalStateException: TransactionalId stream-soak-test-72b6e57c-c2f5-489d-ab9f-fdbb215d2c86-3: Invalid transition attempted from state FATAL_ERROR to state ABORTABLE_ERROR
at org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:974)
at org.apache.kafka.clients.producer.internals.TransactionManager.transitionToAbortableError(TransactionManager.java:394)
at org.apache.kafka.clients.producer.internals.TransactionManager.maybeTransitionToErrorState(TransactionManager.java:620)
at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:1079)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:959)
at org.apache.kafka.streams.processor.internals.StreamsProducer.send(StreamsProducer.java:257)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:207)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:162)
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:85)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forwardInternal(ProcessorContextImpl.java:290)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:269)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:228)
at org.apache.kafka.streams.kstream.internals.KStreamKTableJoinProcessor.process(KStreamKTableJoinProcessor.java:88)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:157)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forwardInternal(ProcessorContextImpl.java:290)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:269)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:228)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:84)
at org.apache.kafka.streams.processor.internals.StreamTask.lambda$doProcess$1(StreamTask.java:791)
at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:867)
at org.apache.kafka.streams.processor.internals.StreamTask.doProcess(StreamTask.java:791)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:722)
at org.apache.kafka.streams.processor.internals.TaskExecutor.processTask(TaskExecutor.java:95)
at org.apache.kafka.streams.processor.internals.TaskExecutor.process(TaskExecutor.java:76)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:1645)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:788)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:607)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:569)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:748)
at org.apache.kafka.streams.processor.internals.TaskExecutor.processTask(TaskExecutor.java:95)
at org.apache.kafka.streams.processor.internals.TaskExecutor.process(TaskExecutor.java:76)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:1645)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:788)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:607)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:569)
Caused by: java.lang.IllegalStateException: TransactionalId stream-soak-test-72b6e57c-c2f5-489d-ab9f-fdbb215d2c86-3: Invalid transition attempted from state FATAL_ERROR to state ABORTABLE_ERROR
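
The expected handling on fencing looks roughly like this (an illustrative 
sketch, not the actual Streams code): close the fenced producer and treat the 
task as migrated, rather than trying to abort a transaction from a fatal state.

{code:java}
try {
    producer.send(record, callback);
} catch (ProducerFencedException e) {
    // fencing is expected under EOS during rebalances: close the producer and
    // rethrow as a task-migrated condition so the thread can rejoin, instead
    // of transitioning the transaction from FATAL_ERROR to ABORTABLE_ERROR
    producer.close(Duration.ZERO);
    throw new TaskMigratedException("Producer fenced; task will be migrated", e);
}
{code}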
   

[PR] KIP848- Add JMH Benchmarks for Client And Server Side Assignors [kafka]

2024-02-06 Thread via GitHub


rreddy-22 opened a new pull request, #15329:
URL: https://github.com/apache/kafka/pull/15329

   This PR adds two JMH benchmark workloads to test the performance of the 
different assignors present on both the client side and the new server side 
(KIP-848).
   
   For the purpose of testing we've used the following assumptions:

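As a rough illustration of the shape of such a workload (a sketch with an 
assumed assignor and assumed parameters, not the PR's actual code):

```java
import java.util.*;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.consumer.ConsumerPartitionAssignor.GroupAssignment;
import org.apache.kafka.clients.consumer.ConsumerPartitionAssignor.GroupSubscription;
import org.apache.kafka.clients.consumer.ConsumerPartitionAssignor.Subscription;
import org.apache.kafka.clients.consumer.RangeAssignor;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.PartitionInfo;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)
public class ClientSideAssignorBenchmark {

    @Param({"100", "1000"})
    public int memberCount;   // assumed workload sizes

    private final RangeAssignor assignor = new RangeAssignor();
    private Cluster cluster;
    private GroupSubscription groupSubscription;

    @Setup(Level.Trial)
    public void setup() {
        // one topic with 64 partitions, subscribed by every member
        Node node = new Node(0, "localhost", 9092);
        List<PartitionInfo> partitions = new ArrayList<>();
        for (int p = 0; p < 64; p++)
            partitions.add(new PartitionInfo("topic", p, node, new Node[]{node}, new Node[]{node}));
        cluster = new Cluster("benchmark-cluster", Collections.singletonList(node),
                partitions, Collections.emptySet(), Collections.emptySet());

        Map<String, Subscription> subscriptions = new HashMap<>();
        for (int m = 0; m < memberCount; m++)
            subscriptions.put("member-" + m, new Subscription(Collections.singletonList("topic")));
        groupSubscription = new GroupSubscription(subscriptions);
    }

    @Benchmark
    public GroupAssignment assign() {
        return assignor.assign(cluster, groupSubscription);
    }
}
```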




[PR] KAFKA-16230: Update verifiable_consumer.py to support KIP-848’s group protocol config [kafka]

2024-02-06 Thread via GitHub


kirktrue opened a new pull request, #15328:
URL: https://github.com/apache/kafka/pull/15328

   Passes the new `--group-protocol` command line option to 
`VerifiableConsumer` (from KAFKA-16037/#15325) if the node is running version 
3.7.0+ of the consumer, as sketched below.
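
   A sketch of the version-gated wiring (names and the version constant are 
illustrative, not the PR's actual code):

```python
# only emit the flag when the node's consumer is new enough to understand it
if self.group_protocol is not None and node.version >= V_3_7_0:
    cmd += " --group-protocol %s" % self.group_protocol
```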
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   





[jira] [Commented] (KAFKA-15538) Client support for java regex based subscription

2024-02-06 Thread Phuc Hong Tran (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815041#comment-17815041
 ] 

Phuc Hong Tran commented on KAFKA-15538:


[~lianetm] I think instead of closing this ticket we can move the part about 
including the Pattern in the heartbeat request to ticket KAFKA-15561, as the 
foundation work for it is already there. This ticket would then be exclusively 
about enabling the related tests of the subscribe methods that use Pattern, wdyt?

> Client support for java regex based subscription
> 
>
> Key: KAFKA-15538
> URL: https://issues.apache.org/jira/browse/KAFKA-15538
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Phuc Hong Tran
>Priority: Critical
>  Labels: kip-848-client-support, newbie, regex
> Fix For: 3.8.0
>
>
> When using subscribe with a java regex (Pattern), we need to resolve it on 
> the client side to send the broker a list of topic names to subscribe to.
> Context:
> The new consumer group protocol uses [Google 
> RE2/J|https://github.com/google/re2j] for regular expressions and introduces 
> new methods in the consumer API to subscribe using a `SubscriptionPattern`. 
> Subscribing with a java `Pattern` will still be supported for a while but 
> will eventually be removed.
>  * When subscribe with SubscriptionPattern is used, the client should just 
> send the regex to the broker, and it will be resolved on the server side.
>  * When subscribe with Pattern is used, the regex should be resolved on the 
> client side.
> As part of this task, we should re-enable all integration tests defined in 
> the PlainTextAsyncConsumer that relate to subscription with pattern and that 
> are currently disabled for the new consumer + new protocol.
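
To illustrate the intended split (a sketch against the consumer API described 
in the KIP; the topic regex is an example):

{code:java}
// java Pattern: resolved on the client, which sends the broker explicit topic names
consumer.subscribe(Pattern.compile("events-.*"));

// RE2/J-backed SubscriptionPattern (new API from KIP-848): the regex itself is
// sent to the broker and resolved on the server side
consumer.subscribe(new SubscriptionPattern("events-.*"));
{code}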





Re: [PR] KAFKA-7663: Custom Processor supplied on addGlobalStore is not used when restoring state from topic [kafka]

2024-02-06 Thread via GitHub


wcarlson5 closed pull request #15326: KAFKA-7663: Custom Processor supplied on 
addGlobalStore is not used when restoring state from topic
URL: https://github.com/apache/kafka/pull/15326





[PR] MINOR Fix a case where not all ACLs for a given resource are written to ZK [kafka]

2024-02-06 Thread via GitHub


mumrah opened a new pull request, #15327:
URL: https://github.com/apache/kafka/pull/15327

   (no comment)





[PR] KAFKA-7663: Custom Processor supplied on addGlobalStore is not used when restoring state from topic [kafka]

2024-02-06 Thread via GitHub


wcarlson5 opened a new pull request, #15326:
URL: https://github.com/apache/kafka/pull/15326

   The global table is reloaded, but without going through the supplied 
processor; instead it calls GlobalStateManagerImpl#restoreState, which simply 
stores the input topic's K,V records into RocksDB. 
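
   For reference, a minimal sketch of the kind of topology affected (store, 
topic, and processor names are illustrative):

```java
// the supplied processor is applied during normal processing, but on restore
// restoreState() copies the topic's records into the store verbatim,
// bypassing this processor entirely
builder.addGlobalStore(
    storeBuilder,
    "input-topic",
    Consumed.with(Serdes.String(), Serdes.String()),
    () -> new MyTransformingProcessor()  // hypothetical Processor<String, String, Void, Void>
);
```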
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   





Re: [PR] KAFKA-16226 Reduce synchronization between producer threads [kafka]

2024-02-06 Thread via GitHub


hachikuji commented on code in PR #15323:
URL: https://github.com/apache/kafka/pull/15323#discussion_r1480620848


##
clients/src/main/java/org/apache/kafka/clients/producer/internals/RecordAccumulator.java:
##
@@ -647,27 +647,27 @@ private long batchReady(boolean exhausted, TopicPartition part, Node leader,
 }
 
 /**
- * Iterate over partitions to see which one have batches ready and collect leaders of those partitions
- * into the set of ready nodes.  If partition has no leader, add the topic to the set of topics with
- * no leader.  This function also calculates stats for adaptive partitioning.
+ * Iterate over partitions to see which one have batches ready and collect leaders of those
+ * partitions into the set of ready nodes.  If partition has no leader, add the topic to the set
+ * of topics with no leader.  This function also calculates stats for adaptive partitioning.
  *
- * @param metadata The cluster metadata
- * @param nowMs The current time
- * @param topic The topic
- * @param topicInfo The topic info
+ * @param cluster               The cluster metadata
+ * @param nowMs                 The current time
+ * @param topic                 The topic
+ * @param topicInfo             The topic info
  * @param nextReadyCheckDelayMs The delay for next check
- * @param readyNodes The set of ready nodes (to be filled in)
- * @param unknownLeaderTopics The set of topics with no leader (to be filled in)
+ * @param readyNodes            The set of ready nodes (to be filled in)
+ * @param unknownLeaderTopics   The set of topics with no leader (to be filled in)
  * @return The delay for next check
  */
-private long partitionReady(Metadata metadata, long nowMs, String topic,
+private long partitionReady(Cluster cluster, long nowMs, String topic,

Review Comment:
   In some ways, this is a step backwards. We have been trying to reduce the 
reliance on `Cluster` internally because it is public. With a lot of internal 
usage, we end up making changes to the API which are only needed for the 
internal implementation (as we are doing in this PR). Have you considered 
alternatives? Perhaps we could expose something like `Cluster`, but with a 
reduced scope?
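
   For illustration, the reduced-scope view might look something like this (a 
hypothetical interface, not an existing Kafka API):

```java
// exposes only what RecordAccumulator needs, keeping the public Cluster
// class out of internal method signatures
interface ProducerMetadataView {
    Node leaderFor(TopicPartition partition);   // null if the leader is unknown
    List<PartitionInfo> partitionsForTopic(String topic);
}
```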






Re: [PR] KAFKA-16206 Fix unnecessary topic config deletion during ZK migration [kafka]

2024-02-06 Thread via GitHub


ahuang98 commented on PR #14206:
URL: https://github.com/apache/kafka/pull/14206#issuecomment-1930876640

   Is there a downside to having `deleteTopic` in `ZkTopicMigrationClient` not 
delete configs? Otherwise changing the logging level seems okay to me. 





Re: [PR] KAFKA-16055: Thread unsafe use of HashMap stored in QueryableStoreProvider#storeProviders [kafka]

2024-02-06 Thread via GitHub


wcarlson5 commented on PR #15121:
URL: https://github.com/apache/kafka/pull/15121#issuecomment-1930810239

   @ableegoldman I think you have context on this issue. 
   
   https://lists.apache.org/thread/gpct1275bfqovlckptn3lvf683qpoxol





Re: [PR] KAFKA-16215; KAFKA-16178: Fix member not rejoining after error [kafka]

2024-02-06 Thread via GitHub


lianetm commented on PR #15311:
URL: https://github.com/apache/kafka/pull/15311#issuecomment-1930740037

   Hey @dajac, I updated this to ensure we record a failed attempt for all 
errors in the HB. That effectively updates the received time and backoff, with 
the ability to skip the backoff (0 backoff) for specific errors where the next 
HB depends on other conditions (coordinator discovered, assignment released), 
so we won't add any extra backoff there.





[PR] KAFKA-16037: Update VerifiableConsumer to support KIP-848’s group protocol config [kafka]

2024-02-06 Thread via GitHub


kirktrue opened a new pull request, #15325:
URL: https://github.com/apache/kafka/pull/15325

   Add the optional `--group-protocol` command line option that can be set in 
the system tests.
   
   There are no existing unit tests for `VerifiableConsumer`; it was tested by 
running the system tests locally without regression. 
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   





[jira] [Updated] (KAFKA-16231) Update consumer_test.py to support KIP-848’s group protocol config

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16231:
--
Description: 
This task is to update {{consumer_test.py}} to support the {{group.protocol}} 
configuration introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
 by adding an optional {{group_protocol}} argument to the tests and matrices.

For example, here's how it would look to add the new group_protocol parameter 
to the parameterized tests:
{code:python}
@cluster(num_nodes=6)
@matrix(
    assignment_strategy=["org.apache.kafka.clients.consumer.RangeAssignor",
                         "org.apache.kafka.clients.consumer.RoundRobinAssignor",
                         "org.apache.kafka.clients.consumer.StickyAssignor"],
    metadata_quorum=[quorum.zk],
    use_new_coordinator=[False]
)
@matrix(
    assignment_strategy=["org.apache.kafka.clients.consumer.RangeAssignor",
                         "org.apache.kafka.clients.consumer.RoundRobinAssignor",
                         "org.apache.kafka.clients.consumer.StickyAssignor"],
    metadata_quorum=[quorum.isolated_kraft],
    use_new_coordinator=[False]
)
@matrix(
    metadata_quorum=[quorum.isolated_kraft],
    use_new_coordinator=[True],
    group_protocol=["classic", "consumer"]
)
def test_the_consumer(self, assignment_strategy, metadata_quorum=quorum.zk,
                      use_new_coordinator=False, group_protocol="classic"):
    consumer = self.setup_consumer("my_topic", group_protocol=group_protocol)
{code}
The {{group_protocol}} parameter will default to {{classic}}.

*Note*: we only test the new group protocol setting when 
{{use_new_coordinator}} is {{True}}, as that is the only supported mode.

  was:This task is to update {{verifiable_consumer.py}} to support the 
{{group.protocol}} configuration introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
 by adding an optional {{group_protocol}} argument. It will default to classic 
and we will take a separate task (Jira) to update the callers.


> Update consumer_test.py to support KIP-848’s group protocol config
> --
>
> Key: KAFKA-16231
> URL: https://issues.apache.org/jira/browse/KAFKA-16231
> Project: Kafka
>  Issue Type: Test
>  Components: clients, consumer, system tests
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Blocker
>  Labels: kip-848-client-support, system-tests
> Fix For: 3.8.0
>
>
> This task is to update {{consumer_test.py}} to support the {{group.protocol}} 
> configuration introduced in 
> [KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
>  by adding an optional {{group_protocol}} argument to the tests and matrices.
> For example, here's how it would look to add the new group_protocol parameter 
> to the parameterized tests:
> {code:python}
> @cluster(num_nodes=6)
> @matrix(
>     assignment_strategy=["org.apache.kafka.clients.consumer.RangeAssignor",
>                          "org.apache.kafka.clients.consumer.RoundRobinAssignor",
>                          "org.apache.kafka.clients.consumer.StickyAssignor"],
>     metadata_quorum=[quorum.zk],
>     use_new_coordinator=[False]
> )
> @matrix(
>     assignment_strategy=["org.apache.kafka.clients.consumer.RangeAssignor",
>                          "org.apache.kafka.clients.consumer.RoundRobinAssignor",
>                          "org.apache.kafka.clients.consumer.StickyAssignor"],
>     metadata_quorum=[quorum.isolated_kraft],
>     use_new_coordinator=[False]
> )
> @matrix(
>     metadata_quorum=[quorum.isolated_kraft],
>     use_new_coordinator=[True],
>     group_protocol=["classic", "consumer"]
> )
> def test_the_consumer(self, assignment_strategy, metadata_quorum=quorum.zk,
>                       use_new_coordinator=False, group_protocol="classic"):
>     consumer = self.setup_consumer("my_topic", group_protocol=group_protocol)
> {code}
> The {{group_protocol}} parameter will default to {{classic}}.
> *Note*: we only test the new group protocol setting when 
> {{use_new_coordinator}} is {{True}}, as that is the only supported mode.





[jira] [Created] (KAFKA-16231) Update consumer_test.py to support KIP-848’s group protocol config

2024-02-06 Thread Kirk True (Jira)
Kirk True created KAFKA-16231:
-

 Summary: Update consumer_test.py to support KIP-848’s group 
protocol config
 Key: KAFKA-16231
 URL: https://issues.apache.org/jira/browse/KAFKA-16231
 Project: Kafka
  Issue Type: Test
  Components: clients, consumer, system tests
Reporter: Kirk True
Assignee: Kirk True
 Fix For: 3.8.0


This task is to update {{verifiable_consumer.py}} to support the 
{{group.protocol}} configuration introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
 by adding an optional {{group_protocol}} argument. It will default to classic 
and we will take a separate task (Jira) to update the callers.





[jira] [Updated] (KAFKA-16230) Update verifiable_consumer.py to support KIP-848’s group protocol config

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16230:
--
Summary: Update verifiable_consumer.py to support KIP-848’s group protocol 
config  (was: Update verifiable_consumer to support KIP-848’s group protocol 
config)

> Update verifiable_consumer.py to support KIP-848’s group protocol config
> 
>
> Key: KAFKA-16230
> URL: https://issues.apache.org/jira/browse/KAFKA-16230
> Project: Kafka
>  Issue Type: Test
>  Components: clients, consumer, system tests
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Blocker
>  Labels: kip-848-client-support, system-tests
> Fix For: 3.8.0
>
>
> This task is to update {{verifiable_consumer.py}} to support the 
> {{group.protocol}} configuration introduced in 
> [KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
>  by adding an optional {{group_protocol}} argument. It will default to 
> classic and we will take a separate task (Jira) to update the callers.





[jira] [Updated] (KAFKA-16230) Update verifiable_consumer to support KIP-848’s group protocol config

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16230:
--
Description: This task is to update {{verifiable_consumer.py}} to support 
the {{group.protocol}} configuration introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
 by adding an optional {{group_protocol}} argument. It will default to classic 
and we will take a separate task (Jira) to update the callers.  (was: This task 
is to update {{VerifiableConsumer}} to support the {{group.protocol}} 
configuration introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
 by adding an optional {{--group-protocol}} command line option.)

> Update verifiable_consumer to support KIP-848’s group protocol config
> -
>
> Key: KAFKA-16230
> URL: https://issues.apache.org/jira/browse/KAFKA-16230
> Project: Kafka
>  Issue Type: Test
>  Components: clients, consumer, system tests
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Blocker
>  Labels: kip-848-client-support, system-tests
> Fix For: 3.8.0
>
>
> This task is to update {{verifiable_consumer.py}} to support the 
> {{group.protocol}} configuration introduced in 
> [KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
>  by adding an optional {{group_protocol}} argument. It will default to 
> classic and we will take a separate task (Jira) to update the callers.





[jira] [Created] (KAFKA-16230) Update verifiable_consumer to support KIP-848’s group protocol config

2024-02-06 Thread Kirk True (Jira)
Kirk True created KAFKA-16230:
-

 Summary: Update verifiable_consumer to support KIP-848’s group 
protocol config
 Key: KAFKA-16230
 URL: https://issues.apache.org/jira/browse/KAFKA-16230
 Project: Kafka
  Issue Type: Test
  Components: clients, consumer, system tests
Reporter: Kirk True
Assignee: Kirk True
 Fix For: 3.8.0


This task is to update {{VerifiableConsumer}} to support the {{group.protocol}} 
configuration introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
 by adding an optional {{--group-protocol}} command line option.





[jira] [Updated] (KAFKA-16037) Update VerifiableConsumer to support KIP-848’s group protocol config

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16037:
--
Description: This task is to update {{VerifiableConsumer}} to support the 
{{group.protocol}} configuration introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
 by adding an optional {{--group-protocol}} command line option.  (was: This 
task is to update the system tests so that they execute the Consumer for 
both of the group protocols introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]:

 * classic
 * consumer

This is done in a few steps...

First, update the Java-based VerifiableConsumer to support passing in the group 
protocol



by adding a new group_protocol parameter to the relevant parameterized tests:

{code:python}
@matrix(
    metadata_quorum=[quorum.isolated_kraft],
    use_new_coordinator=[True],
    group_protocol=["classic", "consumer"]
)
def test_the_consumer(self, group_protocol, metadata_quorum=quorum.zk,
                      use_new_coordinator=False):
    consumer = self.setup_consumer("my_topic", group_protocol=group_protocol)
{code}
The VerifiableConsumer class in Python represents the calling of the 
VerifiableConsumer class in Java. The  )

> Update VerifiableConsumer to support KIP-848’s group protocol config
> 
>
> Key: KAFKA-16037
> URL: https://issues.apache.org/jira/browse/KAFKA-16037
> Project: Kafka
>  Issue Type: Test
>  Components: clients, consumer, system tests
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Blocker
>  Labels: kip-848-client-support, system-tests
> Fix For: 3.8.0
>
>
> This task is to update {{VerifiableConsumer}} to support the 
> {{group.protocol}} configuration introduced in 
> [KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]
>  by adding an optional {{--group-protocol}} command line option.





[jira] [Updated] (KAFKA-16037) Update VerifiableConsumer to support KIP-848’s group protocol config

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16037:
--
Description: 
This task is to update the system tests so that they execute the Consumer 
for both of the group protocols introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]:

 * classic
 * consumer

This is done in a few steps...

First, update the Java-based VerifiableConsumer to support passing in the group 
protocol



by adding a new group_protocol parameter to the relevant parameterized tests:

{code:python}
@matrix(
    metadata_quorum=[quorum.isolated_kraft],
    use_new_coordinator=[True],
    group_protocol=["classic", "consumer"]
)
def test_the_consumer(self, group_protocol, metadata_quorum=quorum.zk,
                      use_new_coordinator=False):
    consumer = self.setup_consumer("my_topic", group_protocol=group_protocol)
{code}
The VerifiableConsumer class in Python represents the calling of the 
VerifiableConsumer class in Java. The  

  was:
This task is to parameterize the system tests so that they execute the Consumer 
for both of the group protocols introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]:

 * classic
 * consumer

This is done in a few steps...

First, update the Java-based VerifiableConsumer to support passing in the group 
protocol



by adding a new group_protocol parameter to the relevant parameterized tests:

{code:python}
@matrix(
    metadata_quorum=[quorum.isolated_kraft],
    use_new_coordinator=[True],
    group_protocol=["classic", "consumer"]
)
def test_the_consumer(self, group_protocol, metadata_quorum=quorum.zk,
                      use_new_coordinator=False):
    consumer = self.setup_consumer("my_topic", group_protocol=group_protocol)
{code}
The VerifiableConsumer class in Python represents the calling of the 
VerifiableConsumer class in Java. The  


> Update VerifiableConsumer to support KIP-848’s group protocol config
> 
>
> Key: KAFKA-16037
> URL: https://issues.apache.org/jira/browse/KAFKA-16037
> Project: Kafka
>  Issue Type: Test
>  Components: clients, consumer, system tests
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Blocker
>  Labels: kip-848-client-support, system-tests
> Fix For: 3.8.0
>
>
> This task is to update the system tests so that they execute the Consumer 
> for both of the group protocols introduced in 
> [KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]:
>  * classic
>  * consumer
> This is done in a few steps...
> First, update the Java-based VerifiableConsumer to support passing in the 
> group protocol
> by adding a new group_protocol parameter to the relevant parameterized tests:
> {code:python}
> @matrix(
>     metadata_quorum=[quorum.isolated_kraft],
>     use_new_coordinator=[True],
>     group_protocol=["classic", "consumer"]
> )
> def test_the_consumer(self, group_protocol, metadata_quorum=quorum.zk,
>                       use_new_coordinator=False):
>     consumer = self.setup_consumer("my_topic", group_protocol=group_protocol)
> {code}
> The VerifiableConsumer class in Python represents the calling of the 
> VerifiableConsumer class in Java. The  





[jira] [Updated] (KAFKA-16037) Update VerifiableConsumer to support KIP-848’s group protocol config

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16037:
--
Summary: Update VerifiableConsumer to support KIP-848’s group protocol 
config  (was: Upgrade existing system tests to use new consumer)

> Update VerifiableConsumer to support KIP-848’s group protocol config
> 
>
> Key: KAFKA-16037
> URL: https://issues.apache.org/jira/browse/KAFKA-16037
> Project: Kafka
>  Issue Type: Test
>  Components: clients, consumer, system tests
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Blocker
>  Labels: kip-848-client-support, system-tests
> Fix For: 3.8.0
>
>
> This task is to parameterize the system tests so that they execute the 
> Consumer for both of the group protocols introduced in 
> [KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]:
>  * classic
>  * consumer
> This is done in a few steps...
> First, update the Java-based VerifiableConsumer to support passing in the 
> group protocol
> by adding a new group_protocol parameter to the relevant parameterized tests:
> {code:python}
> @matrix(
>     metadata_quorum=[quorum.isolated_kraft],
>     use_new_coordinator=[True],
>     group_protocol=["classic", "consumer"]
> )
> def test_the_consumer(self, group_protocol, metadata_quorum=quorum.zk,
>                       use_new_coordinator=False):
>     consumer = self.setup_consumer("my_topic", group_protocol=group_protocol)
> {code}
> The VerifiableConsumer class in Python represents the calling of the 
> VerifiableConsumer class in Java. The  





[jira] [Updated] (KAFKA-16037) Upgrade existing system tests to use new consumer

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16037:
--
Description: 
This task is to parameterize the system tests so that they execute the Consumer 
for both of the group protocols introduced in 
[KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]:

 * classic
 * consumer

This is done in a few steps...

First, update the Java-based VerifiableConsumer to support passing in the group 
protocol



by adding a new group_protocol parameter to the relevant parameterized tests:

{code:python}
@matrix(
    metadata_quorum=[quorum.isolated_kraft],
    use_new_coordinator=[True],
    group_protocol=["classic", "consumer"]
)
def test_the_consumer(self, group_protocol, metadata_quorum=quorum.zk,
                      use_new_coordinator=False):
    consumer = self.setup_consumer("my_topic", group_protocol=group_protocol)
{code}
The VerifiableConsumer class in Python represents the calling of the 
VerifiableConsumer class in Java. The  

  was:This task is to parameterize the tests to run twice: both for the old and 
the new Consumer.


> Upgrade existing system tests to use new consumer
> -
>
> Key: KAFKA-16037
> URL: https://issues.apache.org/jira/browse/KAFKA-16037
> Project: Kafka
>  Issue Type: Test
>  Components: clients, consumer, system tests
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Blocker
>  Labels: kip-848-client-support, system-tests
> Fix For: 3.8.0
>
>
> This task is to parameterize the system tests so that they execute the 
> Consumer for both of the group protocols introduced in 
> [KIP-848|https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol]:
>  * classic
>  * consumer
> This is done in a few steps...
> First, update the Java-based VerifiableConsumer to support passing in the 
> group protocol
> by adding a new group_protocol parameter to the relevant parameterized tests:
> {code:python}
> @matrix(
>     metadata_quorum=[quorum.isolated_kraft],
>     use_new_coordinator=[True],
>     group_protocol=["classic", "consumer"]
> )
> def test_the_consumer(self, group_protocol, metadata_quorum=quorum.zk,
>                       use_new_coordinator=False):
>     consumer = self.setup_consumer("my_topic", group_protocol=group_protocol)
> {code}
> The VerifiableConsumer class in Python represents the calling of the 
> VerifiableConsumer class in Java. The  





[jira] [Assigned] (KAFKA-16037) Upgrade existing system tests to use new consumer

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True reassigned KAFKA-16037:
-

Assignee: Kirk True  (was: Dongnuo Lyu)

> Upgrade existing system tests to use new consumer
> -
>
> Key: KAFKA-16037
> URL: https://issues.apache.org/jira/browse/KAFKA-16037
> Project: Kafka
>  Issue Type: Test
>  Components: clients, consumer, system tests
>Reporter: Kirk True
>Assignee: Kirk True
>Priority: Blocker
>  Labels: kip-848-client-support, system-tests
> Fix For: 3.8.0
>
>
> This task is to parameterize the tests to run twice: both for the old and the 
> new Consumer.





Re: [PR] KAFKA-16215; KAFKA-16178: Fix member not rejoining after error [kafka]

2024-02-06 Thread via GitHub


lianetm commented on code in PR #15311:
URL: https://github.com/apache/kafka/pull/15311#discussion_r1480405280


##
clients/src/main/java/org/apache/kafka/clients/consumer/internals/RequestState.java:
##
@@ -106,6 +106,13 @@ public void onSendAttempt(final long currentTimeMs) {
 this.lastSentMs = currentTimeMs;
 }
 
+/**
+ * Update the lastReceivedTime in milliseconds, indicating that a response has been received.
+ */
+public void updateLastReceivedTime(final long lastReceivedMs) {
+this.lastReceivedMs = lastReceivedMs;

Review Comment:
   Exactly, I did see the skip-backoff logic in the producer when a new leader 
is discovered, related to what you described. I think the concept of "progress" 
here would be more abstract but still applicable: depending on the exact error, 
we know that the action it triggers is based on some progress (send HB as a new 
member, send HB when the coordinator is available). 






[jira] [Updated] (KAFKA-16227) Console consumer fails with `IllegalStateException`

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16227:
--
Priority: Critical  (was: Major)

> Console consumer fails with `IllegalStateException`
> ---
>
> Key: KAFKA-16227
> URL: https://issues.apache.org/jira/browse/KAFKA-16227
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Affects Versions: 3.7.0
>Reporter: David Jacot
>Assignee: Kirk True
>Priority: Critical
>  Labels: kip-848-client-support
> Fix For: 3.8.0
>
>
> I have seen a few occurrences like the following one. There is a race between 
> the background thread and the foreground thread. I imagine the following 
> steps:
>  * quickstart-events-2 is assigned by the background thread;
>  * the foreground thread starts the initialization of the partition (e.g. 
> reset offset);
>  * quickstart-events-2 is removed by the background thread;
>  * the initialization completes and quickstart-events-2 does not exist 
> anymore.
>  
> {code:java}
> [2024-02-06 16:21:57,375] ERROR Error processing message, terminating consumer process:  (kafka.tools.ConsoleConsumer$)
> java.lang.IllegalStateException: No current assignment for partition quickstart-events-2
>   at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
>   at org.apache.kafka.clients.consumer.internals.SubscriptionState.updateHighWatermark(SubscriptionState.java:579)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.handleInitializeSuccess(FetchCollector.java:283)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.initialize(FetchCollector.java:226)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.collectFetch(FetchCollector.java:110)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.collectFetch(AsyncKafkaConsumer.java:1540)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.pollForFetches(AsyncKafkaConsumer.java:1525)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.poll(AsyncKafkaConsumer.java:711)
>   at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:874)
>   at kafka.tools.ConsoleConsumer$ConsumerWrapper.receive(ConsoleConsumer.scala:473)
>   at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:103)
>   at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:77)
>   at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:54)
>   at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala) {code}
>  
>  
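
One way to avoid the race (an illustrative sketch, not the actual fix) is to 
re-check the assignment under the subscription-state lock before applying the 
update:

{code:java}
// in the fetch-initialization path: skip the update if the background thread
// has revoked the partition since initialization started
synchronized (subscriptions) {
    if (subscriptions.isAssigned(tp)) {
        subscriptions.updateHighWatermark(tp, highWatermark);
    } else {
        log.debug("Partition {} was revoked during initialization; skipping update", tp);
    }
}
{code}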





[jira] [Assigned] (KAFKA-16227) Console consumer fails with `IllegalStateException`

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True reassigned KAFKA-16227:
-

Assignee: (was: Kirk True)

> Console consumer fails with `IllegalStateException`
> ---
>
> Key: KAFKA-16227
> URL: https://issues.apache.org/jira/browse/KAFKA-16227
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Affects Versions: 3.7.0
>Reporter: David Jacot
>Priority: Critical
>  Labels: kip-848-client-support
> Fix For: 3.8.0
>
>
> I have seen a few occurrences like the following one. There is a race between 
> the background thread and the foreground thread. I imagine the following 
> steps:
>  * quickstart-events-2 is assigned by the background thread;
>  * the foreground thread starts the initialization of the partition (e.g. 
> reset offset);
>  * quickstart-events-2 is removed by the background thread;
>  * the initialization completes and quickstart-events-2 does not exist 
> anymore.
>  
> {code:java}
> [2024-02-06 16:21:57,375] ERROR Error processing message, terminating consumer process:  (kafka.tools.ConsoleConsumer$)
> java.lang.IllegalStateException: No current assignment for partition quickstart-events-2
>   at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
>   at org.apache.kafka.clients.consumer.internals.SubscriptionState.updateHighWatermark(SubscriptionState.java:579)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.handleInitializeSuccess(FetchCollector.java:283)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.initialize(FetchCollector.java:226)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.collectFetch(FetchCollector.java:110)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.collectFetch(AsyncKafkaConsumer.java:1540)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.pollForFetches(AsyncKafkaConsumer.java:1525)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.poll(AsyncKafkaConsumer.java:711)
>   at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:874)
>   at kafka.tools.ConsoleConsumer$ConsumerWrapper.receive(ConsoleConsumer.scala:473)
>   at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:103)
>   at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:77)
>   at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:54)
>   at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala) {code}
>  
>  





[jira] [Updated] (KAFKA-16227) Console consumer fails with `IllegalStateException`

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16227:
--
Labels: kip-848-client-support  (was: consumer-threading-refactor)

> Console consumer fails with `IllegalStateException`
> ---
>
> Key: KAFKA-16227
> URL: https://issues.apache.org/jira/browse/KAFKA-16227
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Affects Versions: 3.7.0
>Reporter: David Jacot
>Assignee: Kirk True
>Priority: Major
>  Labels: kip-848-client-support
> Fix For: 3.8.0
>
>
> I have seen a few occurrences like the following one. There is a race between 
> the background thread and the foreground thread. I imagine the following 
> steps:
>  * quickstart-events-2 is assigned by the background thread;
>  * the foreground thread starts the initialization of the partition (e.g. 
> reset offset);
>  * quickstart-events-2 is removed by the background thread;
>  * the initialization completes and quickstart-events-2 does not exist 
> anymore.
>  
> {code:java}
> [2024-02-06 16:21:57,375] ERROR Error processing message, terminating consumer process:  (kafka.tools.ConsoleConsumer$)
> java.lang.IllegalStateException: No current assignment for partition quickstart-events-2
>   at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
>   at org.apache.kafka.clients.consumer.internals.SubscriptionState.updateHighWatermark(SubscriptionState.java:579)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.handleInitializeSuccess(FetchCollector.java:283)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.initialize(FetchCollector.java:226)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.collectFetch(FetchCollector.java:110)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.collectFetch(AsyncKafkaConsumer.java:1540)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.pollForFetches(AsyncKafkaConsumer.java:1525)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.poll(AsyncKafkaConsumer.java:711)
>   at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:874)
>   at kafka.tools.ConsoleConsumer$ConsumerWrapper.receive(ConsoleConsumer.scala:473)
>   at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:103)
>   at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:77)
>   at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:54)
>   at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala) {code}
>  
>  





[jira] [Updated] (KAFKA-16227) Console consumer fails with `IllegalStateException`

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16227:
--
Fix Version/s: 3.8.0

> Console consumer fails with `IllegalStateException`
> ---
>
> Key: KAFKA-16227
> URL: https://issues.apache.org/jira/browse/KAFKA-16227
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Affects Versions: 3.7.0
>Reporter: David Jacot
>Assignee: Kirk True
>Priority: Major
>  Labels: consumer-threading-refactor
> Fix For: 3.8.0
>
>
> I have seen a few occurrences like the following one. There is a race between 
> the background thread and the foreground thread. I imagine the following 
> steps:
>  * quickstart-events-2 is assigned by the background thread;
>  * the foreground thread starts the initialization of the partition (e.g. 
> reset offset);
>  * quickstart-events-2 is removed by the background thread;
>  * the initialization completes and quickstart-events-2 does not exist 
> anymore.
>  
> {code:java}
> [2024-02-06 16:21:57,375] ERROR Error processing message, terminating consumer process:  (kafka.tools.ConsoleConsumer$)
> java.lang.IllegalStateException: No current assignment for partition quickstart-events-2
>   at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
>   at org.apache.kafka.clients.consumer.internals.SubscriptionState.updateHighWatermark(SubscriptionState.java:579)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.handleInitializeSuccess(FetchCollector.java:283)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.initialize(FetchCollector.java:226)
>   at org.apache.kafka.clients.consumer.internals.FetchCollector.collectFetch(FetchCollector.java:110)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.collectFetch(AsyncKafkaConsumer.java:1540)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.pollForFetches(AsyncKafkaConsumer.java:1525)
>   at org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.poll(AsyncKafkaConsumer.java:711)
>   at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:874)
>   at kafka.tools.ConsoleConsumer$ConsumerWrapper.receive(ConsoleConsumer.scala:473)
>   at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:103)
>   at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:77)
>   at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:54)
>   at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala) {code}
>  
>  
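To make the race concrete, here is a minimal mitigation sketch (an illustration
under the assumption that a fix would re-validate the assignment; this is not
the committed patch) for FetchCollector.handleInitializeSuccess:

{code:java}
// Hypothetical guard: re-check that the partition is still assigned before
// touching its state, since the background thread may have revoked it while
// the offset initialization was in flight. Names mirror the stack trace above.
if (!subscriptions.isAssigned(tp)) {
    log.debug("Partition {} was revoked during initialization; skipping update", tp);
    return;
}
subscriptions.updateHighWatermark(tp, partition.highWatermark());
{code}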



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16227) Console consumer fails with `IllegalStateException`

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16227:
--
Component/s: consumer

> Console consumer fails with `IllegalStateException`
> ---
>
> Key: KAFKA-16227
> URL: https://issues.apache.org/jira/browse/KAFKA-16227
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Affects Versions: 3.7.0
>Reporter: David Jacot
>Assignee: Kirk True
>Priority: Major
>
> I have seen a few occurrences like the following one. There is a race between 
> the background thread and the foreground thread. I imagine the following 
> steps:
>  * quickstart-events-2 is assigned by the background thread;
>  * the foreground thread starts the initialization of the partition (e.g. 
> reset offset);
>  * quickstart-events-2 is removed by the background thread;
>  * the initialization completes and quickstart-events-2 does not exist 
> anymore.
>  
> {code:java}
> [2024-02-06 16:21:57,375] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> java.lang.IllegalStateException: No current assignment for partition 
> quickstart-events-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.updateHighWatermark(SubscriptionState.java:579)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.handleInitializeSuccess(FetchCollector.java:283)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.initialize(FetchCollector.java:226)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.collectFetch(FetchCollector.java:110)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.collectFetch(AsyncKafkaConsumer.java:1540)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.pollForFetches(AsyncKafkaConsumer.java:1525)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.poll(AsyncKafkaConsumer.java:711)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:874)
>   at 
> kafka.tools.ConsoleConsumer$ConsumerWrapper.receive(ConsoleConsumer.scala:473)
>   at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:103)
>   at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:77)
>   at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:54)
>   at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16227) Console consumer fails with `IllegalStateException`

2024-02-06 Thread Kirk True (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirk True updated KAFKA-16227:
--
Labels: consumer-threading-refactor  (was: )

> Console consumer fails with `IllegalStateException`
> ---
>
> Key: KAFKA-16227
> URL: https://issues.apache.org/jira/browse/KAFKA-16227
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Affects Versions: 3.7.0
>Reporter: David Jacot
>Assignee: Kirk True
>Priority: Major
>  Labels: consumer-threading-refactor
>
> I have seen a few occurrences like the following one. There is a race between 
> the background thread and the foreground thread. I imagine the following 
> steps:
>  * quickstart-events-2 is assigned by the background thread;
>  * the foreground thread starts the initialization of the partition (e.g. 
> reset offset);
>  * quickstart-events-2 is removed by the background thread;
>  * the initialization completes and quickstart-events-2 does not exist 
> anymore.
>  
> {code:java}
> [2024-02-06 16:21:57,375] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> java.lang.IllegalStateException: No current assignment for partition 
> quickstart-events-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.updateHighWatermark(SubscriptionState.java:579)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.handleInitializeSuccess(FetchCollector.java:283)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.initialize(FetchCollector.java:226)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.collectFetch(FetchCollector.java:110)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.collectFetch(AsyncKafkaConsumer.java:1540)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.pollForFetches(AsyncKafkaConsumer.java:1525)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.poll(AsyncKafkaConsumer.java:711)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:874)
>   at 
> kafka.tools.ConsoleConsumer$ConsumerWrapper.receive(ConsoleConsumer.scala:473)
>   at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:103)
>   at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:77)
>   at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:54)
>   at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] KAFKA-15761: KRaft support in EpochDrivenReplicationProtocolAcceptanceTest [kafka]

2024-02-06 Thread via GitHub


mimaison commented on PR #15295:
URL: https://github.com/apache/kafka/pull/15295#issuecomment-1930519984

   The build is still failing with Java 8 and Scala 2.12 so we can't merge this 
PR as is.
   You should be able to reproduce the issue by running: `./gradlew 
-PscalaVersion=2.12 core:compileTestScala`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR:Type Casting Correction AND Null Pointer Exception (NPE) Defense [kafka]

2024-02-06 Thread via GitHub


mimaison commented on code in PR #9786:
URL: https://github.com/apache/kafka/pull/9786#discussion_r1480365763


##
streams/src/main/java/org/apache/kafka/streams/kstream/internals/InternalStreamsBuilder.java:
##
@@ -437,7 +438,10 @@ private void rewriteSingleStoreSelfJoin(
 if (currentNode instanceof StreamStreamJoinNode && currentNode.parentNodes().size() == 1) {
 final StreamStreamJoinNode joinNode = (StreamStreamJoinNode) currentNode;
 // Remove JoinOtherWindowed node
-final GraphNode parent = joinNode.parentNodes().stream().findFirst().get();
+final GraphNode parent = joinNode.parentNodes().stream()

Review Comment:
   It seems we checked there is one item in parentNodes in the `if` condition 
just above, so do we really need this?
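
   For reference, since the rest of the `+` line is truncated above, one
plausible (hypothetical) completion of the change under discussion is:

```java
// Hypothetical completion of the truncated '+' line above: swap Optional#get()
// for an explicit orElseThrow. The size() == 1 check in the enclosing 'if'
// already guarantees findFirst() is non-empty, so this only documents intent.
final GraphNode parent = joinNode.parentNodes().stream()
    .findFirst()
    .orElseThrow(() -> new IllegalStateException("Expected exactly one parent node"));
```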



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (KAFKA-7663) Custom Processor supplied on addGlobalStore is not used when restoring state from topic

2024-02-06 Thread Walker Carlson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Walker Carlson reassigned KAFKA-7663:
-

Assignee: Walker Carlson

> Custom Processor supplied on addGlobalStore is not used when restoring state 
> from topic
> ---
>
> Key: KAFKA-7663
> URL: https://issues.apache.org/jira/browse/KAFKA-7663
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 1.0.0
>Reporter: Frederic Tardif
>Assignee: Walker Carlson
>Priority: Major
>  Labels: new-streams-runtime-should-fix
> Attachments: image-2018-11-20-11-42-14-697.png
>
>
> I have implemented a StreamsBuilder#{{addGlobalStore}} supplying a custom 
> processor responsible for transforming a K,V record from the input stream into 
> a V,K record. It works fine and my {{store.all()}} does print the correct 
> persisted V,K records. However, if I clean the local store and restart the 
> stream app, the global table is reloaded, but without going through the 
> supplied processor; instead, it calls {{GlobalStateManagerImpl#restoreState}}, 
> which simply stores the input topic's K,V records into RocksDB (hence bypassing 
> the mapping function of my custom processor). I believe this is not the 
> expected result.
>  This is a follow-up on a Stack Overflow discussion around storing a K,V topic 
> as a global table with some stateless transformations based on a "custom" 
> processor added on the global store:
> [https://stackoverflow.com/questions/50993292/kafka-streams-shared-changelog-topic#comment93591818_50993729]
> If we address this issue, we should also apply 
> `default.deserialization.exception.handler` during restore (cf. KAFKA-8037)
>  
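For readers following along, here is a minimal sketch of the reported setup
(store, topic, and serde choices are illustrative, written against the
2.x-style Processor API; this is not the reporter's actual code):

{code:java}
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class GlobalStoreSwapExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Global store fed from "input-topic"; the processor flips K,V to V,K.
        builder.addGlobalStore(
            Stores.keyValueStoreBuilder(
                    Stores.persistentKeyValueStore("swapped-store"),
                    Serdes.String(), Serdes.String())
                .withLoggingDisabled(),              // required for global stores
            "input-topic",
            Consumed.with(Serdes.String(), Serdes.String()),
            () -> new Processor<String, String>() {
                private KeyValueStore<String, String> store;

                @Override
                @SuppressWarnings("unchecked")
                public void init(final ProcessorContext context) {
                    store = (KeyValueStore<String, String>) context.getStateStore("swapped-store");
                }

                @Override
                public void process(final String key, final String value) {
                    store.put(value, key);           // persist V,K instead of K,V
                }

                @Override
                public void close() {}
            });
        // The reported bug: on a clean restart, GlobalStateManagerImpl#restoreState
        // bypasses process() and writes the raw K,V records into the store.
    }
}
{code}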



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] MINOR: Update LICENSE-binary file [kafka]

2024-02-06 Thread via GitHub


mimaison merged PR #15322:
URL: https://github.com/apache/kafka/pull/15322


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-16229: Fix slow expired producer id deletion [kafka]

2024-02-06 Thread via GitHub


jolshan commented on PR #15324:
URL: https://github.com/apache/kafka/pull/15324#issuecomment-1930479050

   Hey @jeqo thanks for taking a look and improving this area!
   
   Can we add the benchmarks from the ticket to the PR description?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (KAFKA-16133) Commits during reconciliation always time out

2024-02-06 Thread Lianet Magrans (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianet Magrans resolved KAFKA-16133.

Fix Version/s: 3.7.0
   (was: 3.8.0)
   Resolution: Fixed

> Commits during reconciliation always time out
> -
>
> Key: KAFKA-16133
> URL: https://issues.apache.org/jira/browse/KAFKA-16133
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer
>Affects Versions: 3.7.0
>Reporter: Lucas Brutschy
>Assignee: Lianet Magrans
>Priority: Critical
>  Labels: consumer-threading-refactor, reconciliation, timeout
> Fix For: 3.7.0
>
>
> This only affects the AsyncKafkaConsumer, which is in Preview in 3.7.
> In MembershipManagerImpl there is a confusion between timeouts and deadlines. 
> [https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/MembershipManagerImpl.java#L836C38-L836C38]
> This causes all autocommits during reconciliation to immediately time out.
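
To illustrate the timeout-vs-deadline confusion (a generic sketch, not the
actual MembershipManagerImpl code):

{code:java}
long currentTimeMs = System.currentTimeMillis();
long retryTimeoutMs = 30_000L;                     // relative duration
long deadlineMs = currentTimeMs + retryTimeoutMs;  // absolute point in time
// Correct: the commit expires only once the deadline has passed.
boolean expired = currentTimeMs >= deadlineMs;
// Bug pattern: passing the relative timeout where an absolute deadline is
// expected; "currentTimeMs >= 30_000" is always true on a wall clock, so
// every autocommit is treated as already expired.
{code}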



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16229) Slow expiration of Producer IDs leading to high CPU usage

2024-02-06 Thread Jorge Esteban Quilcate Otoya (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Esteban Quilcate Otoya updated KAFKA-16229:
-
Description: 
Expiration of ProducerIds is implemented with a slow removal of map keys:

```
        producers.keySet().removeAll(keys);
```

Unnecessarily going through all producer ids and then throwing all expired keys 
to be removed.
This leads to quadratic time in the worst case, when most/all keys need to be 
removed:

```
Benchmark                                        (numProducerIds)  Mode  Cnt           Score            Error  Units
ProducerStateManagerBench.testDeleteExpiringIds               100  avgt    3        9164.043 ±      10647.877  ns/op
ProducerStateManagerBench.testDeleteExpiringIds              1000  avgt    3      341561.093 ±      20283.211  ns/op
ProducerStateManagerBench.testDeleteExpiringIds             10000  avgt    3    44957983.550 ±    9389011.290  ns/op
ProducerStateManagerBench.testDeleteExpiringIds            100000  avgt    3  5683374164.167 ± 1446242131.466  ns/op
```

A simple fix is to use map#remove(key) instead, leading to linear growth:

```

Benchmark                                        (numProducerIds)  Mode  Cnt        Score         Error  Units
ProducerStateManagerBench.testDeleteExpiringIds               100  avgt    3     5779.056 ±     651.389  ns/op
ProducerStateManagerBench.testDeleteExpiringIds              1000  avgt    3    61430.530 ±   21875.644  ns/op
ProducerStateManagerBench.testDeleteExpiringIds             10000  avgt    3   643887.031 ±  600475.302  ns/op
ProducerStateManagerBench.testDeleteExpiringIds            100000  avgt    3  7741689.539 ± 3218317.079  ns/op

```

  was:
Expiration of ProducerIds is implemented with a slow removal of map keys:

```
        producers.keySet().removeAll(keys);
```
 
Unnecessarily going through all producer ids and then throw all expired keys to 
be removed.
This leads to exponential time on worst case when most/all keys need to be 
removed:

```
Benchmark                                        (numProducerIds)  Mode  Cnt    
       Score            Error  Units
ProducerStateManagerBench.testDeleteExpiringIds               100  avgt    3    
    9164.043 ±      10647.877  ns/op
ProducerStateManagerBench.testDeleteExpiringIds              1000  avgt    3    
  341561.093 ±      20283.211  ns/op
ProducerStateManagerBench.testDeleteExpiringIds             1  avgt    3    
44957983.550 ±    9389011.290  ns/op
ProducerStateManagerBench.testDeleteExpiringIds            10  avgt    3  
5683374164.167 ± 1446242131.466  ns/op
```

A simple fix is to use map#remove(key) instead, leading to a more linear growth:

```
Benchmark(numProducerIds)  Mode  Cnt
Score Error  Units
ProducerStateManagerBench.testDeleteExpiringIds   100  avgt3
 5779.056 ± 651.389  ns/op
ProducerStateManagerBench.testDeleteExpiringIds  1000  avgt3
61430.530 ±   21875.644  ns/op
ProducerStateManagerBench.testDeleteExpiringIds 1  avgt3   
643887.031 ±  600475.302  ns/op
ProducerStateManagerBench.testDeleteExpiringIds10  avgt3  
7741689.539 ± 3218317.079  ns/op
```


> Slow expiration of Producer IDs leading to high CPU usage
> -
>
> Key: KAFKA-16229
> URL: https://issues.apache.org/jira/browse/KAFKA-16229
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jorge Esteban Quilcate Otoya
>Assignee: Jorge Esteban Quilcate Otoya
>Priority: Major
>
> Expiration of ProducerIds is implemented with a slow removal of map keys:
> ```
>         producers.keySet().removeAll(keys);
> ```
> Unnecessarily going through all producer ids and then throw all expired keys 
> to be removed.
> This leads to exponential time on worst case when most/all keys need to be 
> removed:
> ```
> Benchmark                                        (numProducerIds)  Mode  Cnt  
>          Score            Error  Units
> ProducerStateManagerBench.testDeleteExpiringIds               100  avgt    3  
>       9164.043 ±      10647.877  ns/op
> ProducerStateManagerBench.testDeleteExpiringIds              1000  avgt    3  
>     341561.093 ±      20283.211  ns/op
> ProducerStateManagerBench.testDeleteExpiringIds             1  avgt    3  
>   44957983.550 ±    9389011.290  ns/op
> ProducerStateManagerBench.testDeleteExpiringIds            10  avgt    3  
> 5683374164.167 ± 1446242131.466  ns/op
> ```
> A simple fix is to use map#remove(key) instead, leading to a more linear 
> growth:
> ```
> Benchmark                                        (numProducerIds)  Mode  Cnt  
>       Score         Error  Units
> ProducerStateManagerBench.testDeleteExpiringIds      

[PR] KAFKA-16229: Fix slow expired producer id deletion [kafka]

2024-02-06 Thread via GitHub


jeqo opened a new pull request, #15324:
URL: https://github.com/apache/kafka/pull/15324

   [[KAFKA-16229](https://issues.apache.org/jira/browse/KAFKA-16229)]
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-16215; KAFKA-16178: Fix member not rejoining after error [kafka]

2024-02-06 Thread via GitHub


AndrewJSchofield commented on code in PR #15311:
URL: https://github.com/apache/kafka/pull/15311#discussion_r1480305816


##
clients/src/main/java/org/apache/kafka/clients/consumer/internals/RequestState.java:
##
@@ -106,6 +106,13 @@ public void onSendAttempt(final long currentTimeMs) {
 this.lastSentMs = currentTimeMs;
 }
 
+/**
+ * Update the lastReceivedTime in milliseconds, indicating that a response has been received.
+ */
+public void updateLastReceivedTime(final long lastReceivedMs) {
+this.lastReceivedMs = lastReceivedMs;

Review Comment:
   The implementation of exponential backoff for KIP-580 introduced the idea of 
"equivalent responses". The idea is that exponential backoff applies when 
metadata responses are "equivalent", but when things are progressing, the 
backoff is avoided. I think a similar principle could be helpful here. 
Essentially, if "progress" is being made such as a coordinator change, do not 
apply the backoff.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (KAFKA-16229) Slow expiration of Producer IDs leading to high CPU usage

2024-02-06 Thread Jorge Esteban Quilcate Otoya (Jira)
Jorge Esteban Quilcate Otoya created KAFKA-16229:


 Summary: Slow expiration of Producer IDs leading to high CPU usage
 Key: KAFKA-16229
 URL: https://issues.apache.org/jira/browse/KAFKA-16229
 Project: Kafka
  Issue Type: Bug
Reporter: Jorge Esteban Quilcate Otoya
Assignee: Jorge Esteban Quilcate Otoya


Expiration of ProducerIds is implemented with a slow removal of map keys:

```
        producers.keySet().removeAll(keys);
```
 
Unnecessarily going through all producer ids and then throwing all expired keys 
to be removed.
This leads to quadratic time in the worst case, when most/all keys need to be 
removed:

```
Benchmark                                        (numProducerIds)  Mode  Cnt           Score            Error  Units
ProducerStateManagerBench.testDeleteExpiringIds               100  avgt    3        9164.043 ±      10647.877  ns/op
ProducerStateManagerBench.testDeleteExpiringIds              1000  avgt    3      341561.093 ±      20283.211  ns/op
ProducerStateManagerBench.testDeleteExpiringIds             10000  avgt    3    44957983.550 ±    9389011.290  ns/op
ProducerStateManagerBench.testDeleteExpiringIds            100000  avgt    3  5683374164.167 ± 1446242131.466  ns/op
```

A simple fix is to use map#remove(key) instead, leading to linear growth:

```
Benchmark                                        (numProducerIds)  Mode  Cnt        Score         Error  Units
ProducerStateManagerBench.testDeleteExpiringIds               100  avgt    3     5779.056 ±     651.389  ns/op
ProducerStateManagerBench.testDeleteExpiringIds              1000  avgt    3    61430.530 ±   21875.644  ns/op
ProducerStateManagerBench.testDeleteExpiringIds             10000  avgt    3   643887.031 ±  600475.302  ns/op
ProducerStateManagerBench.testDeleteExpiringIds            100000  avgt    3  7741689.539 ± 3218317.079  ns/op
```
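
For illustration, a self-contained sketch of the proposed approach (simplified 
types; not the actual ProducerStateManager code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExpireProducerIdsSketch {
    // Remove expired entries one key at a time. Each HashMap#remove(key) is
    // O(1), so the pass is linear overall, whereas keySet().removeAll(expired)
    // ends up scanning the expired list with contains() for every key it
    // inspects, which is quadratic when most or all keys have expired.
    static void removeExpired(Map<Long, Long> lastTimestampByProducerId, long cutoffMs) {
        List<Long> expired = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : lastTimestampByProducerId.entrySet()) {
            if (entry.getValue() < cutoffMs)
                expired.add(entry.getKey());
        }
        for (Long producerId : expired)
            lastTimestampByProducerId.remove(producerId);
    }

    public static void main(String[] args) {
        Map<Long, Long> producers = new HashMap<>();
        producers.put(1L, 100L);
        producers.put(2L, 900L);
        removeExpired(producers, 500L);  // expires producer 1 only
        System.out.println(producers);   // prints {2=900}
    }
}
```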



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16228) Add --remote-log-metadata-decoder to kafka-dump-log.sh

2024-02-06 Thread Federico Valeri (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Federico Valeri updated KAFKA-16228:

Description: 
It would be good to improve the kafka-dump-log.sh tool adding a decode flag for 
__remote_log_metadata records. Something like the following would be useful for 
debugging.

{code}
bin/kafka-dump-log.sh --remote-log-metadata-decoder --files 
/opt/kafka/data/__remote_log_metadata-0/.log 
{code}

  was:
It would be good to improve the kafka-dump-log.sh tool adding a decode flags 
for __remote_log_metadata records. Something like the following would be useful 
for debugging.

{code}
bin/kafka-dump-log.sh --remote-log-metadata-decoder --files 
/opt/kafka/data/__remote_log_metadata-0/.log 
{code}


> Add --remote-log-metadata-decoder to kafka-dump-log.sh
> --
>
> Key: KAFKA-16228
> URL: https://issues.apache.org/jira/browse/KAFKA-16228
> Project: Kafka
>  Issue Type: New Feature
>  Components: Tiered-Storage
>Affects Versions: 3.6.1
>Reporter: Federico Valeri
>Priority: Major
>
> It would be good to improve the kafka-dump-log.sh tool adding a decode flag 
> for __remote_log_metadata records. Something like the following would be 
> useful for debugging.
> {code}
> bin/kafka-dump-log.sh --remote-log-metadata-decoder --files 
> /opt/kafka/data/__remote_log_metadata-0/.log 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16228) Add --remote-log-metadata-decoder to kafka-dump-log.sh

2024-02-06 Thread Federico Valeri (Jira)
Federico Valeri created KAFKA-16228:
---

 Summary: Add --remote-log-metadata-decoder to kafka-dump-log.sh
 Key: KAFKA-16228
 URL: https://issues.apache.org/jira/browse/KAFKA-16228
 Project: Kafka
  Issue Type: New Feature
  Components: Tiered-Storage
Affects Versions: 3.6.1
Reporter: Federico Valeri


It would be good to improve the kafka-dump-log.sh tool adding a decode flags 
for __remote_log_metadata records. Something like the following would be useful 
for debugging.

{code}
bin/kafka-dump-log.sh --remote-log-metadata-decoder --files 
/opt/kafka/data/__remote_log_metadata-0/.log 
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] KAFKA-16215; KAFKA-16178: Fix member not rejoining after error [kafka]

2024-02-06 Thread via GitHub


lianetm commented on code in PR #15311:
URL: https://github.com/apache/kafka/pull/15311#discussion_r1480240755


##
clients/src/main/java/org/apache/kafka/clients/consumer/internals/RequestState.java:
##
@@ -106,6 +106,13 @@ public void onSendAttempt(final long currentTimeMs) {
 this.lastSentMs = currentTimeMs;
 }
 
+/**
+ * Update the lastReceivedTime in milliseconds, indicating that a response has been received.
+ */
+public void updateLastReceivedTime(final long lastReceivedMs) {
+this.lastReceivedMs = lastReceivedMs;

Review Comment:
   Thanks for the feedback. Agree with @dajac's point that we might be carrying 
over an undesired backoff by leaving it unchanged in this situation, so we 
definitely need to set one, and as @AndrewJSchofield said, all errors would 
update the last received time and apply "some" backoff. 
   
   That leads to deciding which one, and my intention was 0 backoff for a subset 
of errors. As I see it, there are some specific errors in HB where we don't 
just retry the same request, but rather send a different one, so it makes sense 
to skip the backoff as an optimization, e.g.: 
   1. HB to rejoin as a new member after a fence error. We would send it right 
away after the fence, skipping the backoff.
   2. HB to a new coordinator after a not_coordinator error. We would send it as 
soon as the new coordinator is discovered, without applying any backoff.
   
   Does that make sense? I will update to make sure we always set a backoff, but 
with a custom 0 backoff for these cases, as originally intended.
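   
   In code, the idea might look roughly like this (a sketch of the intent only, 
not the actual patch; the error constants and RequestState helpers are assumed 
to be the existing ones):

```java
// Sketch: errors that imply forward progress skip the exponential backoff.
void onHeartbeatFailure(final Errors error, final long currentTimeMs) {
    if (error == Errors.FENCED_MEMBER_EPOCH || error == Errors.NOT_COORDINATOR) {
        // A rejoin or a coordinator rediscovery follows, so clear the backoff
        // and let the next (different) request go out immediately.
        heartbeatRequestState.reset();
    } else {
        // Everything else retries the same request: back off as usual.
        heartbeatRequestState.onFailedAttempt(currentTimeMs);
    }
}
```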



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (KAFKA-16227) Console consumer fails with `IllegalStateException`

2024-02-06 Thread Lianet Magrans (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814894#comment-17814894
 ] 

Lianet Magrans commented on KAFKA-16227:


[~kirktrue] Seems like a similar challenge came up in the context of KIP-951; 
https://issues.apache.org/jira/browse/KAFKA-15824 might be helpful when we get 
to this one.

> Console consumer fails with `IllegalStateException`
> ---
>
> Key: KAFKA-16227
> URL: https://issues.apache.org/jira/browse/KAFKA-16227
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients
>Affects Versions: 3.7.0
>Reporter: David Jacot
>Assignee: Kirk True
>Priority: Major
>
> I have seen a few occurrences like the following one. There is a race between 
> the background thread and the foreground thread. I imagine the following 
> steps:
>  * quickstart-events-2 is assigned by the background thread;
>  * the foreground thread starts the initialization of the partition (e.g. 
> reset offset);
>  * quickstart-events-2 is removed by the background thread;
>  * the initialization completes and quickstart-events-2 does not exist 
> anymore.
>  
> {code:java}
> [2024-02-06 16:21:57,375] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> java.lang.IllegalStateException: No current assignment for partition 
> quickstart-events-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.updateHighWatermark(SubscriptionState.java:579)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.handleInitializeSuccess(FetchCollector.java:283)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.initialize(FetchCollector.java:226)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.collectFetch(FetchCollector.java:110)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.collectFetch(AsyncKafkaConsumer.java:1540)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.pollForFetches(AsyncKafkaConsumer.java:1525)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.poll(AsyncKafkaConsumer.java:711)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:874)
>   at 
> kafka.tools.ConsoleConsumer$ConsumerWrapper.receive(ConsoleConsumer.scala:473)
>   at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:103)
>   at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:77)
>   at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:54)
>   at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] KAFKA-14822: Allow restricting File and Directory ConfigProviders to specific paths [kafka]

2024-02-06 Thread via GitHub


tinaselenge commented on code in PR #14995:
URL: https://github.com/apache/kafka/pull/14995#discussion_r1480171669


##
clients/src/main/java/org/apache/kafka/common/config/provider/FileConfigProvider.java:
##
@@ -40,7 +42,13 @@ public class FileConfigProvider implements ConfigProvider {
 
 private static final Logger log = LoggerFactory.getLogger(FileConfigProvider.class);
 
+public static final String ALLOWED_PATHS_CONFIG = "allowed.paths";
+public static final String ALLOWED_PATHS_DOC = "A comma separated list of paths that this config provider is " +
+"allowed to access. If not set, all paths are allowed.";
+private AllowedPaths allowedPaths = null;

Review Comment:
   @gharris1727 you are right. The mock classes extending the provider class 
needed to be updated, as they didn't call `configure()`, causing test failures.
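   
   For context, a usage sketch of the option this PR introduces (the config key 
comes from the diff above; the paths and file names are hypothetical):

```java
import java.util.Collections;
import org.apache.kafka.common.config.ConfigData;
import org.apache.kafka.common.config.provider.FileConfigProvider;

public class AllowedPathsExample {
    public static void main(String[] args) {
        FileConfigProvider provider = new FileConfigProvider();
        // Restrict the provider to a single directory (hypothetical path).
        provider.configure(Collections.singletonMap("allowed.paths", "/etc/kafka/secrets"));
        // Reading a properties file under an allowed path works as before...
        ConfigData data = provider.get("/etc/kafka/secrets/credentials.properties");
        System.out.println(data.data().keySet());
        // ...while a path outside the allowed list would be rejected.
    }
}
```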



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (KAFKA-16227) Console consumer fails with `IllegalStateException`

2024-02-06 Thread David Jacot (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Jacot updated KAFKA-16227:

Affects Version/s: 3.7.0

> Console consumer fails with `IllegalStateException`
> ---
>
> Key: KAFKA-16227
> URL: https://issues.apache.org/jira/browse/KAFKA-16227
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients
>Affects Versions: 3.7.0
>Reporter: David Jacot
>Assignee: Kirk True
>Priority: Major
>
> I have seen a few occurrences like the following one. There is a race between 
> the background thread and the foreground thread. I imagine the following 
> steps:
>  * quickstart-events-2 is assigned by the background thread;
>  * the foreground thread starts the initialization of the partition (e.g. 
> reset offset);
>  * quickstart-events-2 is removed by the background thread;
>  * the initialization completes and quickstart-events-2 does not exist 
> anymore.
>  
> {code:java}
> [2024-02-06 16:21:57,375] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> java.lang.IllegalStateException: No current assignment for partition 
> quickstart-events-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.updateHighWatermark(SubscriptionState.java:579)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.handleInitializeSuccess(FetchCollector.java:283)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.initialize(FetchCollector.java:226)
>   at 
> org.apache.kafka.clients.consumer.internals.FetchCollector.collectFetch(FetchCollector.java:110)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.collectFetch(AsyncKafkaConsumer.java:1540)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.pollForFetches(AsyncKafkaConsumer.java:1525)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.poll(AsyncKafkaConsumer.java:711)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:874)
>   at 
> kafka.tools.ConsoleConsumer$ConsumerWrapper.receive(ConsoleConsumer.scala:473)
>   at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:103)
>   at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:77)
>   at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:54)
>   at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala) {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] KAFKA-15561: Client support for new SubscriptionPattern based subscription [kafka]

2024-02-06 Thread via GitHub


Phuc-Hong-Tran commented on PR #15188:
URL: https://github.com/apache/kafka/pull/15188#issuecomment-1930102760

   @cadonna Just for clarification: when we were talking about "implement and 
test everything up to the point where the field is populated", does that mean 
we're not going to implement and test the part where the client receives the 
assignment from the broker at this stage? (I'm mostly blocked on this part)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15761: KRaft support in EpochDrivenReplicationProtocolAcceptanceTest [kafka]

2024-02-06 Thread via GitHub


highluck commented on PR #15295:
URL: https://github.com/apache/kafka/pull/15295#issuecomment-1930100354

   @mimaison 
   I tried this and that, but I think I need to study a little more.
   After finishing this, would it be okay to do follow-up PR work on the 
`shouldFollowLeaderEpochBasicWorkflow` function?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation 
in the java-client to skip the backoff period if the client knows of a newer 
leader for a produce batch being retried.
h1. What changed

The implementation introduced a regression noticed on a Trogdor benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using 
the synchronised method Metadata.currentLeader(). This has led to increased 
synchronization between the KafkaProducer application thread that calls 
send() and the background thread that actively sends producer batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs the baseline. 
Note that the synchronisation is much worse for partitionReady() in this 
benchmark, as it is called for each partition, and there are 36k partitions!
h3. Lock Profile: Kafka-15415

!kafka_15415_lock_profile.png!
h3. Lock Profile: Baseline

!baseline_lock_profile.png!
h1. Fix

Synchronization between the 2 threads has to be reduced in order to address 
this. [https://github.com/apache/kafka/pull/15323] is a fix for it, as it 
avoids using Metadata.currentLeader() and instead relies on Cluster.leaderFor().

With the fix, lock-profile & metrics are similar to baseline.
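
As a rough illustration of the fix's direction (a sketch assuming the drain 
path already receives an immutable Cluster snapshot; this is not the PR diff):

{code:java}
// Reading the Cluster snapshot takes no lock, unlike the synchronized
// Metadata#currentLeader() that the regression introduced per partition.
boolean readyToDrain(final Cluster cluster, final TopicPartition tp) {
    final Node leader = cluster.leaderFor(tp);  // org.apache.kafka.common.Cluster
    if (leader == null) {
        // Leader unknown: leave the batches queued until metadata refreshes.
        return false;
    }
    return true;  // drain ready batches for this leader
}
{code}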

 

  was:
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
java-client to skip backoff period if client knows of a newer leader, for 
produce-batch being retried.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts(36000!).
With regression, following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384] 
RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronization between KafkaProducer's application-thread that call send(), 
and background-thread that actively send producer-batches to leaders.

See lock profiles that clearly show increased synchronisation in KAFKA-15415 
PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the 
synchronisation is much worse for paritionReady() in this benchmark as its 
called for each partition, and it has 36k partitions!
h3. Lock Profile: Kafka-15415

!kafka_15415_lock_profile.png!
h3. Lock Profile: Baseline

!baseline_lock_profile.png!
h1. Fix

Synchronization has to be reduced between 2 threads in order to address this. 
[https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids 
using Metadata.currentLeader() instead rely on Cluster.leaderFor().

 


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
> java-client to skip backoff period if client knows of a newer leader, for 
> produce-batch being retried.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts(36000!).
> With regression, following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384] 
> RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronization between KafkaProducer's application-thread that call send(), 
> and background-thread that actively send 

[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
java-client to skip backoff period if client knows of a newer leader, for 
produce-batch being retried.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts(36000!).
With regression, following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384] 
RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronization between KafkaProducer's application-thread that call send(), 
and background-thread that actively send producer-batches to leaders.

See lock profiles that clearly show increased synchronisation in KAFKA-15415 
PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the 
synchronisation is much worse for paritionReady() in this benchmark as its 
called for each partition, and it has 36k partitions!
h3. Lock Profile: Kafka-15415

!kafka_15415_lock_profile.png!
h3. Lock Profile: Baseline

!baseline_lock_profile.png!
h1. Fix

Synchronization has to be reduced between 2 threads in order to address this. 
[https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids 
using Metadata.currentLeader() instead rely on Cluster.leaderFor().

 

  was:
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
java-client to skip backoff period if client knows of a newer leader, for 
produce-batch being retried.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts(36000!).
With regression, following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384] 
RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronization between KafkaProducer's application-thread that call send(), 
and background-thread that actively send producer-batches to leaders.

See lock profiles that clearly show increased synchronisation in KAFKA-15415 
PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the 
synchronisation is much worse for paritionReady() in this benchmark as its 
called for each partition, and it has 36k partitions!
h2. Lock Profile: Kafka-15415

!kafka_15415_lock_profile.png!
h2. Lock Profile: Baseline

!baseline_lock_profile.png!
h1. Fix

Synchronization has to be reduced between 2 threads in order to address this. 
[https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids 
using Metadata.currentLeader() instead rely on Cluster.leaderFor().

 


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
> java-client to skip backoff period if client knows of a newer leader, for 
> produce-batch being retried.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts(36000!).
> With regression, following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384] 
> RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronization between KafkaProducer's application-thread that call send(), 
> and background-thread that actively send producer-batches to leaders.
> See lock profiles that clearly show 

[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Attachment: baseline_lock_profile.png

> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
> java-client to skip backoff period if client knows of a newer leader, for 
> produce-batch being retried.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts(36000!).
> With regression, following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384] 
> RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronization between KafkaProducer's application-thread that call send(), 
> and background-thread that actively send producer-batches to leaders.
> See lock profiles that clearly show increased synchronisation in KAFKA-15415 
> PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the 
> synchronisation is much worse for paritionReady() in this benchmark as its 
> called for each partition, and it has 36k partitions!
> h2. Lock Profile: Kafka-15415
> !Screenshot 2024-02-01 at 11.06.36.png!
> h2. Lock Profile: Baseline
> !image-20240201-105752.png!
> h1. Fix
> Synchronization has to be reduced between 2 threads in order to address this. 
> [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids 
> using Metadata.currentLeader() instead rely on Cluster.leaderFor().
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
java-client to skip backoff period if client knows of a newer leader, for 
produce-batch being retried.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts(36000!).
With regression, following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384] 
RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronization between KafkaProducer's application-thread that call send(), 
and background-thread that actively send producer-batches to leaders.

See lock profiles that clearly show increased synchronisation in KAFKA-15415 
PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the 
synchronisation is much worse for paritionReady() in this benchmark as its 
called for each partition, and it has 36k partitions!
h2. Lock Profile: Kafka-15415

!kafka_15415_lock_profile.png!
h2. Lock Profile: Baseline

!baseline_lock_profile.png!
h1. Fix

Synchronization has to be reduced between 2 threads in order to address this. 
[https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids 
using Metadata.currentLeader() instead rely on Cluster.leaderFor().

 

  was:
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
java-client to skip backoff period if client knows of a newer leader, for 
produce-batch being retried.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts(36000!).
With regression, following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384] 
RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronization between KafkaProducer's application-thread that call send(), 
and background-thread that actively send producer-batches to leaders.

See lock profiles that clearly show increased synchronisation in KAFKA-15415 
PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the 
synchronisation is much worse for paritionReady() in this benchmark as its 
called for each partition, and it has 36k partitions!
h2. Lock Profile: Kafka-15415

!Screenshot 2024-02-01 at 11.06.36.png!
h2. Lock Profile: Baseline

!image-20240201-105752.png!
h1. Fix

Synchronization has to be reduced between 2 threads in order to address this. 
[https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids 
using Metadata.currentLeader() instead rely on Cluster.leaderFor().

 


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
> java-client to skip backoff period if client knows of a newer leader, for 
> produce-batch being retried.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts(36000!).
> With regression, following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384] 
> RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronization between KafkaProducer's application-thread that call send(), 
> and background-thread that actively send producer-batches to leaders.
> See lock profiles that clearly 

[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Attachment: kafka_15415_lock_profile.png

> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: baseline_lock_profile.png, kafka_15415_lock_profile.png
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
> java-client to skip backoff period if client knows of a newer leader, for 
> produce-batch being retried.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts(36000!).
> With regression, following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384] 
> RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronization between KafkaProducer's application-thread that call send(), 
> and background-thread that actively send producer-batches to leaders.
> See lock profiles that clearly show increased synchronisation in KAFKA-15415 
> PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the 
> synchronisation is much worse for paritionReady() in this benchmark as its 
> called for each partition, and it has 36k partitions!
> h2. Lock Profile: Kafka-15415
> !Screenshot 2024-02-01 at 11.06.36.png!
> h2. Lock Profile: Baseline
> !image-20240201-105752.png!
> h1. Fix
> Synchronization has to be reduced between 2 threads in order to address this. 
> [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids 
> using Metadata.currentLeader() instead rely on Cluster.leaderFor().
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Attachment: (was: image-20240201-105752.png)

> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented optimisation in 
> java-client to skip backoff period if client knows of a newer leader, for 
> produce-batch being retried.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts(36000!).
> With regression, following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384] 
> RecordAccmulator.partitionReady() & drainBatchesForOneNode() started using 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronization between KafkaProducer's application-thread that call send(), 
> and background-thread that actively send producer-batches to leaders.
> See lock profiles that clearly show increased synchronisation in KAFKA-15415 
> PR(highlighted in {color:#de350b}Red{color}) Vs baseline. Note the 
> synchronisation is much worse for paritionReady() in this benchmark as its 
> called for each partition, and it has 36k partitions!
> h2. Lock Profile: Kafka-15415
> !Screenshot 2024-02-01 at 11.06.36.png!
> h2. Lock Profile: Baseline
> !image-20240201-105752.png!
> h1. Fix
> Synchronization has to be reduced between 2 threads in order to address this. 
> [https://github.com/apache/kafka/pull/15323] is a fix for it, as it avoids 
> using Metadata.currentLeader() instead rely on Cluster.leaderFor().
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Attachment: (was: Screenshot 2024-02-01 at 11.06.36.png)

> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384], 
> RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronisation between the KafkaProducer's application thread that calls 
> send() and the background thread that actively sends producer-batches to 
> leaders.
> See the lock profiles that clearly show increased synchronisation in the 
> KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
> that the synchronisation is much worse for partitionReady() in this benchmark, 
> as it is called for each partition, and there are 36k partitions!
> h2. Lock Profile: Kafka-15415
> !Screenshot 2024-02-01 at 11.06.36.png!
> h2. Lock Profile: Baseline
> !image-20240201-105752.png!
> h1. Fix
> Synchronisation between the two threads has to be reduced in order to address 
> this. [https://github.com/apache/kafka/pull/15323] is a fix for it, as it 
> avoids Metadata.currentLeader() and relies on Cluster.leaderFor() instead.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!
h2. Lock Profile: Kafka-15415

!Screenshot 2024-02-01 at 11.06.36.png!
h2. Lock Profile: Baseline

!image-20240201-105752.png!
h1. Fix

Synchronisation between the two threads has to be reduced in order to address 
this. [https://github.com/apache/kafka/pull/15323] is a fix for it, as it 
avoids Metadata.currentLeader() and relies on Cluster.leaderFor() instead.

 

  was:
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!
h2. Lock Profile: Kafka-15415

!Screenshot 2024-02-01 at 11.06.36.png!
h2. Lock Profile: Baseline

!image-20240201-105752.png!
h1. Fix

 

 


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: Screenshot 2024-02-01 at 11.06.36.png, 
> image-20240201-105752.png
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384], 
> RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronisation between the KafkaProducer's application thread that calls 
> send() and the background thread that actively sends producer-batches to 
> leaders.
> See the lock profiles that clearly show increased synchronisation in the 
> KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
> that the synchronisation is much worse for partitionReady() in this benchmark, 
> as it is 
> called 

[jira] [Created] (KAFKA-16227) Console consumer fails with `IllegalStateException`

2024-02-06 Thread David Jacot (Jira)
David Jacot created KAFKA-16227:
---

 Summary: Console consumer fails with `IllegalStateException`
 Key: KAFKA-16227
 URL: https://issues.apache.org/jira/browse/KAFKA-16227
 Project: Kafka
  Issue Type: Sub-task
  Components: clients
Reporter: David Jacot
Assignee: Kirk True


I have seen a few occurrences like the following one. There is a race between 
the background thread and the foreground thread. I imagine the following steps:
 * quickstart-events-2 is assigned by the background thread;
 * the foreground thread starts the initialization of the partition (e.g. reset 
offset);
 * quickstart-events-2 is removed by the background thread;
 * the initialization completes and quickstart-events-2 does not exist anymore.

 
{code:java}
[2024-02-06 16:21:57,375] ERROR Error processing message, terminating consumer 
process:  (kafka.tools.ConsoleConsumer$)
java.lang.IllegalStateException: No current assignment for partition 
quickstart-events-2
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:367)
at 
org.apache.kafka.clients.consumer.internals.SubscriptionState.updateHighWatermark(SubscriptionState.java:579)
at 
org.apache.kafka.clients.consumer.internals.FetchCollector.handleInitializeSuccess(FetchCollector.java:283)
at 
org.apache.kafka.clients.consumer.internals.FetchCollector.initialize(FetchCollector.java:226)
at 
org.apache.kafka.clients.consumer.internals.FetchCollector.collectFetch(FetchCollector.java:110)
at 
org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.collectFetch(AsyncKafkaConsumer.java:1540)
at 
org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.pollForFetches(AsyncKafkaConsumer.java:1525)
at 
org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.poll(AsyncKafkaConsumer.java:711)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:874)
at 
kafka.tools.ConsoleConsumer$ConsumerWrapper.receive(ConsoleConsumer.scala:473)
at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:103)
at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:77)
at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:54)
at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala) {code}
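To make the interleaving concrete, here is a minimal sketch of the race; the 
class below is a hypothetical stand-in for the real SubscriptionState, not the 
actual client code:

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the race: the background thread revokes a
// partition while the foreground thread is still finishing that
// partition's initialization, so the final update finds no assignment.
public class AssignmentRaceSketch {
    private final Set<String> assignment = ConcurrentHashMap.newKeySet();

    // Background thread: assigns and later removes quickstart-events-2.
    void assign(String tp) { assignment.add(tp); }
    void revoke(String tp) { assignment.remove(tp); }

    // Foreground thread: called at the end of partition initialization.
    // Throwing here mirrors the IllegalStateException in the stack trace;
    // a defensive fix would re-check membership and skip the update.
    void updateHighWatermark(String tp, long highWatermark) {
        if (!assignment.contains(tp)) {
            throw new IllegalStateException(
                "No current assignment for partition " + tp);
        }
        // ... apply the high watermark to the assigned partition state ...
    }
}
{code}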
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] MINOR:Type Casting Correction AND Null Pointer Exception (NPE) Defense [kafka]

2024-02-06 Thread via GitHub


highluck commented on PR #9786:
URL: https://github.com/apache/kafka/pull/9786#issuecomment-1930075970

   @mimaison 
   
   Can you review this? thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!
h2. Lock Profile: Kafka-15415

!Screenshot 2024-02-01 at 11.06.36.png!
h2. Lock Profile: Baseline

!image-20240201-105752.png!
h1. Fix

 

 

  was:
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!
h2. Lock Profile: Kafka-15415

!Screenshot 2024-02-01 at 11.06.36.png!
h2. Lock Profile: Baseline

!image-20240201-105752.png!
h1. Fix

 

 


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: Screenshot 2024-02-01 at 11.06.36.png, 
> image-20240201-105752.png
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384], 
> RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronisation between the KafkaProducer's application thread that calls 
> send() and the background thread that actively sends producer-batches to 
> leaders.
> See the lock profiles that clearly show increased synchronisation in the 
> KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
> that the synchronisation is much worse for partitionReady() in this benchmark, 
> as it is called for each partition, and there are 36k partitions!
> h2. Lock Profile: Kafka-15415
> !Screenshot 2024-02-01 at 11.06.36.png!
> h2. Lock Profile: Baseline
> !image-20240201-105752.png!
> h1. Fix
>  
>  



--
This 

[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!
h2. Lock Profile: Kafka-15415

!Screenshot 2024-02-01 at 11.06.36.png!
h2. Lock Profile: Baseline

!image-20240201-105752.png!
h1. Fix

 

 

  was:
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!
h2. Lock Profile: Kafka-15415

!Screenshot 2024-02-01 at 11.06.36.png!
h2. Lock Profile: Baseline

!image-20240201-105752.png!
h1. Fix

 

 


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: Screenshot 2024-02-01 at 11.06.36.png, 
> image-20240201-105752.png
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384], 
> RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronisation between the KafkaProducer's application thread that calls 
> send() and the background thread that actively sends producer-batches to 
> leaders.
> See the lock profiles that clearly show increased synchronisation in the 
> KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
> that the synchronisation is much worse for partitionReady() in this benchmark, 
> as it is called for each partition, and there are 36k partitions!
> h2. Lock Profile: Kafka-15415
> !Screenshot 2024-02-01 at 11.06.36.png!
> h2. Lock Profile: Baseline
> !image-20240201-105752.png!
> h1. Fix
>  
>  



--
This message 

[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
h1. Background

https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!
h2. Lock Profile: Kafka-15415

!Screenshot 2024-02-01 at 11.06.36.png!
h2. Lock Profile: Baseline

!image-20240201-105752.png!
h1. Fix

 

 

  was:
h1. Background


https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!
h2. Lock Profile: Kafka-15415

!Screenshot 2024-02-01 at 11.06.36.png!
h2. Lock Profile: Baseline

!image-20240201-105752.png!
h1. Fix


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: Screenshot 2024-02-01 at 11.06.36.png, 
> image-20240201-105752.png
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384], 
> RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronisation between the KafkaProducer's application thread that calls 
> send() and the background thread that actively sends producer-batches to 
> leaders.
> See the lock profiles that clearly show increased synchronisation in the 
> KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
> that the synchronisation is much worse for partitionReady() in this benchmark, 
> as it is called for each partition, and there are 36k partitions!
> h2. Lock Profile: Kafka-15415
> !Screenshot 2024-02-01 at 11.06.36.png!
> h2. Lock Profile: Baseline
> !image-20240201-105752.png!
> h1. Fix
>  
>  



--
This 

[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Attachment: image-20240201-105752.png

> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: Screenshot 2024-02-01 at 11.06.36.png, 
> image-20240201-105752.png
>
>
> Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> How it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384], 
> RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronisation between the KafkaProducer's application thread that calls 
> send() and the background thread that actively sends producer-batches to 
> leaders.
> See the lock profiles that clearly show increased synchronisation in the 
> KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
> that the synchronisation is much worse for partitionReady() in this benchmark, 
> as it is called for each partition, and there are 36k partitions!
> Fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
h1. Background


https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.
h1. What changed

The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

h1. Why it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!
h2. Lock Profile: Kafka-15415

!Screenshot 2024-02-01 at 11.06.36.png!
h2. Lock Profile: Baseline

!image-20240201-105752.png!
h1. Fix

  was:
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.

What changed
The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

How it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!

Fix


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: Screenshot 2024-02-01 at 11.06.36.png, 
> image-20240201-105752.png
>
>
> h1. Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> h1. What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> h1. Why it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384], 
> RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronisation between the KafkaProducer's application thread that calls 
> send() and the background thread that actively sends producer-batches to 
> leaders.
> See the lock profiles that clearly show increased synchronisation in the 
> KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
> that the synchronisation is much worse for partitionReady() in this benchmark, 
> as it is called for each partition, and there are 36k partitions!
> h2. Lock Profile: Kafka-15415
> !Screenshot 2024-02-01 at 11.06.36.png!
> h2. Lock Profile: Baseline
> !image-20240201-105752.png!
> h1. Fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] KAFKA-16066: Upgrade apacheds to 2.0.0.AM27 With apache kerby [kafka]

2024-02-06 Thread via GitHub


OmniaGM commented on code in PR #15277:
URL: https://github.com/apache/kafka/pull/15277#discussion_r1480038350


##
core/src/test/scala/kafka/security/minikdc/MiniKdc.scala:
##
@@ -19,38 +19,22 @@
 package kafka.security.minikdc
 
 import java.io._
-import java.net.InetSocketAddress
 import java.nio.charset.StandardCharsets
 import java.nio.file.Files
 import java.text.MessageFormat
 import java.util.{Locale, Properties, UUID}
-
 import kafka.utils.{CoreUtils, Exit, Logging}
+import 
org.apache.directory.server.kerberos.shared.crypto.encryption.KerberosKeyFactory
 
 import scala.jdk.CollectionConverters._
-import org.apache.commons.lang.text.StrSubstitutor
-import org.apache.directory.api.ldap.model.entry.{DefaultEntry, Entry}
-import org.apache.directory.api.ldap.model.ldif.LdifReader
-import org.apache.directory.api.ldap.model.name.Dn
-import 
org.apache.directory.api.ldap.schema.extractor.impl.DefaultSchemaLdifExtractor
-import org.apache.directory.api.ldap.schema.loader.LdifSchemaLoader
-import org.apache.directory.api.ldap.schema.manager.impl.DefaultSchemaManager
-import org.apache.directory.server.constants.ServerDNConstants
-import org.apache.directory.server.core.DefaultDirectoryService
-import org.apache.directory.server.core.api.{CacheService, DirectoryService, 
InstanceLayout}
-import org.apache.directory.server.core.api.schema.SchemaPartition
-import org.apache.directory.server.core.kerberos.KeyDerivationInterceptor
-import org.apache.directory.server.core.partition.impl.btree.jdbm.{JdbmIndex, 
JdbmPartition}
-import org.apache.directory.server.core.partition.ldif.LdifPartition
-import org.apache.directory.server.kerberos.KerberosConfig
-import org.apache.directory.server.kerberos.kdc.KdcServer
-import 
org.apache.directory.server.kerberos.shared.crypto.encryption.KerberosKeyFactory
-import org.apache.directory.server.kerberos.shared.keytab.{Keytab, KeytabEntry}
-import org.apache.directory.server.protocol.shared.transport.{TcpTransport, 
UdpTransport}
-import org.apache.directory.server.xdbm.Index
-import org.apache.directory.shared.kerberos.KerberosTime
+import org.apache.kerby.kerberos.kerb.KrbException
+import org.apache.kerby.kerberos.kerb.identity.backend.BackendConfig
+import org.apache.kerby.kerberos.kerb.server.{KdcConfig, KdcConfigKey, 
SimpleKdcServer}
 import org.apache.kafka.common.utils.{Java, Utils}
-
+import org.apache.kerby.kerberos.kerb.`type`.KerberosTime
+import org.apache.kerby.kerberos.kerb.`type`.base.{EncryptionKey, 
PrincipalName}
+import org.apache.kerby.kerberos.kerb.keytab.{Keytab, KeytabEntry}
+import org.apache.kerby.util.NetworkUtil
 /**
   * Mini KDC based on Apache Directory Server that can be embedded in tests or 
used from command line as a standalone

Review Comment:
   @highluck I believe this Javadoc still needs to change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Attachment: Screenshot 2024-02-01 at 11.06.36.png

> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
> Attachments: Screenshot 2024-02-01 at 11.06.36.png
>
>
> Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> How it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384], 
> RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronisation between the KafkaProducer's application thread that calls 
> send() and the background thread that actively sends producer-batches to 
> leaders.
> See the lock profiles that clearly show increased synchronisation in the 
> KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
> that the synchronisation is much worse for partitionReady() in this benchmark, 
> as it is called for each partition, and there are 36k partitions!
> Fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.

What changed
The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

How it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384], 
RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
synchronised method Metadata.currentLeader(). This has led to increased 
synchronisation between the KafkaProducer's application thread that calls 
send() and the background thread that actively sends producer-batches to 
leaders.

See the lock profiles that clearly show increased synchronisation in the 
KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
that the synchronisation is much worse for partitionReady() in this benchmark, 
as it is called for each partition, and there are 36k partitions!

Fix

  was:
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.

What changed
The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

How it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384]

Fix


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> How it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384], 
> RecordAccumulator.partitionReady() & drainBatchesForOneNode() started using the 
> synchronised method Metadata.currentLeader(). This has led to increased 
> synchronisation between the KafkaProducer's application thread that calls 
> send() and the background thread that actively sends producer-batches to 
> leaders.
> See the lock profiles that clearly show increased synchronisation in the 
> KAFKA-15415 PR (highlighted in {color:#de350b}Red{color}) vs. baseline. Note 
> that the synchronisation is much worse for partitionReady() in this benchmark, 
> as it is called for each partition, and there are 36k partitions!
> Fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] KAFKA-16066: Upgrade apacheds to 2.0.0.AM27 With apache kerby [kafka]

2024-02-06 Thread via GitHub


highluck commented on PR #15277:
URL: https://github.com/apache/kafka/pull/15277#issuecomment-1930043014

   @divijvaidya @mimaison @OmniaGM
   
   Here’s a reminder!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: Add MetadataType metric from KIP-866 #15299 [kafka]

2024-02-06 Thread via GitHub


OmniaGM commented on PR #15306:
URL: https://github.com/apache/kafka/pull/15306#issuecomment-1930043205

   @cmccabe you left a 
[comment](https://github.com/apache/kafka/pull/15299#issuecomment-1922507934) 
on the original PR #15299 stating that you are moving the discussion for the 
KIP-866 metrics here, but the original PR has been committed and this PR seems 
to be just a duplicate of @mumrah's PR. Can you share some context when you 
have time, please? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.

What changed
The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

How it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384]

Fix

  was:
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.

What changed
The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

How it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384]

Fix


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> How it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384]
> Fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.

What changed
The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.
 # record-queue-time-avg: increased from 20ms to 30ms.
 # request-latency-avg: increased from 50ms to 100ms.

How it happened

As can be seen from the original 
[PR|https://github.com/apache/kafka/pull/14384]

Fix

  was:
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.

What changed
The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.

1. record_queue_time_avg

Regression Details


Fix


> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
>  # record-queue-time-avg: increased from 20ms to 30ms.
>  # request-latency-avg: increased from 50ms to 100ms.
> How it happened
> As can be seen from the original 
> [PR|https://github.com/apache/kafka/pull/14384]
> Fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Affects Version/s: 3.6.1
   3.7.0

> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
> 1. record_queue_time_avg
> Regression Details
> Fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Labels: kip-951  (was: )

> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
> 1. record_queue_time_avg
> Regression Details
> Fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Java client: Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Summary: Java client: Performance regression in Trogdor benchmark with high 
partition counts  (was: Performance regression in Trogdor benchmark with high 
partition counts)

> Java client: Performance regression in Trogdor benchmark with high partition 
> counts
> ---
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
> 1. record_queue_time_avg
> Regression Details
> Fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: 
Background
https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
the java-client to skip the backoff period for a produce-batch being retried 
when the client knows of a newer leader.

What changed
The implementation introduced a regression noticed on a trogdor-benchmark 
running with high partition counts (36000!).
With the regression, the following metrics changed on the produce side.

1. record_queue_time_avg

Regression Details


Fix

  was:For KIP-951, 


> Performance regression in Trogdor benchmark with high partition counts
> --
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> Background
> https://issues.apache.org/jira/browse/KAFKA-15415 implemented an optimisation in 
> the java-client to skip the backoff period for a produce-batch being retried 
> when the client knows of a newer leader.
> What changed
> The implementation introduced a regression noticed on a trogdor-benchmark 
> running with high partition counts (36000!).
> With the regression, the following metrics changed on the produce side.
> 1. record_queue_time_avg
> Regression Details
> Fix



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Description: For KIP-951,   (was: Right now in the java-client, 
producer-batches back off up to retry.backoff.ms (100ms by default). This Jira 
proposes that the backoff should be skipped if the client knows of a newer 
leader for the partition in a subsequent retry (typically through a refresh of 
partition-metadata via the Metadata RPC). This would help improve the latency 
of the produce-request around when partition leadership changes.)

> Performance regression in Trogdor benchmark with high partition counts
> --
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> For KIP-951, 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Issue Type: Bug  (was: Improvement)

> Performance regression in Trogdor benchmark with high partition counts
> --
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
> Fix For: 3.7.0, 3.6.1
>
>
> Right now in the java-client, producer-batches back off up to 
> retry.backoff.ms (100ms by default). This Jira proposes that the backoff 
> should be skipped if the client knows of a newer leader for the partition in 
> a subsequent retry (typically through a refresh of partition-metadata via the 
> Metadata RPC). This would help improve the latency of the produce-request 
> around when partition leadership changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16226) Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula updated KAFKA-16226:
--
Fix Version/s: 3.6.2
   3.8.0
   3.7.1
   (was: 3.7.0)
   (was: 3.6.1)

> Performance regression in Trogdor benchmark with high partition counts
> --
>
> Key: KAFKA-16226
> URL: https://issues.apache.org/jira/browse/KAFKA-16226
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
> Fix For: 3.6.2, 3.8.0, 3.7.1
>
>
> Right now in the java-client, producer-batches back off up to 
> retry.backoff.ms (100ms by default). This Jira proposes that the backoff 
> should be skipped if the client knows of a newer leader for the partition in 
> a subsequent retry (typically through a refresh of partition-metadata via the 
> Metadata RPC). This would help improve the latency of the produce-request 
> around when partition leadership changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16226) Performance regression in Trogdor benchmark with high partition counts

2024-02-06 Thread Mayank Shekhar Narula (Jira)
Mayank Shekhar Narula created KAFKA-16226:
-

 Summary: Performance regression in Trogdor benchmark with high 
partition counts
 Key: KAFKA-16226
 URL: https://issues.apache.org/jira/browse/KAFKA-16226
 Project: Kafka
  Issue Type: Improvement
  Components: clients
Reporter: Mayank Shekhar Narula
Assignee: Mayank Shekhar Narula
 Fix For: 3.7.0, 3.6.1


Right now in the Java client, producer batches back off up to retry.backoff.ms 
(100ms by default). This Jira proposes that the backoff be skipped if the 
client knows of a newer leader for the partition in a subsequent retry 
(typically through a refresh of partition metadata via the Metadata RPC). This 
would help improve the latency of produce requests around the time partition 
leadership changes.
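
A minimal sketch of the proposed rule, with illustrative names only (this is 
not the actual RecordAccumulator/Sender code): a batch may retry before the 
backoff expires only if fresher metadata already points at a newer leader 
epoch than the one the failed attempt used.

{code}
import java.util.Optional;

final class RetryBackoffSketch {
    // True if the batch may be retried now: either the backoff elapsed, or a
    // metadata refresh revealed a leader epoch newer than the failed attempt's.
    static boolean canRetryNow(long nowMs, long lastAttemptMs, long retryBackoffMs,
                               int epochOfLastAttempt, Optional<Integer> currentLeaderEpoch) {
        boolean newLeaderKnown = currentLeaderEpoch.map(e -> e > epochOfLastAttempt).orElse(false);
        return newLeaderKnown || nowMs - lastAttemptMs >= retryBackoffMs;
    }

    public static void main(String[] args) {
        System.out.println(canRetryNow(50, 0, 100, 7, Optional.of(8))); // true: new leader known, skip backoff
        System.out.println(canRetryNow(50, 0, 100, 7, Optional.of(7))); // false: same leader, wait out backoff
    }
}
{code}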



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15970) KIP-951, port newly added tests in FetcherTest.java to FetchRequestManagerTest.java

2024-02-06 Thread Mayank Shekhar Narula (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Shekhar Narula resolved KAFKA-15970.
---
Resolution: Fixed

> KIP-951, port newly added tests in FetcherTest.java to 
> FetchRequestManagerTest.java
> ---
>
> Key: KAFKA-15970
> URL: https://issues.apache.org/jira/browse/KAFKA-15970
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Affects Versions: 3.7.0
>Reporter: Mayank Shekhar Narula
>Assignee: Mayank Shekhar Narula
>Priority: Major
>  Labels: kip-951
>
> Java client changes for 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-951%3A+Leader+discovery+optimisations+for+the+client



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16224) Fix handling of deleted topic when auto-committing before revocation

2024-02-06 Thread Lianet Magrans (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianet Magrans updated KAFKA-16224:
---
Description: 
Current logic for auto-committing offsets when partitions are revoked will 
retry continuously when getting UNKNOWN_TOPIC_OR_PARTITION, leading to the 
member not completing the revocation in time. We should consider this as an 
indication of the topic being deleted, and in the context of committing offsets 
to revoke partitions, we should abort the commit attempt and move on to 
complete and ack the revocation.  
While reviewing this, review the behaviour around this error for other commit 
operations as well in case a similar reasoning should be applied.
Note that legacy coordinator behaviour around this seems to be the same as the 
new consumer currently has.

  was:
Current logic for auto-committing offsets when partitions are revoked will 
retry continuously when getting UNKNOWN_TOPIC_OR_PARTITION, leading to the 
member not completing the revocation in time. We should consider this as an 
indication of the topic being deleted, and in the context of committing offsets 
to revoke partitions, we should abort the commit attempt and move on to 
complete and ack the revocation.  
While reviewing this, review the behaviour around this error for other commit 
operations as well in case a similar reasoning should be applied.


> Fix handling of deleted topic when auto-committing before revocation
> 
>
> Key: KAFKA-16224
> URL: https://issues.apache.org/jira/browse/KAFKA-16224
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Lianet Magrans
>Priority: Major
>  Labels: kip-848-client-support
>
> Current logic for auto-committing offsets when partitions are revoked will 
> retry continuously when getting UNKNOWN_TOPIC_OR_PARTITION, leading to the 
> member not completing the revocation in time. We should consider this as an 
> indication of the topic being deleted, and in the context of committing 
> offsets to revoke partitions, we should abort the commit attempt and move on 
> to complete and ack the revocation.  
> While reviewing this, review the behaviour around this error for other commit 
> operations as well in case a similar reasoning should be applied.
> Note that legacy coordinator behaviour around this seems to be the same as 
> the new consumer currently has.
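
A self-contained sketch of the proposed handling, with placeholder names (not 
the real commit/membership manager code): UNKNOWN_TOPIC_OR_PARTITION on the 
pre-revocation auto-commit is taken as "topic deleted", so the commit is 
abandoned and the revocation acknowledged, rather than retried until the 
rebalance times out.

{code}
final class RevocationCommitSketch {
    enum Error { NONE, UNKNOWN_TOPIC_OR_PARTITION, RETRIABLE }

    static void onAutoCommitBeforeRevocation(Error error, Runnable retryCommit, Runnable ackRevocation) {
        switch (error) {
            // Success, or topic deleted: retrying cannot help, complete the revocation.
            case NONE, UNKNOWN_TOPIC_OR_PARTITION -> ackRevocation.run();
            // Other retriable errors: keep retrying within the rebalance timeout.
            case RETRIABLE -> retryCommit.run();
        }
    }

    public static void main(String[] args) {
        onAutoCommitBeforeRevocation(Error.UNKNOWN_TOPIC_OR_PARTITION,
            () -> System.out.println("retry commit"),
            () -> System.out.println("revocation acked"));
    }
}
{code}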



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] working change [kafka]

2024-02-06 Thread via GitHub


msn-tldr opened a new pull request, #15323:
URL: https://github.com/apache/kafka/pull/15323

   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-14945: Add Serializer#serializeToByteBuffer() to reduce memory copying [kafka]

2024-02-06 Thread via GitHub


LinShunKang commented on PR #12685:
URL: https://github.com/apache/kafka/pull/12685#issuecomment-1929526582

   > @LinShunKang , sorry for being late. I had a quick look at #14617, it 
looks like the `ByteBufferSerializer#serialize` is a public API and cannot be 
changed without KIP. You know more than I do, so, in your opinion, what should 
we do from now? Could we not touch the `ByteBufferSerializer#serialize` and 
only implement `ByteBufferSerializer#serializeToByteBuffer`? Do you think we 
should include people in this [PR](https://github.com/apache/kafka/pull/14617) 
to this PR for discussion? Or do you have any other thoughts?
   
   We can't just implement `ByteBufferSerializer#serializeToByteBuffer`: once a 
`Serializer` implements that method, `Serializer#serialize` will never be 
called. And `ByteBufferSerializer#serialize` has obvious logical problems.
   
   I believe we should address the logical issues in 
`ByteBufferSerializer#serialize`, but doing so introduces breaking changes for 
existing users. The `Serializer` also should not have both `serialize` and 
`serializeToByteBuffer` methods at the same time. Therefore, I suggest we 
tackle these issues in Kafka 4.0, where we can change the signature of 
`Serializer`:
   from
   ```
   //before 4.0
   public interface Serializer<T> {

       byte[] serialize(String topic, T data);

       default byte[] serialize(String topic, Headers headers, T data) {
           return serialize(topic, data);
       }
   }
   ```
   to
   ```
   //since 4.0
   public interface Serializer<T> {

       ByteBuffer serialize(String topic, T data);

       ByteBuffer serialize(String topic, Headers headers, T data);
   }
   ```
   
   Then we could announce to existing users that we have changed the signature 
of `Serializer`.
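
   For readers following along: the underlying ambiguity here is plain 
`ByteBuffer` position/limit semantics. A rewind-based serializer reads from 
position 0 to the current limit, so the same buffer yields different bytes 
depending on whether the caller flipped it first. A JDK-only illustration 
(`readAfterRewind` is a hypothetical helper mirroring that behaviour, not 
Kafka code):
   ```
   import java.nio.ByteBuffer;
   import java.nio.charset.StandardCharsets;

   final class ByteBufferSemantics {
       // Reads position 0 .. limit, ignoring the caller's current position.
       static byte[] readAfterRewind(ByteBuffer buf) {
           buf.rewind();
           byte[] out = new byte[buf.remaining()];
           buf.get(out);
           return out;
       }

       public static void main(String[] args) {
           ByteBuffer written = ByteBuffer.allocate(8);
           written.put("abc".getBytes(StandardCharsets.UTF_8));

           // Not flipped: limit is still 8, so 5 trailing zero bytes are included.
           System.out.println(readAfterRewind(written.duplicate()).length); // 8

           // Flipped: limit becomes 3, exactly the written payload.
           ByteBuffer flipped = written.duplicate();
           flipped.flip();
           System.out.println(readAfterRewind(flipped).length); // 3
       }
   }
   ```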


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: Update LICENSE-binary file [kafka]

2024-02-06 Thread via GitHub


jlprat commented on PR #15322:
URL: https://github.com/apache/kafka/pull/15322#issuecomment-1929371251

   If you need to run this several times, you can omit the test suite; it will 
be faster :)
   ```
   ./gradlewAll clean releaseTarGz -x test
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] MINOR: Update LICENSE-binary file [kafka]

2024-02-06 Thread via GitHub


mimaison opened a new pull request, #15322:
URL: https://github.com/apache/kafka/pull/15322

   Tested with:
   ```
   $ ./gradlewAll clean releaseTarGz
   $ tar xzf core/build/distributions/kafka_2.13-3.8.0-SNAPSHOT.tgz
   $ cd kafka_2.13-3.8.0-SNAPSHOT/
   $ for f in $(ls libs | grep -v "^kafka\|connect\|trogdor"); do if ! grep -q ${f%.*} LICENSE; then echo "${f%.*} is missing in license file"; fi; done
   ```
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15561: Client support for new SubscriptionPattern based subscription [kafka]

2024-02-06 Thread via GitHub


Phuc-Hong-Tran commented on PR #15188:
URL: https://github.com/apache/kafka/pull/15188#issuecomment-1929317343

   @cadonna @lianetm, since we're supporting both subscribe methods, one using 
java.util.regex.Pattern and one using SubscriptionPattern, I think we should 
throw an illegal-heartbeat exception when the user tries to use both methods 
at the same time, and inform the user to use one at a time, since the field 
SubscribedRegex is used for java.util.regex.Pattern as well as 
SubscriptionPattern. What do you guys think?
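
   To make the suggestion concrete, an illustrative-only guard (the exception 
type and field names are placeholders, not the actual SubscriptionState 
internals):
   ```
   import java.util.regex.Pattern;

   final class SubscriptionGuard {
       private Pattern javaPattern;       // set by subscribe(Pattern, ...)
       private String serverSidePattern;  // set by subscribe(SubscriptionPattern, ...); modelled as a String here

       void subscribeWithJavaPattern(Pattern p) {
           if (serverSidePattern != null)
               throw new IllegalStateException(
                   "Pattern and SubscriptionPattern subscriptions cannot be mixed; use one at a time");
           this.javaPattern = p;
       }

       void subscribeWithServerSidePattern(String regex) {
           if (javaPattern != null)
               throw new IllegalStateException(
                   "Pattern and SubscriptionPattern subscriptions cannot be mixed; use one at a time");
           this.serverSidePattern = regex;
       }
   }
   ```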


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] MINOR: Upgrade maven artifact version to 3.9.6 [kafka]

2024-02-06 Thread via GitHub


mimaison merged PR #15309:
URL: https://github.com/apache/kafka/pull/15309


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-16181: Use incrementalAlterConfigs when updating broker configs by kafka-configs.sh [kafka]

2024-02-06 Thread via GitHub


dajac commented on code in PR #15304:
URL: https://github.com/apache/kafka/pull/15304#discussion_r1479611758


##
core/src/test/scala/integration/kafka/admin/ConfigCommandIntegrationTest.scala:
##
@@ -162,19 +174,198 @@ class ConfigCommandIntegrationTest extends QuorumTestHarness with Logging {
     assertThrows(classOf[ConfigException], () => alterConfigWithZk(configs, None, encoderConfigs))
 
     // Dynamic config updates using ZK should fail if broker is running.
-    registerBrokerInZk(brokerId.toInt)
+    registerBrokerInZk(zkClient, brokerId.toInt)
     assertThrows(classOf[IllegalArgumentException], () => alterConfigWithZk(Map("message.max.size" -> "21"), Some(brokerId)))
     assertThrows(classOf[IllegalArgumentException], () => alterConfigWithZk(Map("message.max.size" -> "22"), None))
 
     // Dynamic config updates using ZK for a different broker that is not running should succeed
     alterAndVerifyConfig(Map("message.max.size" -> "23"), Some("2"))
   }
 
-  private def registerBrokerInZk(id: Int): Unit = {
+  private def registerBrokerInZk(zkClient: kafka.zk.KafkaZkClient, id: Int): Unit = {
     zkClient.createTopLevelPaths()
     val securityProtocol = SecurityProtocol.PLAINTEXT
     val endpoint = new EndPoint("localhost", 9092, ListenerName.forSecurityProtocol(securityProtocol), securityProtocol)
     val brokerInfo = BrokerInfo(Broker(id, Seq(endpoint), rack = None), MetadataVersion.latestTesting, jmxPort = 9192)
     zkClient.registerBroker(brokerInfo)
   }
+
+  @ClusterTest
+  def testUpdateInvalidBrokersConfig(): Unit = {
+    checkInvalidBrokerConfig(None)
+    checkInvalidBrokerConfig(Some(cluster.anyBrokerSocketServer().config.brokerId.toString))
+  }
+
+  private def checkInvalidBrokerConfig(entityNameOrDefault: Option[String]): Unit = {
+    for (incremental <- Array(true, false)) {
+      val entityNameParams = entityNameOrDefault.map(name => Array("--entity-name", name)).getOrElse(Array("--entity-default"))
+      ConfigCommand.alterConfig(cluster.createAdminClient(), new ConfigCommandOptions(
+        Array("--bootstrap-server", s"${cluster.bootstrapServers()}",
+          "--alter",
+          "--add-config", "invalid=2",
+          "--entity-type", "brokers")
+          ++ entityNameParams
+      ), incremental)
+
+      val describeResult = TestUtils.grabConsoleOutput(
+        ConfigCommand.describeConfig(cluster.createAdminClient(), new ConfigCommandOptions(
+          Array("--bootstrap-server", s"${cluster.bootstrapServers()}",
+            "--describe",
+            "--entity-type", "brokers")
+            ++ entityNameParams
+        )))
+      // We will treat an unknown config as sensitive
+      assertTrue(describeResult.contains("sensitive=true"))
+      // Sensitive config values are not returned
+      assertTrue(describeResult.contains("invalid=null"))
+    }
+  }
+
+  @ClusterTest
+  def testUpdateInvalidTopicConfig(): Unit = {
+    TestUtils.createTopicWithAdminRaw(
+      admin = cluster.createAdminClient(),
+      topic = "test-config-topic",
+    )
+    assertInstanceOf(
+      classOf[InvalidConfigurationException],
+      assertThrows(
+        classOf[ExecutionException],
+        () => ConfigCommand.alterConfig(cluster.createAdminClient(), new ConfigCommandOptions(
+          Array("--bootstrap-server", s"${cluster.bootstrapServers()}",
+            "--alter",
+            "--add-config", "invalid=2",
+            "--entity-type", "topics",
+            "--entity-name", "test-config-topic")
+        ), true)).getCause
+    )
+  }
+
+  @ClusterTest
+  def testUpdateAndDeleteBrokersConfig(): Unit = {
+    checkBrokerConfig(None)
+    checkBrokerConfig(Some(cluster.anyBrokerSocketServer().config.brokerId.toString))
+  }
+
+  private def checkBrokerConfig(entityNameOrDefault: Option[String]): Unit = {
+    val entityNameParams = entityNameOrDefault.map(name => Array("--entity-name", name)).getOrElse(Array("--entity-default"))
+    ConfigCommand.alterConfig(cluster.createAdminClient(), new ConfigCommandOptions(
+      Array("--bootstrap-server", s"${cluster.bootstrapServers()}",
+        "--alter",
+        "--add-config", "log.cleaner.threads=2",
+        "--entity-type", "brokers")
+        ++ entityNameParams
+    ), true)
+    TestUtils.waitUntilTrue(
+      () => cluster.brokerSocketServers().asScala.forall(broker => broker.config.getInt("log.cleaner.threads") == 2),
+      "Timeout waiting for topic config propagating to broker")
+
+    val describeResult = TestUtils.grabConsoleOutput(
+      ConfigCommand.describeConfig(cluster.createAdminClient(), new ConfigCommandOptions(
+        Array("--bootstrap-server", s"${cluster.bootstrapServers()}",
+          "--describe",
+          "--entity-type", "brokers")
+          ++ entityNameParams
+      )))
+    assertTrue(describeResult.contains("log.cleaner.threads=2"))
+    assertTrue(describeResult.contains("sensitive=false"))
+
+
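
For context on what these tests exercise: the tool-side change moves broker 
config updates to the Admin API's `incrementalAlterConfigs` (the older 
`alterConfigs` is deprecated). A minimal standalone sketch of that call; the 
broker id `"1"` and the bootstrap address are assumptions for illustration:

```
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class IncrementalAlterExample {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.<String, Object>of("bootstrap.servers", "localhost:9092"))) {
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");

            // SET updates just this one property, leaving other dynamic configs untouched.
            AlterConfigOp set = new AlterConfigOp(
                new ConfigEntry("log.cleaner.threads", "2"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(broker, List.of(set))).all().get();

            // DELETE removes the dynamic override, reverting the property to its default.
            AlterConfigOp delete = new AlterConfigOp(
                new ConfigEntry("log.cleaner.threads", ""), AlterConfigOp.OpType.DELETE);
            admin.incrementalAlterConfigs(Map.of(broker, List.of(delete))).all().get();
        }
    }
}
```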

Re: [PR] KAFKA-15561: Client support for new SubscriptionPattern based subscription [kafka]

2024-02-06 Thread via GitHub


Phuc-Hong-Tran commented on code in PR #15188:
URL: https://github.com/apache/kafka/pull/15188#discussion_r1479598776


##
clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java:
##
@@ -84,6 +85,9 @@ private enum SubscriptionType {
     /* the pattern user has requested */
     private Pattern subscribedPattern;
 
+    /* we should rename this to something more specific */
+    private SubscriptionPattern subscriptionPattern;

Review Comment:
   Will do 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] KAFKA-15717: Added KRaft support in LeaderEpochIntegrationTest [kafka]

2024-02-06 Thread via GitHub


mimaison commented on PR #15225:
URL: https://github.com/apache/kafka/pull/15225#issuecomment-1929267278

   Done, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (KAFKA-15717) KRaft support in LeaderEpochIntegrationTest

2024-02-06 Thread Mickael Maison (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mickael Maison reassigned KAFKA-15717:
--

Assignee: appchemist

> KRaft support in LeaderEpochIntegrationTest
> ---
>
> Key: KAFKA-15717
> URL: https://issues.apache.org/jira/browse/KAFKA-15717
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Reporter: Sameer Tejani
>Assignee: appchemist
>Priority: Minor
>  Labels: kraft, kraft-test, newbie
> Fix For: 3.8.0
>
>
> The following tests in LeaderEpochIntegrationTest in 
> core/src/test/scala/unit/kafka/server/epoch/LeaderEpochIntegrationTest.scala 
> need to be updated to support KRaft
> 67 : def shouldAddCurrentLeaderEpochToMessagesAsTheyAreWrittenToLeader(): 
> Unit = {
> 99 : def shouldSendLeaderEpochRequestAndGetAResponse(): Unit = {
> 144 : def shouldIncreaseLeaderEpochBetweenLeaderRestarts(): Unit = {
> Scanned 305 lines. Found 0 KRaft tests out of 3 tests



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] KAFKA-16215; KAFKA-16178: Fix member not rejoining after error [kafka]

2024-02-06 Thread via GitHub


AndrewJSchofield commented on code in PR #15311:
URL: https://github.com/apache/kafka/pull/15311#discussion_r1479506790


##
clients/src/main/java/org/apache/kafka/clients/consumer/internals/RequestState.java:
##
@@ -106,6 +106,13 @@ public void onSendAttempt(final long currentTimeMs) {
         this.lastSentMs = currentTimeMs;
     }
 
+    /**
+     * Update the lastReceivedTime in milliseconds, indicating that a response has been received.
+     */
+    public void updateLastReceivedTime(final long lastReceivedMs) {
+        this.lastReceivedMs = lastReceivedMs;

Review Comment:
   You're probably right @dajac. Should we be applying exponential backoff for 
heartbeats? Does it only apply to a subset of errors? As a starting position, I 
would have said that any response or error updates the last receive time, and 
we would apply exponential backoff in all cases.
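
   For readers outside this PR, the mechanism in question is the client-side 
exponential backoff between request retries. A generic sketch under stated 
assumptions (base delay doubles per failed attempt, ±20% jitter, capped); 
this is not the actual RequestState/ExponentialBackoff implementation:
   ```
   import java.util.concurrent.ThreadLocalRandom;

   final class BackoffSketch {
       // Delay before the next attempt: base * 2^(attempts - 1), jittered, capped at maxMs.
       static long backoffMs(long baseMs, long maxMs, int attempts, double jitter) {
           double exp = baseMs * Math.pow(2, Math.max(0, attempts - 1));
           double jittered = exp * (1 - jitter + 2 * jitter * ThreadLocalRandom.current().nextDouble());
           return (long) Math.min(maxMs, jittered);
       }

       public static void main(String[] args) {
           // e.g. base 100ms, cap 1s, 20% jitter
           for (int attempt = 1; attempt <= 5; attempt++)
               System.out.println(backoffMs(100, 1000, attempt, 0.2));
       }
   }
   ```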



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (KAFKA-16190) Member should send full heartbeat when rejoining

2024-02-06 Thread Phuc Hong Tran (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814698#comment-17814698
 ] 

Phuc Hong Tran commented on KAFKA-16190:


[~lianetm] I understand, thanks for the explanation.

> Member should send full heartbeat when rejoining
> 
>
> Key: KAFKA-16190
> URL: https://issues.apache.org/jira/browse/KAFKA-16190
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, consumer
>Reporter: Lianet Magrans
>Assignee: Quoc Phong Dang
>Priority: Critical
>  Labels: client-transitions-issues, kip-848-client-support, newbie
> Fix For: 3.8.0
>
>
> The heartbeat request builder should make sure that all fields are sent in 
> the heartbeat request when the consumer rejoins (currently the 
> HeartbeatRequestManager request builder is reset on failure scenarios, which 
> should cover the fence+rejoin sequence). 
> Note that the existing HeartbeatRequestManagerTest.testHeartbeatState misses 
> this exact case given that it does explicitly change the subscription when it 
> gets fenced. We should ensure we test a consumer that keeps its same initial 
> subscription when it rejoins after being fenced.
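
A self-contained sketch of the intended behaviour, with illustrative names 
(not the real HeartbeatRequestManager state): on (re)join every field is 
populated, while a steady-state heartbeat omits unchanged fields (modelled 
here as null).

{code}
import java.util.List;

final class HeartbeatFieldsSketch {
    record Fields(String groupId, String memberId, int memberEpoch,
                  List<String> subscribedTopicNames) {}

    // rejoining == true forces a "full" heartbeat carrying every field.
    static Fields build(boolean rejoining, int epoch, String groupId, String memberId,
                        List<String> subscription, boolean subscriptionChanged) {
        boolean sendAll = rejoining || epoch == 0;
        return new Fields(groupId, memberId, epoch,
            (sendAll || subscriptionChanged) ? subscription : null); // null == "unchanged, omitted"
    }

    public static void main(String[] args) {
        // Rejoin after being fenced with the same subscription: fields must still be sent.
        System.out.println(build(true, 0, "grp", "member-1", List.of("topic-a"), false));
    }
}
{code}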



--
This message was sent by Atlassian Jira
(v8.20.10#820010)