[jira] [Commented] (KAFKA-10859) add @Test annotation to FileStreamSourceTaskTest.testInvalidFile and reduce the loop count to speedup the test
[ https://issues.apache.org/jira/browse/KAFKA-10859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409847#comment-17409847 ] Borzoo Esmailloo commented on KAFKA-10859: -- Hey [~chia7712], could you please take a look at my PR? :) > add @Test annotation to FileStreamSourceTaskTest.testInvalidFile and reduce > the loop count to speedup the test > -- > > Key: KAFKA-10859 > URL: https://issues.apache.org/jira/browse/KAFKA-10859 > Project: Kafka > Issue Type: Improvement >Reporter: Chia-Ping Tsai >Assignee: Tom Bentley >Priority: Major > Labels: newbie > > FileStreamSourceTaskTest.testInvalidFile miss a `@Test` annotation. Also, it > loops 100 times which spend about 2m to complete a unit test. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-13206) shutting down broker needs to stop fetching as a follower in KRaft mode
[ https://issues.apache.org/jira/browse/KAFKA-13206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HaiyuanZhao reassigned KAFKA-13206: --- Assignee: HaiyuanZhao > shutting down broker needs to stop fetching as a follower in KRaft mode > --- > > Key: KAFKA-13206 > URL: https://issues.apache.org/jira/browse/KAFKA-13206 > Project: Kafka > Issue Type: Bug > Components: core, kraft, replication >Affects Versions: 3.0.0 >Reporter: Jun Rao >Assignee: HaiyuanZhao >Priority: Major > Labels: kip-500 > > In the ZK mode, the controller will send a stopReplica(with deletion flag as > false) request to the shutting down broker so that it will stop the followers > from fetching. In KRaft mode, we don't have a corresponding logic. This means > unnecessary rejected fetch follower requests during controlled shutdown. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [kafka] ijuma opened a new pull request #11295: KAFKA-13270: Set JUTE_MAXBUFFER to 4 MB by default
ijuma opened a new pull request #11295: URL: https://github.com/apache/kafka/pull/11295 ZooKeeper 3.6.0 changed the default configuration for JUTE_MAXBUFFER from 4 MB to 1 MB. This causes a regression if Kafka tries to retrieve a large amount of data across many znodes – in such a case the ZooKeeper client will repeatedly emit a message of the form "java.io.IOException: Packet len <> is out of range". We restore the 3.4.x/3.5.x behavior unless the caller has set the property (note that ZKConfig auto configures itself if certain system properties have been set). I added a unit test that fails without the change and passes with it. See https://github.com/apache/zookeeper/pull/1129 for the details on why the behavior changed in 3.6.0. Credit to @rondagostino for finding and reporting this issue. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Description: Our KStream app offset stay stuck on 1 partition after outage possibly when exactly_once is enabled. Running with KStream 2.8, kafka broker 2.8, 3 brokers. commands topic is 10 partitions (replication 2, min-insync 2) command-expiry-store-changelog topic is 10 partitions (replication 2, min-insync 2) events topic is 10 partitions (replication 2, min-insync 2) with this topology Topologies: {code:java} Sub-topology: 0 Source: KSTREAM-SOURCE-00 (topics: [commands]) --> KSTREAM-TRANSFORM-01 Processor: KSTREAM-TRANSFORM-01 (stores: []) --> KSTREAM-TRANSFORM-02 <-- KSTREAM-SOURCE-00 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) --> KSTREAM-SINK-03 <-- KSTREAM-TRANSFORM-01 Sink: KSTREAM-SINK-03 (topic: events) <-- KSTREAM-TRANSFORM-02 {code} h3. h3. Attempt 1 at reproducing this issue Our stream app runs with processing.guarantee *exactly_once* After a Kafka test outage where all 3 brokers pod were deleted at the same time, Brokers restarted and initialized succesfuly. When restarting the topology above, one of the tasks would never initialize fully, the restore phase would keep outputting this messages every few minutes: {code:java} 2021-08-16 14:20:33,421 INFO stream-thread [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] Restoration in progress for 1 partitions. {commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, totalRestored=2002076} [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] (org.apache.kafka.streams.processor.internals.StoreChangelogReader) {code} Task for partition 8 would never initialize, no more data would be read from the source commands topic for that partition. In an attempt to recover, we restarted the stream app with stream processing.guarantee back to at_least_once, than it proceed with reading the changelog and restoring partition 8 fully. But we noticed afterward, for the next hour until we rebuilt the system, that partition 8 from command-expiry-store-changelog would not be cleaned/compacted by the log cleaner/compacter compared to other partitions. (could be unrelated, because we have seen that before) So we resorted to delete/recreate our command-expiry-store-changelog topic and events topic and regenerate it from the commands, reading from beginning. Things went back to normal h3. Attempt 2 at reproducing this issue kstream runs with *exactly-once* We force-deleted all 3 pod running kafka. After that, one of the partition can’t be restored. (like reported in previous attempt) For that partition, we noticed these logs on the broker {code:java} [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, command-expiry-store-changelog-9) while trying to send transaction markers for commands-processor-0_9, these partitions are likely deleted already and hence can be skipped (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} Then - we stop the kstream app, - restarted kafka brokers cleanly - Restarting the Kstream app, Those logs messages showed up on the kstream app log: {code:java} 2021-08-27 18:34:42,413 INFO [Consumer clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, groupId=commands-processor] The following partitions still have unstable offsets which are not cleared on the broker side: [commands-9], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} This would cause our processor to not consume from that specific source topic-partition. Deleting downstream topic and replaying data would NOT fix the issue (EXACTLY_ONCE or AT_LEAST_ONCE) Workaround found: Deleted the group associated with the processor, and restarted the kstream application, application went on to process data normally. (We have resigned to use AT_LEAST_ONCE for now ) KStream config : {code:java} StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest” StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now AT_LEAST_ONCE) producer.delivery.timeout.ms=12 consumer.session.timeout.ms=3 consumer.heartbeat.interval.ms=1 consumer.max.poll.interval.ms=30
[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Description: Our KStream app offset stay stuck on 1 partition after outage possibly when exactly_once is enabled. Running with KStream 2.8, kafka broker 2.8, 3 brokers. commands topic is 10 partitions (replication 2, min-insync 2) command-expiry-store-changelog topic is 10 partitions (replication 2, min-insync 2) events topic is 10 partitions (replication 2, min-insync 2) with this topology Topologies: {code:java} Sub-topology: 0 Source: KSTREAM-SOURCE-00 (topics: [commands]) --> KSTREAM-TRANSFORM-01 Processor: KSTREAM-TRANSFORM-01 (stores: []) --> KSTREAM-TRANSFORM-02 <-- KSTREAM-SOURCE-00 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) --> KSTREAM-SINK-03 <-- KSTREAM-TRANSFORM-01 Sink: KSTREAM-SINK-03 (topic: events) <-- KSTREAM-TRANSFORM-02 {code} h3. h3. Attempt 1 at reproducing this issue Our stream app runs with processing.guarantee *exactly_once* After a Kafka test outage where all 3 brokers pod were deleted at the same time, Brokers restarted and initialized succesfuly. When restarting the topology above, one of the tasks would never initialize fully, the restore phase would keep outputting this messages every few minutes: {code:java} 2021-08-16 14:20:33,421 INFO stream-thread [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] Restoration in progress for 1 partitions. {commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, totalRestored=2002076} [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] (org.apache.kafka.streams.processor.internals.StoreChangelogReader) {code} Task for partition 8 would never initialize, no more data would be read from the source commands topic for that partition. In an attempt to recover, we restarted the stream app with stream processing.guarantee back to at_least_once, than it proceed with reading the changelog and restoring partition 8 fully. But we noticed afterward, for the next hour until we rebuilt the system, that partition 8 from command-expiry-store-changelog would not be cleaned/compacted by the log cleaner/compacter compared to other partitions. (could be unrelated, because we have seen that before) So we resorted to delete/recreate our command-expiry-store-changelog topic and events topic and regenerate it from the commands, reading from beginning. Things went back to normal h3. Attempt 2 at reproducing this issue kstream runs with *exactly-once* We force-deleted all 3 pod running kafka. After that, one of the partition can’t be restored. (like reported in previous attempt) For that partition, we noticed these logs on the broker {code:java} [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, command-expiry-store-changelog-9) while trying to send transaction markers for commands-processor-0_9, these partitions are likely deleted already and hence can be skipped (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} Then - we stop the kstream app, - restarted kafka brokers cleanly - Restarting the Kstream app, Those logs messages showed up on the kstream app log: {code:java} 2021-08-27 18:34:42,413 INFO [Consumer clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, groupId=commands-processor] The following partitions still have unstable offsets which are not cleared on the broker side: [commands-9], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} This would cause our processor to not consume from that specific source topic-partition. Deleting downstream topic and replaying data would NOT fix the issue (EXACTLY_ONCE or AT_LEAST_ONCE) Workaround found: Deleted the group associated with the processor, and restarted the kstream application, application went on to process data normally. (We have resigned to use AT_LEAST_ONCE for now ) KStream config : {code:java} StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest” StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now AT_LEAST_ONCE) producer.delivery.timeout.ms=12 consumer.session.timeout.ms=3 consumer.heartbeat.interval.ms=1 consumer.max.poll.interval.ms=30
[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Description: Our KStream app offset stay stuck on 1 partition after outage possibly when exactly_once is enabled. Running with KStream 2.8, kafka broker 2.8, 3 brokers. commands topic is 10 partitions (replication 2, min-insync 2) command-expiry-store-changelog topic is 10 partitions (replication 2, min-insync 2) events topic is 10 partitions (replication 2, min-insync 2) with this topology Topologies: {code:java} Sub-topology: 0 Source: KSTREAM-SOURCE-00 (topics: [commands]) --> KSTREAM-TRANSFORM-01 Processor: KSTREAM-TRANSFORM-01 (stores: []) --> KSTREAM-TRANSFORM-02 <-- KSTREAM-SOURCE-00 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) --> KSTREAM-SINK-03 <-- KSTREAM-TRANSFORM-01 Sink: KSTREAM-SINK-03 (topic: events) <-- KSTREAM-TRANSFORM-02 {code} h3. h3. Attempt 1 at reproducing this issue Our stream app runs with processing.guarantee *exactly_once* After a Kafka test outage where all 3 brokers pod were deleted at the same time, Brokers restarted and initialized succesfuly. When restarting the topology above, one of the tasks would never initialize fully, the restore phase would keep outputting this messages every few minutes: {code:java} 2021-08-16 14:20:33,421 INFO stream-thread [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] Restoration in progress for 1 partitions. {commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, totalRestored=2002076} [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] (org.apache.kafka.streams.processor.internals.StoreChangelogReader) {code} Task for partition 8 would never initialize, no more data would be read from the source commands topic for that partition. In an attempt to recover, we restarted the stream app with stream processing.guarantee back to at_least_once, than it proceed with reading the changelog and restoring partition 8 fully. But we noticed afterward, for the next hour until we rebuilt the system, that partition 8 from command-expiry-store-changelog would not be cleaned/compacted by the log cleaner/compacter compared to other partitions. (could be unrelated, because we have seen that before) So we resorted to delete/recreate our command-expiry-store-changelog topic and events topic and regenerate it from the commands, reading from beginning. Things went back to normal h3. Attempt 2 at reproducing this issue We force-deleted all 3 pod running kafka. After that, one of the partition can’t be restored. (like reported in previous attempt) For that partition, we noticed these logs on the broker {code:java} [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, command-expiry-store-changelog-9) while trying to send transaction markers for commands-processor-0_9, these partitions are likely deleted already and hence can be skipped (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} Then - we stop the kstream app, - restarted kafka brokers cleanly - Restarting the Kstream app, Those logs messages showed up on the kstream app log: {code:java} 2021-08-27 18:34:42,413 INFO [Consumer clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, groupId=commands-processor] The following partitions still have unstable offsets which are not cleared on the broker side: [commands-9], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} This would cause our processor to not consume from that specific source topic-partition. Deleting downstream topic and replaying data would NOT fix the issue (EXACTLY_ONCE or AT_LEAST_ONCE) Workaround found: Deleted the group associated with the processor, and restarted the kstream application, application went on to process data normally. (We have resigned to use AT_LEAST_ONCE for now ) KStream config : {code:java} StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest” StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now AT_LEAST_ONCE) producer.delivery.timeout.ms=12 consumer.session.timeout.ms=3 consumer.heartbeat.interval.ms=1 consumer.max.poll.interval.ms=30 num.stream.threads=1{code} We will be
[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Summary: KStream offset stuck after brokers outage (was: KStream offset stuck after brokers outage when exactly_once enabled) > KStream offset stuck after brokers outage > - > > Key: KAFKA-13272 > URL: https://issues.apache.org/jira/browse/KAFKA-13272 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0 > Environment: Kafka running on Kubernetes > centos >Reporter: F Méthot >Priority: Major > > Our KStream app offset stay stuck on 1 partition after outage when > exactly_once is enabled. > Running with KStream 2.8, kafka broker 2.8, > 3 brokers. > commands topic is 10 partitions (replication 2, min-insync 2) > command-expiry-store-changelog topic is 10 partitions (replication 2, > min-insync 2) > events topic is 10 partitions (replication 2, min-insync 2) > with this topology > Topologies: > > {code:java} > Sub-topology: 0 > Source: KSTREAM-SOURCE-00 (topics: [commands]) > --> KSTREAM-TRANSFORM-01 > Processor: KSTREAM-TRANSFORM-01 (stores: []) > --> KSTREAM-TRANSFORM-02 > <-- KSTREAM-SOURCE-00 > Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) > --> KSTREAM-SINK-03 > <-- KSTREAM-TRANSFORM-01 > Sink: KSTREAM-SINK-03 (topic: events) > <-- KSTREAM-TRANSFORM-02 > {code} > h3. > h3. Attempt 1 at reproducing this issue > > Our stream app runs with processing.guarantee *exactly_once* > After a Kafka test outage where all 3 brokers pod were deleted at the same > time, > Brokers restarted and initialized succesfuly. > When restarting the topology above, one of the tasks would never initialize > fully, the restore phase would keep outputting this messages every few > minutes: > > {code:java} > 2021-08-16 14:20:33,421 INFO stream-thread > [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] > Restoration in progress for 1 partitions. > {commands-processor-expiry-store-changelog-8: position=11775908, > end=11775911, totalRestored=2002076} > [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] > (org.apache.kafka.streams.processor.internals.StoreChangelogReader) > {code} > Task for partition 8 would never initialize, no more data would be read from > the source commands topic for that partition. > > In an attempt to recover, we restarted the stream app with stream > processing.guarantee back to at_least_once, than it proceed with reading the > changelog and restoring partition 8 fully. > But we noticed afterward, for the next hour until we rebuilt the system, that > partition 8 from command-expiry-store-changelog would not be > cleaned/compacted by the log cleaner/compacter compared to other partitions. > (could be unrelated, because we have seen that before) > So we resorted to delete/recreate our command-expiry-store-changelog topic > and events topic and regenerate it from the commands, reading from beginning. > Things went back to normal > h3. Attempt 2 at reproducing this issue > We force-deleted all 3 pod running kafka. > After that, one of the partition can’t be restored. (like reported in > previous attempt) > For that partition, we noticed these logs on the broker > {code:java} > [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: > Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, > command-expiry-store-changelog-9) while trying to send transaction markers > for commands-processor-0_9, these partitions are likely deleted already and > hence can be skipped > (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} > Then > - we stop the kstream app, > - restarted kafka brokers cleanly > - Restarting the Kstream app, > Those logs messages showed up on the kstream app log: > > {code:java} > 2021-08-27 18:34:42,413 INFO [Consumer > clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, > groupId=commands-processor] The following partitions still have unstable > offsets which are not cleared on the broker side: [commands-9], this could be > either transactional offsets waiting for completion, or normal offsets > waiting for replication after appending to local log > [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) > > {code} > This would cause our processor to not consume from that specific source > topic-partition. > Deleting downstream topic and replaying data would NOT fix the issue > (EXACTLY_ONCE or AT_LEAST_ONCE) > Workaround found: > Deleted the group associated with
[GitHub] [kafka] dajac merged pull request #11086: KAFKA-13103: add REBALANCE_IN_PROGRESS error as retriable error for AlterConsumerGroupOffsetsHandler
dajac merged pull request #11086: URL: https://github.com/apache/kafka/pull/11086 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (KAFKA-13243) Differentiate metric latency measured in millis and nanos
[ https://issues.apache.org/jira/browse/KAFKA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409657#comment-17409657 ] Josep Prat commented on KAFKA-13243: Sent right now > Differentiate metric latency measured in millis and nanos > - > > Key: KAFKA-13243 > URL: https://issues.apache.org/jira/browse/KAFKA-13243 > Project: Kafka > Issue Type: Improvement > Components: metrics >Reporter: Guozhang Wang >Assignee: Josep Prat >Priority: Major > Labels: needs-kip > > Today most of the client latency metrics are measured in millis, and some in > nanos. For those measured in nanos we usually differentiate them by having a > `-ns` suffix in the metric names, e.g. `io-wait-time-ns-avg` and > `io-time-ns-avg`. But there are a few that we obviously forgot to follow this > pattern, e.g. `io-wait-time-total`: it is inconsistent where `avg` has `-ns` > suffix and `total` has not. I did a quick search and found just three of them: > * bufferpool-wait-time-total -> bufferpool-wait-time-ns-total > * io-wait-time-total -> io-wait-time-ns-total > * iotime-total -> io-time-ns-total (note that there are two inconsistencies > on naming, the average metric is `io-time-ns-avg` whereas total is > `iotime-total`, I suggest we use `io-time` instead of `iotime` for both). > We should change their name accordingly with the `-ns` suffix as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13243) Differentiate metric latency measured in millis and nanos
[ https://issues.apache.org/jira/browse/KAFKA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409653#comment-17409653 ] Guozhang Wang commented on KAFKA-13243: --- Thanks [~josep.prat] Could you also send an email for the KIP voting? Since it is a rather small KIP I think we do not need a DISCUSS thread. > Differentiate metric latency measured in millis and nanos > - > > Key: KAFKA-13243 > URL: https://issues.apache.org/jira/browse/KAFKA-13243 > Project: Kafka > Issue Type: Improvement > Components: metrics >Reporter: Guozhang Wang >Assignee: Josep Prat >Priority: Major > Labels: needs-kip > > Today most of the client latency metrics are measured in millis, and some in > nanos. For those measured in nanos we usually differentiate them by having a > `-ns` suffix in the metric names, e.g. `io-wait-time-ns-avg` and > `io-time-ns-avg`. But there are a few that we obviously forgot to follow this > pattern, e.g. `io-wait-time-total`: it is inconsistent where `avg` has `-ns` > suffix and `total` has not. I did a quick search and found just three of them: > * bufferpool-wait-time-total -> bufferpool-wait-time-ns-total > * io-wait-time-total -> io-wait-time-ns-total > * iotime-total -> io-time-ns-total (note that there are two inconsistencies > on naming, the average metric is `io-time-ns-avg` whereas total is > `iotime-total`, I suggest we use `io-time` instead of `iotime` for both). > We should change their name accordingly with the `-ns` suffix as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12634) Should checkpoint after restore finished
[ https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang updated KAFKA-12634: -- Labels: newbie++ (was: ) > Should checkpoint after restore finished > > > Key: KAFKA-12634 > URL: https://issues.apache.org/jira/browse/KAFKA-12634 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 2.5.0 >Reporter: Matthias J. Sax >Priority: Major > Labels: newbie++ > > For state stores, Kafka Streams maintains local checkpoint files to track the > offsets of the state store changelog topics. The checkpoint is updated on > commit or when a task is closed cleanly. > However, after a successful restore, the checkpoint is not written. Thus, if > an instance crashes after restore but before committing, even if the state is > on local disk the checkpoint file is missing (indicating that there is no > state) and thus state would be restored from scratch. > While for most cases, the time between restore end and next commit is small, > there are cases when this time could be large, for example if there is no new > input data to be processed (if there is no input data, the commit would be > skipped). > Thus, we should write the checkpoint file after a successful restore to close > this gap (or course, only for at-least-once processing). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [kafka] vamossagar12 commented on a change in pull request #11211: KAFKA-12960: Enforcing strict retention time for WindowStore and Sess…
vamossagar12 commented on a change in pull request #11211: URL: https://github.com/apache/kafka/pull/11211#discussion_r702070313 ## File path: streams/src/main/java/org/apache/kafka/streams/state/internals/MeteredWindowStore.java ## @@ -292,13 +408,46 @@ public V fetch(final K key, time); } +private long getActualWindowStartTime(final long timeFrom) { +return Math.max(timeFrom, ((PersistentWindowStore) wrapped()).getObservedStreamTime() - retentionPeriod + 1); +} + +private KeyValueIterator, V> filterExpiredRecords(final boolean forward) { +final KeyValueIterator, byte[]> allWindowedKeyValueIterator = forward ? wrapped().all() : wrapped().backwardAll(); + +final long observedStreamTime = ((PersistentWindowStore) wrapped()).getObservedStreamTime(); +if (!allWindowedKeyValueIterator.hasNext() || observedStreamTime == ConsumerRecord.NO_TIMESTAMP) +return new MeteredWindowedKeyValueIterator<>(allWindowedKeyValueIterator, fetchSensor, streamsMetrics, serdes, time); + +final long windowStartBoundary = observedStreamTime - retentionPeriod + 1; +final List, byte[]>> windowedKeyValuesInBoundary = new ArrayList<>(); + +while (allWindowedKeyValueIterator.hasNext()) { +final KeyValue, byte[]> next = allWindowedKeyValueIterator.next(); +if (next.key.window().endTime().isBefore(Instant.ofEpochMilli(windowStartBoundary))) { +continue; +} +windowedKeyValuesInBoundary.add(next); +} +return new MeteredWindowedKeyValueIterator<>(new WindowedKeyValueIterator(windowedKeyValuesInBoundary.iterator()), fetchSensor, streamsMetrics, serdes, time); +} Review comment: @mjsax i have a question here... In the jira ticket, you. have mentioned that the best place for adding this filtering is in the MeteredStore as that implicitly adds the logic even for custom state stores. While for the most part, this kind of filtering has worked fine(fetching relevant records and then filtering in MeteredStore) but there's a case where it's failing. It's for test. cases like `shouldNotThrowConcurrentModificationException` . This seems to be because the put() call while iterating is appending to the wrapped instance of iterator and hence it's not visible. Looking at this, do you think it would be a good idea to move this logic in the actual RocksDB implementations? Or do you think there's a better way to do it here in MeteredStore class itself? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Description: Our KStream app offset stay stuck on 1 partition after outage when exactly_once is enabled. Running with KStream 2.8, kafka broker 2.8, 3 brokers. commands topic is 10 partitions (replication 2, min-insync 2) command-expiry-store-changelog topic is 10 partitions (replication 2, min-insync 2) events topic is 10 partitions (replication 2, min-insync 2) with this topology Topologies: {code:java} Sub-topology: 0 Source: KSTREAM-SOURCE-00 (topics: [commands]) --> KSTREAM-TRANSFORM-01 Processor: KSTREAM-TRANSFORM-01 (stores: []) --> KSTREAM-TRANSFORM-02 <-- KSTREAM-SOURCE-00 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) --> KSTREAM-SINK-03 <-- KSTREAM-TRANSFORM-01 Sink: KSTREAM-SINK-03 (topic: events) <-- KSTREAM-TRANSFORM-02 {code} h3. h3. Attempt 1 at reproducing this issue Our stream app runs with processing.guarantee *exactly_once* After a Kafka test outage where all 3 brokers pod were deleted at the same time, Brokers restarted and initialized succesfuly. When restarting the topology above, one of the tasks would never initialize fully, the restore phase would keep outputting this messages every few minutes: {code:java} 2021-08-16 14:20:33,421 INFO stream-thread [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] Restoration in progress for 1 partitions. {commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, totalRestored=2002076} [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] (org.apache.kafka.streams.processor.internals.StoreChangelogReader) {code} Task for partition 8 would never initialize, no more data would be read from the source commands topic for that partition. In an attempt to recover, we restarted the stream app with stream processing.guarantee back to at_least_once, than it proceed with reading the changelog and restoring partition 8 fully. But we noticed afterward, for the next hour until we rebuilt the system, that partition 8 from command-expiry-store-changelog would not be cleaned/compacted by the log cleaner/compacter compared to other partitions. (could be unrelated, because we have seen that before) So we resorted to delete/recreate our command-expiry-store-changelog topic and events topic and regenerate it from the commands, reading from beginning. Things went back to normal h3. Attempt 2 at reproducing this issue We force-deleted all 3 pod running kafka. After that, one of the partition can’t be restored. (like reported in previous attempt) For that partition, we noticed these logs on the broker {code:java} [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, command-expiry-store-changelog-9) while trying to send transaction markers for commands-processor-0_9, these partitions are likely deleted already and hence can be skipped (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} Then - we stop the kstream app, - restarted kafka brokers cleanly - Restarting the Kstream app, Those logs messages showed up on the kstream app log: {code:java} 2021-08-27 18:34:42,413 INFO [Consumer clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, groupId=commands-processor] The following partitions still have unstable offsets which are not cleared on the broker side: [commands-9], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} This would cause our processor to not consume from that specific source topic-partition. Deleting downstream topic and replaying data would NOT fix the issue (EXACTLY_ONCE or AT_LEAST_ONCE) Workaround found: Deleted the group associated with the processor, and restarted the kstream application, application went on to process data normally. (We have resigned to use AT_LEAST_ONCE for now ) KStream config : {code:java} StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest” StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now AT_LEAST_ONCE) producer.delivery.timeout.ms=12 consumer.session.timeout.ms=3 consumer.heartbeat.interval.ms=1 consumer.max.poll.interval.ms=30 num.stream.threads=1{code} We will be doing more
[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Description: Our KStream app offset stay stuck on 1 partition after outage when exactly_once . Running with KStream 2.8, kafka broker 2.8, 3 brokers. commands topic is 10 partitions (replication 2, min-insync 2) command-expiry-store-changelog topic is 10 partitions (replication 2, min-insync 2) events topic is 10 partitions (replication 2, min-insync 2) with this topology Topologies: {code:java} Sub-topology: 0 Source: KSTREAM-SOURCE-00 (topics: [commands]) --> KSTREAM-TRANSFORM-01 Processor: KSTREAM-TRANSFORM-01 (stores: []) --> KSTREAM-TRANSFORM-02 <-- KSTREAM-SOURCE-00 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) --> KSTREAM-SINK-03 <-- KSTREAM-TRANSFORM-01 Sink: KSTREAM-SINK-03 (topic: events) <-- KSTREAM-TRANSFORM-02 {code} h3. h3. Attempt 1 at reproducing this issue Our stream app runs with processing.guarantee *exactly_once* After a Kafka test outage where all 3 brokers pod were deleted at the same time, Brokers restarted and initialized succesfuly. When restarting the topology above, one of the tasks would never initialize fully, the restore phase would keep outputting this messages every few minutes: {code:java} 2021-08-16 14:20:33,421 INFO stream-thread [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] Restoration in progress for 1 partitions. {commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, totalRestored=2002076} [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] (org.apache.kafka.streams.processor.internals.StoreChangelogReader) {code} Task for partition 8 would never initialize, no more data would be read from the source commands topic for that partition. In an attempt to recover, we restarted the stream app with stream processing.guarantee back to at_least_once, than it proceed with reading the changelog and restoring partition 8 fully. But we noticed afterward, for the next hour until we rebuilt the system, that partition 8 from command-expiry-store-changelog would not be cleaned/compacted by the log cleaner/compacter compared to other partitions. (could be unrelated, because we have seen that before) So we resorted to delete/recreate our command-expiry-store-changelog topic and events topic and regenerate it from the commands, reading from beginning. Things went back to normal h3. Attempt 2 at reproducing this issue We force-deleted all 3 pod running kafka. After that, one of the partition can’t be restored. (like reported in previous attempt) For that partition, we noticed these logs on the broker {code:java} [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, command-expiry-store-changelog-9) while trying to send transaction markers for commands-processor-0_9, these partitions are likely deleted already and hence can be skipped (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} Then - we stop the kstream app, - restarted kafka brokers cleanly - Restarting the Kstream app, Those logs messages showed up on the kstream app log: {code:java} 2021-08-27 18:34:42,413 INFO [Consumer clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, groupId=commands-processor] The following partitions still have unstable offsets which are not cleared on the broker side: [commands-9], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} This would cause our processor to not consume from that specific source topic-partition. Deleting downstream topic and replaying data would NOT fix the issue (EXACTLY_ONCE or AT_LEAST_ONCE) Workaround found: Deleted the group associated with the processor, and restarted the kstream application, application went on to process data normally. (We have resigned to use AT_LEAST_ONCE for now ) KStream config : {code:java} StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest” StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now AT_LEAST_ONCE) producer.delivery.timeout.ms=12 consumer.session.timeout.ms=3 consumer.heartbeat.interval.ms=1 consumer.max.poll.interval.ms=30 num.stream.threads=1{code} We will be doing more tests and
[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Description: Our KStream app offset stay stuck on 1 partition after outage when exactly_once . Running with KStream 2.8, kafka broker 2.8, 3 brokers. commands topic is 10 partitions (replication 2, min-insync 2) command-expiry-store-changelog topic is 10 partitions (replication 2, min-insync 2) events topic is 10 partitions (replication 2, min-insync 2) with this topology Topologies: {code:java} Sub-topology: 0 Source: KSTREAM-SOURCE-00 (topics: [commands]) --> KSTREAM-TRANSFORM-01 Processor: KSTREAM-TRANSFORM-01 (stores: []) --> KSTREAM-TRANSFORM-02 <-- KSTREAM-SOURCE-00 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) --> KSTREAM-SINK-03 <-- KSTREAM-TRANSFORM-01 Sink: KSTREAM-SINK-03 (topic: events) <-- KSTREAM-TRANSFORM-02 {code} h3. h3. Attempt 1 at reproducing this issue Our stream app runs with processing.guarantee *exactly_once* After a Kafka test outage where all 3 brokers pod were deleted at the same time, Brokers restarted and initialized succesfuly. When restarting the topology above, one of the tasks would never initialize fully, the restore phase would keep outputting this messages every few minutes: {code:java} 2021-08-16 14:20:33,421 INFO stream-thread [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] Restoration in progress for 1 partitions. {commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, totalRestored=2002076} [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] (org.apache.kafka.streams.processor.internals.StoreChangelogReader) {code} Task for partition 8 would never initialize, no more data would be read from the source commands topic for that partition. In an attempt to recover, we restarted the stream app with stream processing.guarantee back to at_least_once, than it proceed with reading the changelog and restoring partition 8 fully. But we noticed afterward, for the next hour until we rebuilt the system, that partition 8 from command-expiry-store-changelog would not be cleaned/compacted by the log cleaner/compacter compared to other partitions. (could be unrelated, because we have seen that before) So we resorted to delete/recreate our command-expiry-store-changelog topic and events topic and regenerate it from the commands, reading from beginning. Things went back to normal h3. Attempt 2 at reproducing this issue We force-deleted all 3 pod running kafka. After that, one of the partition can’t be restored. (like reported in previous attempt) For that partition, we noticed these logs on the broker {code:java} [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, command-expiry-store-changelog-9) while trying to send transaction markers for commands-processor-0_9, these partitions are likely deleted already and hence can be skipped (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} Then - we stop the kstream app, - restarted kafka brokers cleanly - Restarting the Kstream app, Those logs messages showed up on the kstream app log: {code:java} 2021-08-27 18:34:42,413 INFO [Consumer clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, groupId=commands-processor] The following partitions still have unstable offsets which are not cleared on the broker side: [commands-9], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} This would cause our processor to not consume from that specific source topic-partition. Deleting downstream topic and replaying data would NOT fix the issue (EXACTLY_ONCE or AT_LEAST_ONCE) Workaround found: Deleted the group associated with the processor, and restarted the kstream application, application went on to process data normally. (We have resigned to use AT_LEAST_ONCE for now ) KStream config : {code:java} StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest” StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now AT_LEAST_ONCE) producer.delivery.timeout.ms=12 consumer.session.timeout.ms=3 consumer.heartbeat.interval.ms=1 consumer.max.poll.interval.ms=30 num.stream.threads=1{code} We will be doing more of test
[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Description: Our KStream app offset stay stuck on 1 partition after outage when exactly_once . Running with KStream 2.8, kafka broker 2.8, 3 brokers. commands topic is 10 partitions (replication 2, min-insync 2) command-expiry-store-changelog topic is 10 partitions (replication 2, min-insync 2) events topic is 10 partitions (replication 2, min-insync 2) with this topology Topologies: {code:java} Sub-topology: 0 Source: KSTREAM-SOURCE-00 (topics: [commands]) --> KSTREAM-TRANSFORM-01 Processor: KSTREAM-TRANSFORM-01 (stores: []) --> KSTREAM-TRANSFORM-02 <-- KSTREAM-SOURCE-00 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) --> KSTREAM-SINK-03 <-- KSTREAM-TRANSFORM-01 Sink: KSTREAM-SINK-03 (topic: events) <-- KSTREAM-TRANSFORM-02 {code} h3. h3. Attempt 1 at reproducing this issue Our stream app runs with processing.guarantee *exactly_once* After a Kafka test outage where all 3 brokers pod were deleted at the same time, Brokers restarted and initialized succesfuly. When restarting the topology above, one of the tasks would never initialize fully, the restore phase would keep outputting this messages every few minutes: {code:java} 2021-08-16 14:20:33,421 INFO stream-thread [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] Restoration in progress for 1 partitions. {commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, totalRestored=2002076} [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] (org.apache.kafka.streams.processor.internals.StoreChangelogReader) {code} Task for partition 8 would never initialize, no more data would be read from the source commands topic for that partition. In an attempt to recover, we restarted the stream app with stream processing.guarantee back to at_least_once, than it proceed with reading the changelog and restoring partition 8 fully. But we noticed afterward, for the next hour until we rebuilt the system, that partition 8 from command-expiry-store-changelog would not be cleaned/compacted by the log cleaner/compacter compared to other partitions. (could be unrelated, because we have seen that before) So we resorted to delete/recreate our command-expiry-store-changelog topic and events topic and regenerate it from the commands, reading from beginning. Things went back to normal h3. Attempt 2 at reproducing this issue We force-deleted all 3 pod running kafka. After that, one of the partition can’t be restored. (like reported in previous attempt) For that partition, we noticed these logs on the broker {code:java} [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, command-expiry-store-changelog-9) while trying to send transaction markers for commands-processor-0_9, these partitions are likely deleted already and hence can be skipped (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} Then - we stop the kstream app, - restarted kafka brokers cleanly - Restarting the Kstream app, Those logs messages showed up on the kstream app log: {code:java} 2021-08-27 18:34:42,413 INFO [Consumer clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, groupId=commands-processor] The following partitions still have unstable offsets which are not cleared on the broker side: [commands-9], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} This would cause our processor to not consume from that specific source topic-partition. Deleting downstream topic and replaying data would NOT fix the issue (EXACTLY_ONCE or AT_LEAST_ONCE) Workaround found: Deleted the group associated with the processor, and restarted the kstream application, application went on to process data normally. (We have resigned to use AT_LEAST_ONCE for now ) KStream config : StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest” StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now AT_LEAST_ONCE) producer.delivery.timeout.ms=12 consumer.session.timeout.ms=3 consumer.heartbeat.interval.ms=1 consumer.max.poll.interval.ms=30 num.stream.threads=1 We will be doing more of test and I will update
[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Description: Our KStream app offset stay stuck on 1 partition after outage when exactly_once . Running with KStream 2.8, kafka broker 2.8, 3 brokers. commands topic is 10 partitions (replication 2, min-insync 2) command-expiry-store-changelog topic is 10 partitions (replication 2, min-insync 2) events topic is 10 partitions (replication 2, min-insync 2) with this topology Topologies: {code:java} Sub-topology: 0 Source: KSTREAM-SOURCE-00 (topics: [commands]) --> KSTREAM-TRANSFORM-01 Processor: KSTREAM-TRANSFORM-01 (stores: []) --> KSTREAM-TRANSFORM-02 <-- KSTREAM-SOURCE-00 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) --> KSTREAM-SINK-03 <-- KSTREAM-TRANSFORM-01 Sink: KSTREAM-SINK-03 (topic: events) <-- KSTREAM-TRANSFORM-02 {code} h3. Attempt 1 at reproducing this issue Our stream app runs with processing.guarantee *exactly_once* After a Kafka test outage where all 3 brokers pod were deleted at the same time, Brokers restarted and initialized succesfuly. When restarting the topology above, one of the tasks would never initialize fully, the restore phase would keep outputting this messages every few minutes: {code:java} 2021-08-16 14:20:33,421 INFO stream-thread [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] Restoration in progress for 1 partitions. {commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, totalRestored=2002076} [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] (org.apache.kafka.streams.processor.internals.StoreChangelogReader) {code} Task for partition 8 would never initialize, no more data would be read from the source commands topic for that partition. In an attempt to recover, we restarted the stream app with stream processing.guarantee back to at_least_once, than it proceed with reading the changelog and restoring partition 8 fully. But we noticed afterward, for the next hour until we rebuilt the system, that partition 8 from command-expiry-store-changelog would not be cleaned/compacted by the log cleaner/compacter compared to other partitions. (could be unrelated, because we have seen that before) So we resorted to delete/recreate our command-expiry-store-changelog topic and events topic and regenerate it from the commands, reading from beginning. Things went back to normal h3. Attempt 2 at reproducing this issue We force-deleted all 3 pod running kafka. After that, one of the partition can’t be restored. (like reported in previous attempt) For that partition, we noticed these logs on the broker {code:java} [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, command-expiry-store-changelog-9) while trying to send transaction markers for commands-processor-0_9, these partitions are likely deleted already and hence can be skipped (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} Then - we stop the kstream app, - restarted kafka brokers cleanly - Restarting the Kstream app, Those logs messages showed up on the kstream app log: {code:java} 2021-08-27 18:34:42,413 INFO [Consumer clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, groupId=commands-processor] The following partitions still have unstable offsets which are not cleared on the broker side: [commands-9], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} This would cause our processor to not consume from that specific source topic-partition. Deleting downstream topic and replaying data would NOT fix the issue (EXACTLY_ONCE or AT_LEAST_ONCE) Workaround found: Deleted the group associated with the processor, and restarted the kstream application, application went on to process data normally. (We have resigned to use AT_LEAST_ONCE for now ) KStream config : StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest” StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now AT_LEAST_ONCE) producer.delivery.timeout.ms=12 consumer.session.timeout.ms=3 consumer.heartbeat.interval.ms=1 consumer.max.poll.interval.ms=30 num.stream.threads=1 We will be doing more of test and I will update
[jira] [Updated] (KAFKA-13272) KStream offset stuck with exactly_once enabled after brokers outage
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Summary: KStream offset stuck with exactly_once enabled after brokers outage (was: KStream offset stuck with exactly_once after brokers outage) > KStream offset stuck with exactly_once enabled after brokers outage > --- > > Key: KAFKA-13272 > URL: https://issues.apache.org/jira/browse/KAFKA-13272 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0 > Environment: Kafka running on Kubernetes > centos >Reporter: F Méthot >Priority: Major > > Our KStream app offset stay stuck with exactly_once after outage. > Running with KStream 2.8, kafka broker 2.8, > 3 brokers. > commands topic is 10 partitions (replication 2, min-insync 2) > command-expiry-store-changelog topic is 10 partitions (replication 2, > min-insync 2) > events topic is 10 partitions (replication 2, min-insync 2) > with this topology > Topologies: > > {code:java} > Sub-topology: 0 > Source: KSTREAM-SOURCE-00 (topics: [commands]) > --> KSTREAM-TRANSFORM-01 > Processor: KSTREAM-TRANSFORM-01 (stores: []) > --> KSTREAM-TRANSFORM-02 > <-- KSTREAM-SOURCE-00 > Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) > --> KSTREAM-SINK-03 > <-- KSTREAM-TRANSFORM-01 > Sink: KSTREAM-SINK-03 (topic: events) > <-- KSTREAM-TRANSFORM-02 > {code} > h3. > Attempt 1 at reproducing this issue > > Our stream app runs with processing.guarantee *exactly_once* > After a Kafka test outage where all 3 brokers pod were deleted at the same > time, > Brokers restarted and initialized succesfuly. > When restarting the topology above, one of the tasks would never initialize > fully, the restore phase would keep outputting this messages every few > minutes: > > {code:java} > 2021-08-16 14:20:33,421 INFO stream-thread > [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] > Restoration in progress for 1 partitions. > {commands-processor-expiry-store-changelog-8: position=11775908, > end=11775911, totalRestored=2002076} > [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] > (org.apache.kafka.streams.processor.internals.StoreChangelogReader) > {code} > Task for partition 8 would never initialize, no more data would be read from > the source commands topic for that partition. > > In an attempt to recover, we restarted the stream app with stream > processing.guarantee back to at_least_once, than it proceed with reading the > changelog and restoring partition 8 fully. > But we noticed afterward, for the next hour until we rebuilt the system, that > partition 8 from command-expiry-store-changelog would not be > cleaned/compacted by the log cleaner/compacter compared to other partitions. > (could be unrelated, because we have seen that before) > So we resorted to delete/recreate our command-expiry-store-changelog topic > and events topic and regenerate it from the commands, reading from beginning. > Things went back to normal > h3. Attempt 2 at reproducing this issue > We force-deleted all 3 pod running kafka. > After that, one of the partition can’t be restored. (like reported in > previous attempt) > For that partition, we noticed these logs on the broker > {code:java} > [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: > Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, > command-expiry-store-changelog-9) while trying to send transaction markers > for commands-processor-0_9, these partitions are likely deleted already and > hence can be skipped > (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} > Then > - we stop the kstream app, > - restarted kafka brokers cleanly > - Restarting the Kstream app, > Those logs messages showed up on the kstream app log: > > {code:java} > 2021-08-27 18:34:42,413 INFO [Consumer > clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, > groupId=commands-processor] The following partitions still have unstable > offsets which are not cleared on the broker side: [commands-9], this could be > either transactional offsets waiting for completion, or normal offsets > waiting for replication after appending to local log > [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) > > {code} > This would cause our processor to not consume from that specific source > topic-partition. > Deleting downstream topic and replaying data would NOT fix the issue > (EXACTLY_ONCE or AT_LEAST_ONCE) > Workaround found: > Deleted
[jira] [Updated] (KAFKA-13272) KStream offset stuck after brokers outage when exactly_once enabled
[ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] F Méthot updated KAFKA-13272: - Summary: KStream offset stuck after brokers outage when exactly_once enabled (was: KStream offset stuck with exactly_once enabled after brokers outage) > KStream offset stuck after brokers outage when exactly_once enabled > --- > > Key: KAFKA-13272 > URL: https://issues.apache.org/jira/browse/KAFKA-13272 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0 > Environment: Kafka running on Kubernetes > centos >Reporter: F Méthot >Priority: Major > > Our KStream app offset stay stuck with exactly_once after outage. > Running with KStream 2.8, kafka broker 2.8, > 3 brokers. > commands topic is 10 partitions (replication 2, min-insync 2) > command-expiry-store-changelog topic is 10 partitions (replication 2, > min-insync 2) > events topic is 10 partitions (replication 2, min-insync 2) > with this topology > Topologies: > > {code:java} > Sub-topology: 0 > Source: KSTREAM-SOURCE-00 (topics: [commands]) > --> KSTREAM-TRANSFORM-01 > Processor: KSTREAM-TRANSFORM-01 (stores: []) > --> KSTREAM-TRANSFORM-02 > <-- KSTREAM-SOURCE-00 > Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) > --> KSTREAM-SINK-03 > <-- KSTREAM-TRANSFORM-01 > Sink: KSTREAM-SINK-03 (topic: events) > <-- KSTREAM-TRANSFORM-02 > {code} > h3. > Attempt 1 at reproducing this issue > > Our stream app runs with processing.guarantee *exactly_once* > After a Kafka test outage where all 3 brokers pod were deleted at the same > time, > Brokers restarted and initialized succesfuly. > When restarting the topology above, one of the tasks would never initialize > fully, the restore phase would keep outputting this messages every few > minutes: > > {code:java} > 2021-08-16 14:20:33,421 INFO stream-thread > [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] > Restoration in progress for 1 partitions. > {commands-processor-expiry-store-changelog-8: position=11775908, > end=11775911, totalRestored=2002076} > [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] > (org.apache.kafka.streams.processor.internals.StoreChangelogReader) > {code} > Task for partition 8 would never initialize, no more data would be read from > the source commands topic for that partition. > > In an attempt to recover, we restarted the stream app with stream > processing.guarantee back to at_least_once, than it proceed with reading the > changelog and restoring partition 8 fully. > But we noticed afterward, for the next hour until we rebuilt the system, that > partition 8 from command-expiry-store-changelog would not be > cleaned/compacted by the log cleaner/compacter compared to other partitions. > (could be unrelated, because we have seen that before) > So we resorted to delete/recreate our command-expiry-store-changelog topic > and events topic and regenerate it from the commands, reading from beginning. > Things went back to normal > h3. Attempt 2 at reproducing this issue > We force-deleted all 3 pod running kafka. > After that, one of the partition can’t be restored. (like reported in > previous attempt) > For that partition, we noticed these logs on the broker > {code:java} > [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: > Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, > command-expiry-store-changelog-9) while trying to send transaction markers > for commands-processor-0_9, these partitions are likely deleted already and > hence can be skipped > (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} > Then > - we stop the kstream app, > - restarted kafka brokers cleanly > - Restarting the Kstream app, > Those logs messages showed up on the kstream app log: > > {code:java} > 2021-08-27 18:34:42,413 INFO [Consumer > clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, > groupId=commands-processor] The following partitions still have unstable > offsets which are not cleared on the broker side: [commands-9], this could be > either transactional offsets waiting for completion, or normal offsets > waiting for replication after appending to local log > [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) > > {code} > This would cause our processor to not consume from that specific source > topic-partition. > Deleting downstream topic and replaying data would NOT fix the issue > (EXACTLY_ONCE or AT_LEAST_ONCE) > Workaround found:
[jira] [Created] (KAFKA-13272) KStream offset stuck with exactly_once after brokers outage
F Méthot created KAFKA-13272: Summary: KStream offset stuck with exactly_once after brokers outage Key: KAFKA-13272 URL: https://issues.apache.org/jira/browse/KAFKA-13272 Project: Kafka Issue Type: Bug Components: streams Affects Versions: 2.8.0 Environment: Kafka running on Kubernetes centos Reporter: F Méthot Our KStream app offset stay stuck with exactly_once after outage. Running with KStream 2.8, kafka broker 2.8, 3 brokers. commands topic is 10 partitions (replication 2, min-insync 2) command-expiry-store-changelog topic is 10 partitions (replication 2, min-insync 2) events topic is 10 partitions (replication 2, min-insync 2) with this topology Topologies: {code:java} Sub-topology: 0 Source: KSTREAM-SOURCE-00 (topics: [commands]) --> KSTREAM-TRANSFORM-01 Processor: KSTREAM-TRANSFORM-01 (stores: []) --> KSTREAM-TRANSFORM-02 <-- KSTREAM-SOURCE-00 Processor: KSTREAM-TRANSFORM-02 (stores: [command-expiry-store]) --> KSTREAM-SINK-03 <-- KSTREAM-TRANSFORM-01 Sink: KSTREAM-SINK-03 (topic: events) <-- KSTREAM-TRANSFORM-02 {code} h3. Attempt 1 at reproducing this issue Our stream app runs with processing.guarantee *exactly_once* After a Kafka test outage where all 3 brokers pod were deleted at the same time, Brokers restarted and initialized succesfuly. When restarting the topology above, one of the tasks would never initialize fully, the restore phase would keep outputting this messages every few minutes: {code:java} 2021-08-16 14:20:33,421 INFO stream-thread [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] Restoration in progress for 1 partitions. {commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, totalRestored=2002076} [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] (org.apache.kafka.streams.processor.internals.StoreChangelogReader) {code} Task for partition 8 would never initialize, no more data would be read from the source commands topic for that partition. In an attempt to recover, we restarted the stream app with stream processing.guarantee back to at_least_once, than it proceed with reading the changelog and restoring partition 8 fully. But we noticed afterward, for the next hour until we rebuilt the system, that partition 8 from command-expiry-store-changelog would not be cleaned/compacted by the log cleaner/compacter compared to other partitions. (could be unrelated, because we have seen that before) So we resorted to delete/recreate our command-expiry-store-changelog topic and events topic and regenerate it from the commands, reading from beginning. Things went back to normal h3. Attempt 2 at reproducing this issue We force-deleted all 3 pod running kafka. After that, one of the partition can’t be restored. (like reported in previous attempt) For that partition, we noticed these logs on the broker {code:java} [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, command-expiry-store-changelog-9) while trying to send transaction markers for commands-processor-0_9, these partitions are likely deleted already and hence can be skipped (kafka.coordinator.transaction.TransactionMarkerChannelManager){code} Then - we stop the kstream app, - restarted kafka brokers cleanly - Restarting the Kstream app, Those logs messages showed up on the kstream app log: {code:java} 2021-08-27 18:34:42,413 INFO [Consumer clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, groupId=commands-processor] The following partitions still have unstable offsets which are not cleared on the broker side: [commands-9], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} This would cause our processor to not consume from that specific source topic-partition. Deleting downstream topic and replaying data would NOT fix the issue (EXACTLY_ONCE or AT_LEAST_ONCE) Workaround found: Deleted the group associated with the processor, and restarted the kstream application, application went on to process data normally. (We have resigned to use AT_LEAST_ONCE for now ) KStream config : StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000 StreamsConfig.REPLICATION_FACTOR_CONFIG, 2 StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000 StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), “earliest” StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now AT_LEAST_ONCE)
[GitHub] [kafka] hachikuji commented on a change in pull request #11288: MINOR: Fix error response generation
hachikuji commented on a change in pull request #11288: URL: https://github.com/apache/kafka/pull/11288#discussion_r702039368 ## File path: clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java ## @@ -706,6 +706,11 @@ private void checkDescribeConfigsResponseVersions() { private void checkErrorResponse(AbstractRequest req, Throwable e, boolean checkEqualityAndHashCode) { AbstractResponse response = req.getErrorResponse(e); checkResponse(response, req.version(), checkEqualityAndHashCode); +Errors error = Errors.forException(e); +Map errorCounts = response.errorCounts(); +assertEquals(1, errorCounts.size()); Review comment: nit: could probably simplify these two assertions a little: ```java assertEquals(Collections.singleton(error), errorCounts.keySet()); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on a change in pull request #10914: [KAFKA-8522] Streamline tombstone and transaction marker removal
junrao commented on a change in pull request #10914: URL: https://github.com/apache/kafka/pull/10914#discussion_r702018820 ## File path: core/src/main/scala/kafka/log/LogCleanerManager.scala ## @@ -198,8 +199,23 @@ private[log] class LogCleanerManager(val logDirs: Seq[File], val cleanableLogs = dirtyLogs.filter { ltc => (ltc.needCompactionNow && ltc.cleanableBytes > 0) || ltc.cleanableRatio > ltc.log.config.minCleanableRatio } + if(cleanableLogs.isEmpty) { -None +val logsWithTombstonesExpired = dirtyLogs.filter { + case ltc => +// in this case, we are probably in a low throughput situation +// therefore, we should take advantage of this fact and remove tombstones if we can +// under the condition that the log's latest delete horizon is less than the current time +// tracked +ltc.log.latestDeleteHorizon != RecordBatch.NO_TIMESTAMP && ltc.log.latestDeleteHorizon <= time.milliseconds() Review comment: Yes, ideally, we want to do size based estimate. I just not sure how accurate we can estimate size given batching and compression. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (KAFKA-13271) Error while fetching metadata with correlation id 219783 : LEADER_NOT_AVAILABLE
Fátima Galera created KAFKA-13271: - Summary: Error while fetching metadata with correlation id 219783 : LEADER_NOT_AVAILABLE Key: KAFKA-13271 URL: https://issues.apache.org/jira/browse/KAFKA-13271 Project: Kafka Issue Type: Bug Components: producer Affects Versions: 2.2.2 Environment: Production environment Reporter: Fátima Galera Fix For: 2.2.3 Hi dear kafka support We are getting below error after a new connector creation [2021-09-02 19:15:23,878] WARN [Producer clientId=producer-178] Error while fetching metadata with correlation id 219783 : \{ tkr_prd2.tkr.glrep_file2=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient) [2021-09-02 19:15:23,982] WARN [Producer clientId=producer-178] Error while fetching metadata with correlation id 219784 : \{ tkr_prd2.tkr.glrep_file2=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient) [2021-09-02 19:15:24,095] ERROR WorkerSourceTask\{id=tkr_prd2-0} failed to send record to tkr_prd2.tkr.glrep_file2: {} (org.a pache.kafka.connect.runtime.WorkerSourceTask) [2021-09-02 19:15:24,149] ERROR WorkerSourceTask\{id=tkr_prd2-0} failed to send record to tkr_prd2.tkr.glrep_file2: {} (org.a pache.kafka.connect.runtime.WorkerSourceTask) We are not able to get the offset of this topic. We already changed from {{listeners=PLAINTEXT://hostname:9092 to }}{{listeners=PLAINTEXT://localhost:9092}}{{}} in /etc/kafka/server.properties file and restarted kafka services. But if we use this new value kafka connector service is not able to start because it is not able to find the services with the IP of the host. Could you please let us know what can we do? We already created the connector several times using different names https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13243) Differentiate metric latency measured in millis and nanos
[ https://issues.apache.org/jira/browse/KAFKA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409511#comment-17409511 ] Josep Prat commented on KAFKA-13243: KIP can be found here [https://cwiki.apache.org/confluence/display/KAFKA/KIP-773%3A+Differentiate+consistently+metric+latency+measured+in+millis+and+nanos] > Differentiate metric latency measured in millis and nanos > - > > Key: KAFKA-13243 > URL: https://issues.apache.org/jira/browse/KAFKA-13243 > Project: Kafka > Issue Type: Improvement > Components: metrics >Reporter: Guozhang Wang >Assignee: Josep Prat >Priority: Major > Labels: needs-kip > > Today most of the client latency metrics are measured in millis, and some in > nanos. For those measured in nanos we usually differentiate them by having a > `-ns` suffix in the metric names, e.g. `io-wait-time-ns-avg` and > `io-time-ns-avg`. But there are a few that we obviously forgot to follow this > pattern, e.g. `io-wait-time-total`: it is inconsistent where `avg` has `-ns` > suffix and `total` has not. I did a quick search and found just three of them: > * bufferpool-wait-time-total -> bufferpool-wait-time-ns-total > * io-wait-time-total -> io-wait-time-ns-total > * iotime-total -> io-time-ns-total (note that there are two inconsistencies > on naming, the average metric is `io-time-ns-avg` whereas total is > `iotime-total`, I suggest we use `io-time` instead of `iotime` for both). > We should change their name accordingly with the `-ns` suffix as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [kafka] mimaison commented on pull request #11288: MINOR: Fix error response generation
mimaison commented on pull request #11288: URL: https://github.com/apache/kafka/pull/11288#issuecomment-912527113 Thanks @dajac for the feedback. I've addressed your comments. I agree with your comment about testing of these classes. `RequestResponseTest` is messy and it's hard to tell what's actually tested. I wouldn't be surprized if some requests or responses are completely missed! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] showuon commented on pull request #11086: KAFKA-13103: add REBALANCE_IN_PROGRESS error as retriable error for AlterConsumerGroupOffsetsHandler
showuon commented on pull request #11086: URL: https://github.com/apache/kafka/pull/11086#issuecomment-912521901 Thanks for reminding. Updated. Thank you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] dajac commented on pull request #11086: KAFKA-13103: add REBALANCE_IN_PROGRESS error as retriable error
dajac commented on pull request #11086: URL: https://github.com/apache/kafka/pull/11086#issuecomment-912511738 @showuon Yes. Could you update the description of the PR to reflect what has been done? The description is outdated. I will merge it after. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] dajac commented on a change in pull request #11288: MINOR: Fix error response generation
dajac commented on a change in pull request #11288: URL: https://github.com/apache/kafka/pull/11288#discussion_r701855200 ## File path: clients/src/main/java/org/apache/kafka/common/requests/DescribeProducersRequest.java ## @@ -69,7 +71,7 @@ public DescribeProducersRequestData data() { @Override public DescribeProducersResponse getErrorResponse(int throttleTimeMs, Throwable e) { Errors error = Errors.forException(e); -DescribeProducersResponseData response = new DescribeProducersResponseData(); +List topics = new ArrayList<>(); Review comment: nit: We could keep instantiating `DescribeProducersResponseData` here and add the topic to it with `reponse.topics.add(..)`. ## File path: clients/src/main/java/org/apache/kafka/common/requests/AlterClientQuotasRequest.java ## @@ -125,7 +127,10 @@ public AlterClientQuotasResponse getErrorResponse(int throttleTimeMs, Throwable .setEntityType(entityData.entityType()) .setEntityName(entityData.entityName())); } -responseEntries.add(new AlterClientQuotasResponseData.EntryData().setEntity(responseEntities)); +responseEntries.add(new AlterClientQuotasResponseData.EntryData() +.setEntity(responseEntities) +.setErrorCode(error.code()) +.setErrorMessage(error.message())); Review comment: nit: Should we use 4 spaces to indent those ones here? We use 4 spaces for similar code at L127/128 so the code would remain a bit more homogeneous. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] dajac opened a new pull request #11294: KAFKA-13266; `InitialFetchState` should be created after partition is removed from the fetchers
dajac opened a new pull request #11294: URL: https://github.com/apache/kafka/pull/11294 `ReplicationTest.test_replication_with_broker_failure` in KRaft mode sometimes fails with the following error in the log: ``` [2021-08-31 11:31:25,092] ERROR [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Unexpected error occurred while processing data for partition __consumer_offsets-1 at offset 31727 (kafka.server.ReplicaFetcherThread)java.lang.IllegalStateException: Offset mismatch for partition __consumer_offsets-1: fetched offset = 31727, log end offset = 31728. at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:194) at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$8(AbstractFetcherThread.scala:545) at scala.Option.foreach(Option.scala:437) at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:533) at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7$adapted(AbstractFetcherThread.scala:532) at kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62) at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scal a:359) at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:355) at scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:309) at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:532) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:216) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:215) at scala.Option.foreach(Option.scala:437) at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:215) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:197) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:99)[2021-08-31 11:31:25,093] WARN [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Partition __consumer_offsets-1 marked as failed (kafka.server.ReplicaFetcherThread) ``` The issue is due to a race condition in `ReplicaManager#applyLocalFollowersDelta`. The `InitialFetchState` is created and populated before the partition is removed from the fetcher threads. This means that the fetch offset of the `InitialFetchState` could be outdated when the fetcher threads are re-started because the fetcher threads could have incremented the log end offset in between. The patch fixes the issue by removing the partitions from the replica fetcher threads before creating the `InitialFetchState` for them. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] keepal7 commented on pull request #11280: MINOR:When rebalance times out, print the member's clientHost in the …
keepal7 commented on pull request #11280: URL: https://github.com/apache/kafka/pull/11280#issuecomment-912472160 Who can tell me why the pipeline can't pass? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (KAFKA-13243) Differentiate metric latency measured in millis and nanos
[ https://issues.apache.org/jira/browse/KAFKA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409407#comment-17409407 ] Josep Prat commented on KAFKA-13243: I'll like to tackle this one > Differentiate metric latency measured in millis and nanos > - > > Key: KAFKA-13243 > URL: https://issues.apache.org/jira/browse/KAFKA-13243 > Project: Kafka > Issue Type: Improvement > Components: metrics >Reporter: Guozhang Wang >Assignee: Josep Prat >Priority: Major > Labels: needs-kip > > Today most of the client latency metrics are measured in millis, and some in > nanos. For those measured in nanos we usually differentiate them by having a > `-ns` suffix in the metric names, e.g. `io-wait-time-ns-avg` and > `io-time-ns-avg`. But there are a few that we obviously forgot to follow this > pattern, e.g. `io-wait-time-total`: it is inconsistent where `avg` has `-ns` > suffix and `total` has not. I did a quick search and found just three of them: > * bufferpool-wait-time-total -> bufferpool-wait-time-ns-total > * io-wait-time-total -> io-wait-time-ns-total > * iotime-total -> io-time-ns-total (note that there are two inconsistencies > on naming, the average metric is `io-time-ns-avg` whereas total is > `iotime-total`, I suggest we use `io-time` instead of `iotime` for both). > We should change their name accordingly with the `-ns` suffix as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-13243) Differentiate metric latency measured in millis and nanos
[ https://issues.apache.org/jira/browse/KAFKA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josep Prat reassigned KAFKA-13243: -- Assignee: Josep Prat > Differentiate metric latency measured in millis and nanos > - > > Key: KAFKA-13243 > URL: https://issues.apache.org/jira/browse/KAFKA-13243 > Project: Kafka > Issue Type: Improvement > Components: metrics >Reporter: Guozhang Wang >Assignee: Josep Prat >Priority: Major > Labels: needs-kip > > Today most of the client latency metrics are measured in millis, and some in > nanos. For those measured in nanos we usually differentiate them by having a > `-ns` suffix in the metric names, e.g. `io-wait-time-ns-avg` and > `io-time-ns-avg`. But there are a few that we obviously forgot to follow this > pattern, e.g. `io-wait-time-total`: it is inconsistent where `avg` has `-ns` > suffix and `total` has not. I did a quick search and found just three of them: > * bufferpool-wait-time-total -> bufferpool-wait-time-ns-total > * io-wait-time-total -> io-wait-time-ns-total > * iotime-total -> io-time-ns-total (note that there are two inconsistencies > on naming, the average metric is `io-time-ns-avg` whereas total is > `iotime-total`, I suggest we use `io-time` instead of `iotime` for both). > We should change their name accordingly with the `-ns` suffix as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13247) Adding functionality for loading private key entry by alias from the keystore
[ https://issues.apache.org/jira/browse/KAFKA-13247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Izhikov updated KAFKA-13247: Labels: kip-required (was: ) > Adding functionality for loading private key entry by alias from the keystore > - > > Key: KAFKA-13247 > URL: https://issues.apache.org/jira/browse/KAFKA-13247 > Project: Kafka > Issue Type: Improvement >Reporter: Tigran Margaryan >Priority: Major > Labels: kip-required > > Hello team, > While configuring SSL for Kafka connectivity , I found out that there is no > possibility to choose/load the private key entry by alias from the keystore > defined via > org.apache.kafka.common.config.SslConfigs.SSL_KEYSTORE_LOCATION_CONFIG. It > turns out that the keystore could not have multiple private key entries . > Kindly ask you to add that config (smth. like SSL_KEY_ALIAS_CONFIG) into > SslConfigs with the corresponding functionality which should load only the > private key entry by defined alias. > > Thanks in advance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [kafka] mimaison commented on a change in pull request #11288: MINOR: Fix error response generation
mimaison commented on a change in pull request #11288: URL: https://github.com/apache/kafka/pull/11288#discussion_r701702133 ## File path: clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java ## @@ -706,6 +706,8 @@ private void checkDescribeConfigsResponseVersions() { private void checkErrorResponse(AbstractRequest req, Throwable e, boolean checkEqualityAndHashCode) { AbstractResponse response = req.getErrorResponse(e); checkResponse(response, req.version(), checkEqualityAndHashCode); +Map errorCounts = response.errorCounts(); +assertTrue(errorCounts.containsKey(Errors.forException(e)), "API Key " + req.apiKey().name + "V" + req.version() + " failed errorCounts test"); Review comment: @hachikuji Thanks for the review! I pushed an update -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] mimaison commented on a change in pull request #11220: KAFKA-10777: Add additional configuration to control MirrorMaker 2 internal topics naming convention
mimaison commented on a change in pull request #11220: URL: https://github.com/apache/kafka/pull/11220#discussion_r701694521 ## File path: connect/mirror-client/src/main/java/org/apache/kafka/connect/mirror/DefaultReplicationPolicy.java ## @@ -70,4 +70,32 @@ public String upstreamTopic(String topic) { return topic.substring(source.length() + separator.length()); } } + +private String internalSuffix() { +return separator + "internal"; +} + +private String checkpointTopicSufficx() { Review comment: `checkpointTopicSufficx` -> `checkpointsTopicSuffix` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] mimaison commented on pull request #11220: KAFKA-10777: Add additional configuration to control MirrorMaker 2 internal topics naming convention
mimaison commented on pull request #11220: URL: https://github.com/apache/kafka/pull/11220#issuecomment-912351237 Thanks @OmniaGM for the PR. I'll try to take a look next week. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org