[jira] [Commented] (KAFKA-9225) kafka fail to run on linux-aarch64
[ https://issues.apache.org/jira/browse/KAFKA-9225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057644#comment-17057644 ] ASF GitHub Bot commented on KAFKA-9225: --- jiameixie commented on pull request #8284: KAFKA-9225: rocksdb 5.18.3 to 5.18.4 URL: https://github.com/apache/kafka/pull/8284 Bump RocksDB from 5.18.3 to 5.18.4, which supports all platforms. Related RocksDB issues for this version are https://github.com/facebook/rocksdb/pull/6497 and https://github.com/facebook/rocksdb/issues/6188 Change-Id: I3febec8e36550edcb7f88839cc1e2b2a54984564 Signed-off-by: Jiamei Xie ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > kafka fail to run on linux-aarch64 > -- > > Key: KAFKA-9225 > URL: https://issues.apache.org/jira/browse/KAFKA-9225 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: jiamei xie >Priority: Blocker > Labels: incompatible > Fix For: 3.0.0 > > Attachments: compat_report.html > > > *Steps to reproduce:* > 1. Download the latest Kafka source code > 2. Build it with Gradle > 3. 
Run > [streamDemo|[https://kafka.apache.org/23/documentation/streams/quickstart]] > when running bin/kafka-run-class.sh > org.apache.kafka.streams.examples.wordcount.WordCountDemo, it crashed with > the following error message > {code:java} > xjm@ubuntu-arm01:~/kafka$bin/kafka-run-class.sh > org.apache.kafka.streams.examples.wordcount.WordCountDemo > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/core/build/dependant-libs-2.12.10/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/tools/build/dependant-libs-2.12.10/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/api/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/transforms/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/runtime/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/file/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/mirror/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/mirror-client/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/json/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/xjm/kafka/connect/basic-auth-extension/build/dependant-libs/slf4j-log4j12-1.7.28.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: 
See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > [2019-11-19 15:42:23,277] WARN The configuration 'admin.retries' was supplied > but isn't a known config. (org.apache.kafka.clients.consumer.ConsumerConfig) > [2019-11-19 15:42:23,278] WARN The configuration 'admin.retry.backoff.ms' was > supplied but isn't a known config. > (org.apache.kafka.clients.consumer.ConsumerConfig) > [2019-11-19 15:42:24,278] ERROR stream-client > [streams-wordcount-0f3cf88b-e2c4-4fb6-b7a3-9754fad5cd48] All stream threads > have died. The instance will be in error state and should be closed. > (org.apach e.kafka.streams.KafkaStreams) > Exception in thread > "streams-wordcount-0f3cf88b-e2c4-4fb6-b7a3-9754f
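The fix for the crash above was a dependency bump rather than a code change. A minimal sketch of applying it from a Kafka source checkout; both the file path gradle/dependencies.gradle and the `rocksDB` key name are assumptions about the build layout and may differ across Kafka branches:

```shell
# Hedged sketch: bump the RocksDB version pinned in the Gradle build.
# The path gradle/dependencies.gradle and the key name `rocksDB` are
# assumptions, not verified facts about every Kafka branch.
if [ -f gradle/dependencies.gradle ]; then
  sed -i.bak 's/rocksDB: "5.18.3"/rocksDB: "5.18.4"/' gradle/dependencies.gradle
  grep 'rocksDB' gradle/dependencies.gradle   # confirm the bumped version
fi
```

After the bump, rebuilding with Gradle and re-running the WordCountDemo on an aarch64 host should no longer hit the crash, since 5.18.4 ships aarch64 native libraries.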
[jira] [Commented] (KAFKA-7983) supporting replication.throttled.replicas in dynamic broker configuration
[ https://issues.apache.org/jira/browse/KAFKA-7983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057634#comment-17057634 ] Cheng Tan commented on KAFKA-7983: -- [https://github.com/apache/kafka/pull/8283] > supporting replication.throttled.replicas in dynamic broker configuration > - > > Key: KAFKA-7983 > URL: https://issues.apache.org/jira/browse/KAFKA-7983 > Project: Kafka > Issue Type: New Feature > Components: core >Reporter: Jun Rao >Assignee: Cheng Tan >Priority: Major > > In > [KIP-226|https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration#KIP-226-DynamicBrokerConfiguration-DefaultTopicconfigs], > we added support to change broker defaults dynamically. However, it > didn't support changing leader.replication.throttled.replicas and > follower.replication.throttled.replicas. These two configs were introduced in > [KIP-73|https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas] > and control the set of topic partitions on which replication throttling > will be engaged. One useful case is to be able to set a default value for > both configs to * to allow throttling to be engaged for all topic partitions. > Currently, the static default values for both configs are ignored for > replication throttling; it would be useful to fix that as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-7983) supporting replication.throttled.replicas in dynamic broker configuration
[ https://issues.apache.org/jira/browse/KAFKA-7983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057633#comment-17057633 ] ASF GitHub Bot commented on KAFKA-7983: --- d8tltanc commented on pull request #8283: [WIP] KAFKA-7983: supporting replication.throttled.replicas in dynamic broker configuration URL: https://github.com/apache/kafka/pull/8283 **More detailed description of your change** > In KIP-226, we added support to change broker defaults dynamically. However, it didn't support changing leader.replication.throttled.replicas and follower.replication.throttled.replicas. These two configs were introduced in KIP-73 and control the set of topic partitions on which replication throttling will be engaged. One useful case is to be able to set a default value for both configs to * to allow throttling to be engaged for all topic partitions. Currently, the static default values for both configs are ignored for replication throttling; it would be useful to fix that as well. > leader.replication.throttled.replicas and follower.replication.throttled.replicas are dynamically set through ReplicationQuotaManager.markThrottled() at the topic level. However, these two properties don't exist in the broker-level config, and BrokerConfigHandler doesn't call ReplicationQuotaManager.markThrottled(). So, currently, we can't set leader.replication.throttled.replicas and follower.replication.throttled.replicas at the broker level, either statically or dynamically. In this patch, we introduce two new dynamic broker configs, both of type boolean: "leader.replication.throttled" (default: false) and "follower.replication.throttled" (default: false). If "leader.replication.throttled" is set to "true", all leader brokers will be throttled. Similarly, if "follower.replication.throttled" is set to "true", all follower brokers will be throttled. The throttle mechanism is introduced in KIP-73. 
To implement the broker-level throttle, I added a new class variable to ReplicationQuotaManager. The BrokerConfigHandler will call updateBrokerThrottle() to update this class variable upon receiving the config change notification from ZooKeeper, and ReplicationQuotaManager::isThrottled() then takes it into account. **Summary of testing strategy (including rationale)** Added ReplicationQuotaManagerTest::shouldBrokerLevelThrottleAffectAllTopicPartition() to test that all topic partitions will be throttled when the broker is throttled. I'm currently working on adding more tests. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) > supporting replication.throttled.replicas in dynamic broker configuration > - > > Key: KAFKA-7983 > URL: https://issues.apache.org/jira/browse/KAFKA-7983 > Project: Kafka > Issue Type: New Feature > Components: core >Reporter: Jun Rao >Assignee: Cheng Tan >Priority: Major > > In > [KIP-226|https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration#KIP-226-DynamicBrokerConfiguration-DefaultTopicconfigs], > we added support to change broker defaults dynamically. However, it > didn't support changing leader.replication.throttled.replicas and > follower.replication.throttled.replicas. These two configs were introduced in > [KIP-73|https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas] > and control the set of topic partitions on which replication throttling > will be engaged. One useful case is to be able to set a default value for > both configs to * to allow throttling to be engaged for all topic partitions. 
> Currently, the static default values for both configs are ignored for > replication throttling; it would be useful to fix that as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
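For context, the topic-level mechanism from KIP-73 that the patch builds on can already be driven through kafka-configs.sh. A hedged sketch: "my-topic" is a hypothetical topic name, the exact flags (e.g. --zookeeper vs --bootstrap-server) vary by Kafka version, and the command is only printed here rather than executed against a live cluster:

```shell
# Topic-level throttled-replica config per KIP-73 (already supported today);
# the broker-level default discussed in this PR is what does not exist yet.
# "my-topic" is hypothetical; flags vary across Kafka versions.
TOPIC="my-topic"
CMD="bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name ${TOPIC} \
  --add-config leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*"
echo "$CMD"   # dry run: print the command instead of running it
```

Here `*` throttles all replicas of the one topic; the PR above aims for an equivalent switch at the broker level.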
[jira] [Commented] (KAFKA-9709) add a confirm when kafka-server-stop.sh find multiple kafka instances to kill
[ https://issues.apache.org/jira/browse/KAFKA-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057519#comment-17057519 ] qiang Liu commented on KAFKA-9709: -- Created a pull request [https://github.com/apache/kafka/pull/8275] which simply adds a confirmation. But as @rondagostino pointed out in that pull request, this changes existing behavior, so maybe we should do this in a better way. Some alternatives, listed below: # add a delay of, say, 5 seconds; if there is no user interaction, kill all just as before, and perhaps add an arg like killall to skip the wait # create another script, named something like kafka-server-stop-single.sh, which will not kill anything when it finds multiple instances > add a confirm when kafka-server-stop.sh find multiple kafka instances to kill > - > > Key: KAFKA-9709 > URL: https://issues.apache.org/jira/browse/KAFKA-9709 > Project: Kafka > Issue Type: Improvement > Components: admin >Affects Versions: 2.4.1 >Reporter: qiang Liu >Priority: Minor > > Currently kafka-server-stop.sh finds all Kafka instances on the machine and > kills them all without any confirmation; when deploying multiple instances on > one machine, one may mistakenly kill all instances while only meaning to kill a > single one.
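Idea 1 above can be sketched in shell. This is an illustration, not the contents of the actual pull request: the confirm_kill helper, the 5-second timeout, and the KILLALL=1 escape hatch are all illustrative assumptions, and the PID-discovery line only roughly mirrors what kafka-server-stop.sh does:

```shell
#!/usr/bin/env bash
# Illustrative sketch of "confirm before killing multiple instances".
# confirm_kill, the 5s timeout, and the KILLALL=1 override are assumptions,
# not the contents of PR 8275.
confirm_kill() {
  local pids=("$@")
  if [ "${#pids[@]}" -gt 1 ] && [ "${KILLALL:-0}" != "1" ]; then
    echo "Found ${#pids[@]} Kafka instances: ${pids[*]}"
    local answer=""
    # On timeout (no interaction), fall back to the old kill-all behavior.
    read -t 5 -r -p "Kill all of them? [y/N] " answer || answer=y
    if [ "$answer" != "y" ]; then echo "Aborted."; return 1; fi
  fi
  kill -s TERM "${pids[@]}"
}

# Roughly how kafka-server-stop.sh locates broker processes.
PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}')
if [ -n "$PIDS" ]; then confirm_kill $PIDS; fi
```

With a single instance found, behavior is unchanged; with several, the operator gets one chance to abort, and scripted callers can set KILLALL=1 to keep today's semantics.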
[jira] [Created] (KAFKA-9709) add a confirm when kafka-server-stop.sh find multiple kafka instances to kill
qiang Liu created KAFKA-9709: Summary: add a confirm when kafka-server-stop.sh find multiple kafka instances to kill Key: KAFKA-9709 URL: https://issues.apache.org/jira/browse/KAFKA-9709 Project: Kafka Issue Type: Improvement Components: admin Affects Versions: 2.4.1 Reporter: qiang Liu Currently kafka-server-stop.sh finds all Kafka instances on the machine and kills them all without any confirmation; when deploying multiple instances on one machine, one may mistakenly kill all instances while only meaning to kill a single one.
[jira] [Commented] (KAFKA-5972) Flatten SMT does not work with null values
[ https://issues.apache.org/jira/browse/KAFKA-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057468#comment-17057468 ] ASF GitHub Bot commented on KAFKA-5972: --- rhauch commented on pull request #4021: KAFKA-5972 Flatten SMT does not work with null values URL: https://github.com/apache/kafka/pull/4021 > Flatten SMT does not work with null values > -- > > Key: KAFKA-5972 > URL: https://issues.apache.org/jira/browse/KAFKA-5972 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Tomas Zuklys >Assignee: siva santhalingam >Priority: Minor > Labels: easyfix, patch > Fix For: 1.0.3, 1.1.2, 2.0.2, 2.1.2, 2.2.2, 2.4.0, 2.3.1 > > Attachments: kafka-transforms.patch > > > Hi, > I noticed a bug in Flatten SMT while doing tests with different SMTs that are > provided out of the box. > Flatten SMT does not work as expected with schemaless JSON that has > properties with null values. > Example json: > {code} > {A={D=dValue, B=null, C=cValue}} > {code} > The issue is in the if statement that checks for a null value. > Current version: > {code} > for (Map.Entry<String, Object> entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > return; > } > ... 
> {code} > should be > {code} > for (Map.Entry<String, Object> entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > continue; > } > {code} > I have attached a patch containing the fix for this issue.
[jira] [Commented] (KAFKA-9701) Consumer could catch InconsistentGroupProtocolException during rebalance
[ https://issues.apache.org/jira/browse/KAFKA-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057459#comment-17057459 ] ASF GitHub Bot commented on KAFKA-9701: --- guozhangwang commented on pull request #8272: KAFKA-9701: Add more debug log on client to reproduce the issue URL: https://github.com/apache/kafka/pull/8272 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Consumer could catch InconsistentGroupProtocolException during rebalance > > > Key: KAFKA-9701 > URL: https://issues.apache.org/jira/browse/KAFKA-9701 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.5.0 >Reporter: Boyang Chen >Priority: Major > > INFO log shows that we accidentally hit an unexpected inconsistent group > protocol exception: > [2020-03-10T17:16:53-07:00] > (streams-soak-2-5-eos-broker-2-5_soak_i-00067445452c82fe8_streamslog) > [2020-03-11 *00:16:53,382*] INFO > [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1] > stream-client [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949] State > transition from REBALANCING to RUNNING (org.apache.kafka.streams.KafkaStreams) > > [2020-03-10T17:16:53-07:00] > (streams-soak-2-5-eos-broker-2-5_soak_i-00067445452c82fe8_streamslog) > [2020-03-11 *00:16:53,384*] WARN [kafka-producer-network-thread | > stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1-0_1-producer] > stream-thread > [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1] task > [0_1] Error sending record to topic node-name-repartition due to Producer > attempted an operation with an old epoch. 
Either there is a newer producer > with the same transactionalId, or the producer's transaction has been expired > by the broker.; No more records will be sent and no more offsets will be > recorded for this task. > > > [2020-03-10T17:16:53-07:00] > (streams-soak-2-5-eos-broker-2-5_soak_i-00067445452c82fe8_streamslog) > [2020-03-11 *00:16:53,521*] INFO > [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1] > [Consumer > clientId=stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1-consumer, > groupId=stream-soak-test] Member > stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1-consumer-d1c3c796-0bfb-4c1c-9fb4-5a807d8b53a2 > sending LeaveGroup request to coordinator > ip-172-31-20-215.us-west-2.compute.internal:9092 (id: 2147482646 rack: null) > due to the consumer unsubscribed from all topics > (org.apache.kafka.clients.consumer.internals.AbstractCoordinator) > > [2020-03-10T17:16:54-07:00] > (streams-soak-2-5-eos-broker-2-5_soak_i-00067445452c82fe8_streamslog) > [2020-03-11 *00:16:53,798*] ERROR > [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1] > stream-thread > [stream-soak-test-d3da8597-c371-450e-81d9-72aea6a26949-StreamThread-1] > Encountered the following unexpected Kafka exception during processing, this > usually indicate Streams internal errors: > (org.apache.kafka.streams.processor.internals.StreamThread) > [2020-03-10T17:16:54-07:00] > (streams-soak-2-5-eos-broker-2-5_soak_i-00067445452c82fe8_streamslog) > org.apache.kafka.common.errors.InconsistentGroupProtocolException: The group > member's supported protocols are incompatible with those of existing members > or first group member tried to join with empty protocol type or empty > protocol list. > > Potentially needs further log to understand this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-6145) Warm up new KS instances before migrating tasks - potentially a two phase rebalance
[ https://issues.apache.org/jira/browse/KAFKA-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057452#comment-17057452 ] ASF GitHub Bot commented on KAFKA-6145: --- ableegoldman commented on pull request #8282: KAFKA-6145: add new assignment configs URL: https://github.com/apache/kafka/pull/8282 For KIP-441 we intend to add 4 new configs: 1. assignment.acceptable.recovery.lag 2. assignment.balance.factor 3. assignment.max.extra.replicas 4. assignment.probing.rebalance.interval.ms I think we should give them all a common prefix to make it clear they're related, but am open to better suggestions than just "assignment" > Warm up new KS instances before migrating tasks - potentially a two phase > rebalance > --- > > Key: KAFKA-6145 > URL: https://issues.apache.org/jira/browse/KAFKA-6145 > Project: Kafka > Issue Type: New Feature > Components: streams >Reporter: Antony Stubbs >Assignee: Sophie Blee-Goldman >Priority: Major > Labels: needs-kip > > Currently when expanding the KS cluster, the new node's partitions will be > unavailable during the rebalance, which for large states can take a very long > time, or for small state stores even more than a few ms can be a deal breaker > for micro service use cases. > One workaround would be to execute the rebalance in two phases: > 1) start running state store building on the new node > 2) once the state store is fully populated on the new node, only then > rebalance the tasks - there will still be a rebalance pause, but it would be > greatly reduced > Relates to: KAFKA-6144 - Allow state stores to serve stale reads during > rebalance
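If the four proposed names shipped as-is, they would be set like any other Kafka Streams config, e.g. in a properties file. A sketch only: the names below are the proposals from the comment above (the final KIP-441 config names may differ) and the values are placeholders, not recommendations:

```shell
# Sketch only: config names are the KIP-441 proposals from the comment above
# and may not match what finally shipped; values are placeholders.
f=$(mktemp)
cat > "$f" <<'EOF'
assignment.acceptable.recovery.lag=10000
assignment.balance.factor=1
assignment.max.extra.replicas=2
assignment.probing.rebalance.interval.ms=600000
EOF
grep -c '^assignment\.' "$f"   # prints 4: all share the proposed "assignment." prefix
```

The shared `assignment.` prefix is the point being debated in the comment: it groups the related knobs without committing to any particular value.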
[jira] [Resolved] (KAFKA-9605) EOS Producer could throw illegal state if trying to complete a failed batch after fatal error
[ https://issues.apache.org/jira/browse/KAFKA-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Gustafson resolved KAFKA-9605. Resolution: Fixed > EOS Producer could throw illegal state if trying to complete a failed batch > after fatal error > - > > Key: KAFKA-9605 > URL: https://issues.apache.org/jira/browse/KAFKA-9605 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.0, 2.5.0 >Reporter: Boyang Chen >Assignee: Boyang Chen >Priority: Major > Fix For: 2.6.0 > > > In the Producer we could see network client hits fatal exception while trying > to complete the batches after Txn manager hits fatal fenced error: > {code:java} > > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,673] ERROR [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Aborting producer batches due to fatal > error (org.apache.kafka.clients.producer.internals.Sender) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) > org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an > operation with an old epoch. Either there is a newer producer with the same > transactionalId, or the producer's transaction has been expired by the broker. > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,674] INFO > [stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-0_0-producer, > transactionalId=stream-soak-test-0_0] Closing the Kafka producer with > timeoutMillis = 9223372036854775807 ms. 
> (org.apache.kafka.clients.producer.KafkaProducer) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,684] INFO [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Resetting sequence number of batch > with current sequence 354277 for partition windowed-node-counts-0 to 354276 > (org.apache.kafka.clients.producer.internals.TransactionManager) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,684] INFO [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > Resetting sequence number of batch with current sequence 354277 for > partition windowed-node-counts-0 to 354276 > (org.apache.kafka.clients.producer.internals.ProducerBatch) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,685] ERROR [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Uncaught error in request completion: > (org.apache.kafka.clients.NetworkClient) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) > java.lang.IllegalStateException: Should not reopen a batch which is already > aborted. 
> at > org.apache.kafka.common.record.MemoryRecordsBuilder.reopenAndRewriteProducerState(MemoryRecordsBuilder.java:295) > at > org.apache.kafka.clients.producer.internals.ProducerBatch.resetProducerState(ProducerBatch.java:395) > at > org.apache.kafka.clients.producer.internals.TransactionManager.lambda$adjustSequencesDueToFailedBatch$4(TransactionManager.java:770) > at > org.apache.kafka.clients.producer.internals.TransactionManager$TopicPartitionEntry.resetSequenceNumbers(TransactionManager.java:180) > at > org.apache.kafka.clients.producer.internals.TransactionManager.adjustSequencesDueToFailedBatch(TransactionManager.java:760) > at > org.apache.kafka.clients.producer.internals.TransactionManager.handleFailedBatch(TransactionManager.java:735) > at > org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:671) > at > org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:662) > at > org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:620)
[jira] [Commented] (KAFKA-9605) EOS Producer could throw illegal state if trying to complete a failed batch after fatal error
[ https://issues.apache.org/jira/browse/KAFKA-9605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057443#comment-17057443 ] ASF GitHub Bot commented on KAFKA-9605: --- hachikuji commented on pull request #8177: KAFKA-9605: Do not attempt to abort batches when txn manager is in fatal error URL: https://github.com/apache/kafka/pull/8177 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > EOS Producer could throw illegal state if trying to complete a failed batch > after fatal error > - > > Key: KAFKA-9605 > URL: https://issues.apache.org/jira/browse/KAFKA-9605 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.0, 2.5.0 >Reporter: Boyang Chen >Assignee: Boyang Chen >Priority: Major > Fix For: 2.6.0 > > > In the Producer we could see network client hits fatal exception while trying > to complete the batches after Txn manager hits fatal fenced error: > {code:java} > > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,673] ERROR [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Aborting producer batches due to fatal > error (org.apache.kafka.clients.producer.internals.Sender) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) > org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an > operation with an old epoch. Either there is a newer producer with the same > transactionalId, or the producer's transaction has been expired by the broker. 
> [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,674] INFO > [stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-0_0-producer, > transactionalId=stream-soak-test-0_0] Closing the Kafka producer with > timeoutMillis = 9223372036854775807 ms. > (org.apache.kafka.clients.producer.KafkaProducer) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,684] INFO [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Resetting sequence number of batch > with current sequence 354277 for partition windowed-node-counts-0 to 354276 > (org.apache.kafka.clients.producer.internals.TransactionManager) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,684] INFO [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > Resetting sequence number of batch with current sequence 354277 for > partition windowed-node-counts-0 to 354276 > (org.apache.kafka.clients.producer.internals.ProducerBatch) > [2020-02-24T13:23:29-08:00] > (streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) [2020-02-24 > 21:23:28,685] ERROR [kafka-producer-network-thread | > stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer] > [Producer > clientId=stream-soak-test-5e16fa60-12a3-4c4f-9900-c75f7d10859f-StreamThread-3-1_0-producer, > transactionalId=stream-soak-test-1_0] Uncaught error in request completion: > (org.apache.kafka.clients.NetworkClient) > [2020-02-24T13:23:29-08:00] > 
(streams-soak-trunk-eos_soak_i-02ea56d369c55eec2_streamslog) > java.lang.IllegalStateException: Should not reopen a batch which is already > aborted. > at > org.apache.kafka.common.record.MemoryRecordsBuilder.reopenAndRewriteProducerState(MemoryRecordsBuilder.java:295) > at > org.apache.kafka.clients.producer.internals.ProducerBatch.resetProducerState(ProducerBatch.java:395) > at > org.apache.kafka.clients.producer.internals.TransactionManager.lambda$adjustSequencesDueToFailedBatch$4(TransactionManager.java:770) > at > org.apache.kafka.clients.producer.internals.TransactionManager$TopicPartitionEntry.resetSequenceNumbers(TransactionManager.java:180) > at > org.a
[jira] [Commented] (KAFKA-7965) Flaky Test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
[ https://issues.apache.org/jira/browse/KAFKA-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057442#comment-17057442 ] Matthias J. Sax commented on KAFKA-7965: [https://builds.apache.org/job/kafka-pr-jdk11-scala2.13/5122/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/] > Flaky Test > ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup > > > Key: KAFKA-7965 > URL: https://issues.apache.org/jira/browse/KAFKA-7965 > Project: Kafka > Issue Type: Bug > Components: clients, consumer, unit tests >Affects Versions: 1.1.1, 2.2.0, 2.3.0 >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > Fix For: 2.3.0 > > > To get stable nightly builds for `2.2` release, I create tickets for all > observed test failures. > [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/] > {quote}java.lang.AssertionError: Received 0, expected at least 68 at > org.junit.Assert.fail(Assert.java:88) at > org.junit.Assert.assertTrue(Assert.java:41) at > kafka.api.ConsumerBounceTest.receiveAndCommit(ConsumerBounceTest.scala:557) > at > kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1(ConsumerBounceTest.scala:320) > at > kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1$adapted(ConsumerBounceTest.scala:319) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at > kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:319){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9451) Pass consumer group metadata to producer on commit
[ https://issues.apache.org/jira/browse/KAFKA-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057441#comment-17057441 ] ASF GitHub Bot commented on KAFKA-9451: --- mjsax commented on pull request #8215: KAFKA-9451: Update MockConsumer to support ConsumerGroupMetadata URL: https://github.com/apache/kafka/pull/8215 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Pass consumer group metadata to producer on commit > -- > > Key: KAFKA-9451 > URL: https://issues.apache.org/jira/browse/KAFKA-9451 > Project: Kafka > Issue Type: Sub-task > Components: streams >Reporter: Matthias J. Sax >Assignee: Matthias J. Sax >Priority: Major > > Using the producer-per-thread EOS design, we need to pass the consumer group > metadata into `producer.sendOffsetsToTransaction()` to use the new consumer > group coordinator fencing mechanism. We should also reduce the default > transaction timeout to 10 seconds (compare the KIP for details). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9708) Connector does not prefer to use packaged classes during configuration
[ https://issues.apache.org/jira/browse/KAFKA-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057438#comment-17057438 ] ASF GitHub Bot commented on KAFKA-9708: --- gharris1727 commented on pull request #8281: KAFKA-9708: Use PluginClassLoader during connector startup URL: https://github.com/apache/kafka/pull/8281 * Use classloading prioritization from Worker::startTask in Worker::startConnector Signed-off-by: Greg Harris *Summary of testing strategy (including rationale) for the feature or bug fix. Unit and/or integration tests are expected for any behaviour change and system tests should be considered for larger changes.* ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Connector does not prefer to use packaged classes during configuration > -- > > Key: KAFKA-9708 > URL: https://issues.apache.org/jira/browse/KAFKA-9708 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > > In connector tasks, classes loaded during configuration are preferentially > loaded from the PluginClassLoader since KAFKA-8819 was implemented. This same > prioritization is not currently respected in the connector itself, where the > delegating classloader is used as the context classloader. This leads to the > possibility for different versions of converters to be loaded, or different > versions of dependencies to be found when executing code in the connector vs > task. 
> Worker::startConnector should be changed to follow the startTask / KAFKA-8819 > prioritization scheme, by activating the PluginClassLoader earlier. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9708) Connector does not prefer to use packaged classes during configuration
Greg Harris created KAFKA-9708: -- Summary: Connector does not prefer to use packaged classes during configuration Key: KAFKA-9708 URL: https://issues.apache.org/jira/browse/KAFKA-9708 Project: Kafka Issue Type: Bug Components: KafkaConnect Reporter: Greg Harris Assignee: Greg Harris In connector tasks, classes loaded during configuration are preferentially loaded from the PluginClassLoader since KAFKA-8819 was implemented. This same prioritization is not currently respected in the connector itself, where the delegating classloader is used as the context classloader. This leads to the possibility for different versions of converters to be loaded, or different versions of dependencies to be found when executing code in the connector vs task. Worker::startConnector should be changed to follow the startTask / KAFKA-8819 prioritization scheme, by activating the PluginClassLoader earlier. -- This message was sent by Atlassian Jira (v8.3.4#803005)
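The prioritization change described above amounts to activating the plugin's classloader as the thread context classloader for the duration of connector startup and restoring it afterwards. A minimal self-contained sketch of that swap-and-restore pattern (class and method names here are illustrative stand-ins, not Connect's real `Worker`/`PluginClassLoader` code):

```java
// Sketch of the classloader prioritization discussed in KAFKA-9708: run an
// action with the plugin's loader as the context classloader, restoring the
// previous (delegating) loader afterwards, even if the action fails.
public class ClassLoaderSwapDemo {
    /** Runs `action` with `pluginLoader` as the context classloader, then restores the old one. */
    static void withPluginClassLoader(ClassLoader pluginLoader, Runnable action) {
        Thread current = Thread.currentThread();
        ClassLoader saved = current.getContextClassLoader();  // typically the delegating loader
        current.setContextClassLoader(pluginLoader);
        try {
            action.run();  // connector configuration/startup would happen here
        } finally {
            current.setContextClassLoader(saved);  // always restore, even on failure
        }
    }

    public static void main(String[] args) {
        ClassLoader plugin = new ClassLoader(ClassLoaderSwapDemo.class.getClassLoader()) {};
        ClassLoader before = Thread.currentThread().getContextClassLoader();
        withPluginClassLoader(plugin, () -> {
            // inside the action, context-classloader lookups hit the plugin loader first
            if (Thread.currentThread().getContextClassLoader() != plugin)
                throw new AssertionError("plugin loader not active");
        });
        if (Thread.currentThread().getContextClassLoader() != before)
            throw new AssertionError("context loader not restored");
        System.out.println("swap-and-restore OK");
    }
}
```

With this shape, classes loaded during connector configuration resolve through the plugin loader first, matching the task-startup behavior from KAFKA-8819.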
[jira] [Commented] (KAFKA-9707) InsertField.Key transformation does not apply to tombstone records
[ https://issues.apache.org/jira/browse/KAFKA-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057429#comment-17057429 ] ASF GitHub Bot commented on KAFKA-9707: --- gharris1727 commented on pull request #8280: KAFKA-9707: Fix InsertField.Key not applying to tombstone events URL: https://github.com/apache/kafka/pull/8280 * Fix typo that hardcoded .value() instead of abstract operatingValue * Add test for Key transform that was previously not tested Signed-off-by: Greg Harris ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > InsertField.Key transformation does not apply to tombstone records > -- > > Key: KAFKA-9707 > URL: https://issues.apache.org/jira/browse/KAFKA-9707 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > > Reproduction steps: > # Configure an InsertField.Key transformation > # Pass a tombstone record (with non-null key, but null value) through the > transform > Expected behavior: > The key field is inserted, and the value remains null > Observed behavior: > The key field is not inserted, and the value remains null -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9707) InsertField.Key transformation does not apply to tombstone records
Greg Harris created KAFKA-9707: -- Summary: InsertField.Key transformation does not apply to tombstone records Key: KAFKA-9707 URL: https://issues.apache.org/jira/browse/KAFKA-9707 Project: Kafka Issue Type: Bug Components: KafkaConnect Reporter: Greg Harris Assignee: Greg Harris Reproduction steps: # Configure an InsertField.Key transformation # Pass a tombstone record (with non-null key, but null value) through the transform Expected behavior: The key field is inserted, and the value remains null Observed behavior: The key field is not inserted, and the value remains null -- This message was sent by Atlassian Jira (v8.3.4#803005)
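The fix above is a one-line typo repair: shared transform logic is supposed to go through an abstract hook that resolves to the key or the value depending on the variant, and hardcoding `.value()` makes the Key variant wrongly skip tombstones. A self-contained sketch of that bug class (SimpleRecord and the class names are illustrative stand-ins, not Connect's real classes):

```java
// Sketch of the bug fixed in KAFKA-9707: Key/Value transform variants share
// logic through an abstract operatingValue() hook. The typo hardcoded the
// equivalent of r.value() in the shared null check, so InsertField.Key skipped
// tombstone records (non-null key, null value).
public class InsertFieldSketch {
    static final class SimpleRecord {
        final Object key, value;
        SimpleRecord(Object key, Object value) { this.key = key; this.value = value; }
        Object key() { return key; }
        Object value() { return value; }
    }

    abstract static class InsertField {
        abstract Object operatingValue(SimpleRecord r);  // key or value, per subclass

        boolean applies(SimpleRecord r) {
            // Correct: consult the abstract hook. The bug was checking
            // r.value() here, which is wrong for the Key variant on tombstones.
            return operatingValue(r) != null;
        }
    }

    static final class Key extends InsertField {
        Object operatingValue(SimpleRecord r) { return r.key(); }
    }

    static final class Value extends InsertField {
        Object operatingValue(SimpleRecord r) { return r.value(); }
    }

    public static void main(String[] args) {
        SimpleRecord tombstone = new SimpleRecord("k1", null);  // tombstone: key set, value null
        System.out.println(new Key().applies(tombstone));    // true: key transform still applies
        System.out.println(new Value().applies(tombstone));  // false: no value to operate on
    }
}
```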
[jira] [Commented] (KAFKA-9706) Flatten transformation fails when encountering tombstone event
[ https://issues.apache.org/jira/browse/KAFKA-9706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057408#comment-17057408 ] ASF GitHub Bot commented on KAFKA-9706: --- gharris1727 commented on pull request #8279: KAFKA-9706: Handle null keys/values in Flatten transformation URL: https://github.com/apache/kafka/pull/8279 * Fix DataException thrown when handling tombstone events with null value * Passes through original record when finding a tombstone record * Add tests for schema and schemaless data Signed-off-by: Greg Harris ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Flatten transformation fails when encountering tombstone event > -- > > Key: KAFKA-9706 > URL: https://issues.apache.org/jira/browse/KAFKA-9706 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Reporter: Greg Harris >Assignee: Greg Harris >Priority: Major > > When applying the {{Flatten}} transformation to a tombstone event, an > exception is raised: > {code:java} > org.apache.kafka.connect.errors.DataException: Only Map objects supported in > absence of schema for [flattening], found: null > {code} > Instead, the transform should pass the tombstone through the transform > without throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9706) Flatten transformation fails when encountering tombstone event
Greg Harris created KAFKA-9706: -- Summary: Flatten transformation fails when encountering tombstone event Key: KAFKA-9706 URL: https://issues.apache.org/jira/browse/KAFKA-9706 Project: Kafka Issue Type: Bug Components: KafkaConnect Reporter: Greg Harris Assignee: Greg Harris When applying the {{Flatten}} transformation to a tombstone event, an exception is raised: {code:java} org.apache.kafka.connect.errors.DataException: Only Map objects supported in absence of schema for [flattening], found: null {code} Instead, the transform should pass the tombstone through the transform without throwing an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
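The expected behavior above — pass a tombstone through unchanged instead of raising `DataException` — can be sketched with a null guard at the top of the schemaless apply path. This is a simplified stand-in for Connect's Flatten transform, not its actual code:

```java
// Sketch of the KAFKA-9706 fix for the schemaless path: a null value (a
// tombstone) is returned unchanged rather than treated as a flattening error.
import java.util.LinkedHashMap;
import java.util.Map;

public class FlattenTombstoneSketch {
    static Map<String, Object> apply(Map<String, Object> value) {
        if (value == null) {
            return null;  // tombstone: pass through untouched, do not throw
        }
        Map<String, Object> flat = new LinkedHashMap<>();
        flatten("", value, flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    static void flatten(String prefix, Map<String, Object> map, Map<String, Object> out) {
        for (Map.Entry<String, Object> e : map.entrySet()) {
            String name = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            Object v = e.getValue();
            if (v instanceof Map) {
                flatten(name, (Map<String, Object>) v, out);  // recurse into nested maps
            } else {
                out.put(name, v);  // null leaf values are kept, not an error
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(apply(null));  // null (tombstone passed through)
        Map<String, Object> nested = Map.of("A", Map.of("C", "cValue"));
        System.out.println(apply(nested));  // {A.C=cValue}
    }
}
```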
[jira] [Commented] (KAFKA-1231) Support partition shrink (delete partition)
[ https://issues.apache.org/jira/browse/KAFKA-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057392#comment-17057392 ] Sönke Liebau commented on KAFKA-1231: - To be honest I think this caters to a bit of a fringe case and would create more issues than it could potentially solve - and since this issue has been open and uncommented for going on 6 years, anybody opposed to closing it? > Support partition shrink (delete partition) > --- > > Key: KAFKA-1231 > URL: https://issues.apache.org/jira/browse/KAFKA-1231 > Project: Kafka > Issue Type: Wish > Components: core >Affects Versions: 0.8.0 >Reporter: Marc Labbe >Priority: Minor > > Topics may have to be sized down once its peak usage has been passed. It > would be interesting to be able to reduce the number of partitions for a > topic. > In reference to http://search-hadoop.com/m/4TaT4hQiGt1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-1518) KafkaMetricsReporter prevents Kafka from starting if the custom reporter throws an exception
[ https://issues.apache.org/jira/browse/KAFKA-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057389#comment-17057389 ] Sönke Liebau commented on KAFKA-1518: - Personally I'd say, that this behavior is fine. If you don't want Kafka to stop then the reporter should catch and ignore the exception. But if an exception is propagated up to Kafka, then it shouldn't just be ignored. > KafkaMetricsReporter prevents Kafka from starting if the custom reporter > throws an exception > > > Key: KAFKA-1518 > URL: https://issues.apache.org/jira/browse/KAFKA-1518 > Project: Kafka > Issue Type: Bug > Components: tools >Affects Versions: 0.8.1.1 >Reporter: Daniel Compton >Priority: Major > > When Kafka starts up, it > [starts|https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/metrics/KafkaMetricsReporter.scala#L60] > custom metrics reporters. If any of these throw an exception on startup then > this will block Kafka from starting. > For example using > [kafka-riemann-reporter|https://github.com/pingles/kafka-riemann-reporter], > if Riemann is not available then it will throw an [IO > Exception|https://github.com/pingles/kafka-riemann-reporter/issues/1] which > isn't caught by KafkaMetricsReporter. This means that Kafka will fail to > start until it can connect to Riemann, coupling a HA system to a non HA > system. > It would probably be better to log the error and perhaps have a callback hook > to the reporter where they can handle it and startup in the background. -- This message was sent by Atlassian Jira (v8.3.4#803005)
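The mitigation the reporter suggests — log the error and start up anyway — amounts to wrapping each reporter's initialization in its own try/catch. A minimal sketch under that assumption (the `MetricsReporter` interface here is a simplified stand-in for Kafka's):

```java
// Sketch of the KAFKA-1518 mitigation: one failing custom metrics reporter
// (e.g. Riemann unreachable) is logged and skipped instead of propagating an
// exception that blocks broker startup.
import java.util.ArrayList;
import java.util.List;

public class ReporterStartupSketch {
    interface MetricsReporter { void init(); }

    /** Returns the reporters that started successfully; failures are logged and skipped. */
    static List<MetricsReporter> startReporters(List<MetricsReporter> reporters) {
        List<MetricsReporter> started = new ArrayList<>();
        for (MetricsReporter r : reporters) {
            try {
                r.init();
                started.add(r);
            } catch (Exception e) {
                // log and continue instead of propagating: broker startup is not blocked
                System.err.println("Metrics reporter failed to start, skipping: " + e.getMessage());
            }
        }
        return started;
    }

    public static void main(String[] args) {
        MetricsReporter ok = () -> {};
        MetricsReporter broken = () -> { throw new RuntimeException("cannot connect to Riemann"); };
        List<MetricsReporter> started = startReporters(List.of(broken, ok));
        System.out.println(started.size());  // 1: startup survived the broken reporter
    }
}
```

The counter-argument in the comment above is also visible in this shape: swallowing the exception here is exactly what a reporter could do for itself, which is why the commenter considers the current fail-fast behavior acceptable.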
[jira] [Commented] (KAFKA-1440) Per-request tracing
[ https://issues.apache.org/jira/browse/KAFKA-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057379#comment-17057379 ] Sönke Liebau commented on KAFKA-1440: - Looking through some old tickets and came across this. While certainly useful in principle, this seems like a fairly large piece of work by now. Can we close this, or is this something that someone feels strongly about and wants to look at? > Per-request tracing > --- > > Key: KAFKA-1440 > URL: https://issues.apache.org/jira/browse/KAFKA-1440 > Project: Kafka > Issue Type: Improvement > Components: clients >Reporter: Todd Palino >Priority: Major > > Could we have a flag in requests (potentially in the header for all requests, > but at least for produce and fetch) that would enable tracing for that > request. Currently, if we want to debug an issue with a request, we need to > turn on trace level logging for the entire broker. If the client could ask > for the request to be traced, we could then log detailed information for just > that request. > Ideally, this would be available as a flag that can be enabled in the client > via JMX, so it could be done without needing to restart the client > application. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-5972) Flatten SMT does not work with null values
[ https://issues.apache.org/jira/browse/KAFKA-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris updated KAFKA-5972: --- Fix Version/s: 2.2.2 2.4.0 2.3.1 2.1.2 2.0.2 1.1.2 1.0.3 > Flatten SMT does not work with null values > -- > > Key: KAFKA-5972 > URL: https://issues.apache.org/jira/browse/KAFKA-5972 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Tomas Zuklys >Assignee: siva santhalingam >Priority: Minor > Labels: easyfix, patch > Fix For: 1.0.3, 1.1.2, 2.0.2, 2.1.2, 2.2.2, 2.4.0, 2.3.1 > > Attachments: kafka-transforms.patch > > > Hi, > I noticed a bug in Flatten SMT while doing tests with different SMTs that are > provided out of the box. > Flatten SMT does not work as expected with schemaless JSON that has > properties with null values. > Example json: > {code} > {A={D=dValue, B=null, C=cValue}} > {code} > The issue is in the if statement that checks for null values. > Current version: > {code} > for (Map.Entry<String, Object> entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > return; > } > ... > {code} > should be > {code} > for (Map.Entry<String, Object> entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > continue; > } > {code} > I have attached a patch containing the fix for this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-5972) Flatten SMT does not work with null values
[ https://issues.apache.org/jira/browse/KAFKA-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Harris resolved KAFKA-5972. Resolution: Fixed > Flatten SMT does not work with null values > -- > > Key: KAFKA-5972 > URL: https://issues.apache.org/jira/browse/KAFKA-5972 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 0.11.0.0, 0.11.0.1 >Reporter: Tomas Zuklys >Assignee: siva santhalingam >Priority: Minor > Labels: easyfix, patch > Fix For: 1.0.3, 1.1.2, 2.0.2, 2.1.2, 2.3.1, 2.4.0, 2.2.2 > > Attachments: kafka-transforms.patch > > > Hi, > I noticed a bug in Flatten SMT while doing tests with different SMTs that are > provided out-of-box. > Flatten SMT does not work as expected with schemaless JSON that has > properties with null values. > Example json: > {code} > {A={D=dValue, B=null, C=cValue}} > {code} > The issue is in if statement that checks for null value. > Current version: > {code} > for (Map.Entry entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > return; > } > ... > {code} > should be > {code} > for (Map.Entry entry : originalRecord.entrySet()) { > final String fieldName = fieldName(fieldNamePrefix, > entry.getKey()); > Object value = entry.getValue(); > if (value == null) { > newRecord.put(fieldName(fieldNamePrefix, entry.getKey()), > null); > continue; > } > {code} > I have attached a patch containing the fix for this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
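The one-word `return`-to-`continue` fix quoted in the report above has a behavioral difference worth seeing run: with `return`, the first null value aborts the whole loop and silently drops every remaining entry; with `continue`, the null is recorded and flattening proceeds. A runnable demonstration using the ticket's example map `{A={D=dValue, B=null, C=cValue}}` (simplified to one level; not Connect's actual Flatten code):

```java
// Demonstrates the KAFKA-5972 bug: `return` inside the entry loop drops all
// entries after the first null value, while `continue` preserves them.
import java.util.LinkedHashMap;
import java.util.Map;

public class ReturnVsContinue {
    static Map<String, Object> flattenBuggy(Map<String, Object> original) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : original.entrySet()) {
            if (e.getValue() == null) { out.put(e.getKey(), null); return out; }  // bug: aborts loop
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    static Map<String, Object> flattenFixed(Map<String, Object> original) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : original.entrySet()) {
            if (e.getValue() == null) { out.put(e.getKey(), null); continue; }  // fix: keep going
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> a = new LinkedHashMap<>();
        a.put("D", "dValue"); a.put("B", null); a.put("C", "cValue");
        System.out.println(flattenBuggy(a).size()); // 2: C was silently dropped
        System.out.println(flattenFixed(a).size()); // 3: all entries preserved
    }
}
```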
[jira] [Commented] (KAFKA-1336) Create partition compaction analyzer
[ https://issues.apache.org/jira/browse/KAFKA-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057373#comment-17057373 ] Sönke Liebau commented on KAFKA-1336: - While I agree in principle that this information might be useful, the age and inactivity of this ticket suggests to me that this is not a priority and might be closed? > Create partition compaction analyzer > > > Key: KAFKA-1336 > URL: https://issues.apache.org/jira/browse/KAFKA-1336 > Project: Kafka > Issue Type: New Feature >Reporter: Jay Kreps >Priority: Major > > It would be nice to have a tool that given a topic and partition reads the > full data and outputs: > 1. The percentage of records that are duplicates > 2. The percentage of records that have been deleted -- This message was sent by Atlassian Jira (v8.3.4#803005)
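The two statistics the ticket asks for can be computed in one pass over a partition's records: a record is a "duplicate" if a later record carries the same key, and "deleted" if its value is null (a tombstone). A sketch over an in-memory stand-in for a partition — a real tool would consume the partition with a Kafka consumer instead of iterating a list:

```java
// Sketch of the analysis proposed in KAFKA-1336: percentage of records
// superseded by a later record with the same key, and percentage of tombstones.
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompactionAnalyzerSketch {
    /** Returns {duplicatePct, deletedPct} for the given (key, value) records, in offset order. */
    static double[] analyze(List<Map.Entry<String, String>> records) {
        Map<String, Integer> lastIndex = new HashMap<>();
        for (int i = 0; i < records.size(); i++) lastIndex.put(records.get(i).getKey(), i);
        int duplicates = 0, deleted = 0;
        for (int i = 0; i < records.size(); i++) {
            if (lastIndex.get(records.get(i).getKey()) != i) duplicates++;  // superseded later
            if (records.get(i).getValue() == null) deleted++;               // tombstone
        }
        double n = records.size();
        return new double[] {100.0 * duplicates / n, 100.0 * deleted / n};
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records = List.of(
                new SimpleEntry<>("k1", "v1"),
                new SimpleEntry<>("k2", "v2"),
                new SimpleEntry<>("k1", "v3"),   // makes the first k1 record a duplicate
                new SimpleEntry<>("k3", null));  // tombstone
        double[] pct = analyze(records);
        System.out.printf("duplicates=%.0f%% deleted=%.0f%%%n", pct[0], pct[1]);  // 25% and 25%
    }
}
```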
[jira] [Resolved] (KAFKA-1292) Command-line tools should print what they do as part of their usage command
[ https://issues.apache.org/jira/browse/KAFKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sönke Liebau resolved KAFKA-1292. - Resolution: Fixed This has been implemented by now, I do believe within the linked ticket. > Command-line tools should print what they do as part of their usage command > --- > > Key: KAFKA-1292 > URL: https://issues.apache.org/jira/browse/KAFKA-1292 > Project: Kafka > Issue Type: Improvement >Reporter: Jay Kreps >Priority: Major > Labels: usability > > It would be nice if you could get an explanation for each tool by running the > command with no arguments. Something like > bin/kafka-preferred-replica-election.sh is a little scary so it would be nice > to have it self-describe: > > bin/kafka-preferred-replica-election.sh > This command attempts to return leadership to the preferred replicas (if they > are alive) from whomever is currently the leader). > Option Description > > -- --- > > --alter Alter the configuration for the topic. > ... -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-1265) SBT and Gradle create jars without expected Maven files
[ https://issues.apache.org/jira/browse/KAFKA-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057365#comment-17057365 ] Sönke Liebau commented on KAFKA-1265: - While it still holds true that the jar itself doesn't contain these files, the appropriate pom is published to maven central by Gradle and serves this purpose to a large extent. Also, as this has been open and uncommented for more than 6 years now, the pain seems not to be too great here :) I'd suggest we can close this? > SBT and Gradle create jars without expected Maven files > --- > > Key: KAFKA-1265 > URL: https://issues.apache.org/jira/browse/KAFKA-1265 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.8.0, 0.8.1 >Reporter: Clark Breyman >Priority: Minor > Labels: build > Original Estimate: 48h > Remaining Estimate: 48h > > The jar files produced and deployed to maven central do not embed the > expected Maven pom.xml and pom.properties files as would be expected by a > standard Maven-build artifact. This results in jars that do not self-document > their versions or dependencies. > For reference to the maven behavior, see addMavenDescriptor (defaults to > true): http://maven.apache.org/shared/maven-archiver/#archive > Worst case, these files would need to be generated by the build and included > in the jar file. Gradle doesn't seem to have an automatic mechanism for this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057352#comment-17057352 ] Raman Gupta edited comment on KAFKA-8803 at 3/11/20, 8:20 PM: -- Here are the logs from this morning, bracketing the problem by an hour or so on each side and including logs from the broker shutdowns and restarts. The first instance of `UNKNOWN_LEADER_EPOCH` is at 14:38:36. Our client-side streams were failing from 14:59 to 16:40; the first instance of the error is at 14:59:41. The error always seems to occur for the first time after the client restarts, so it's quite possible it's related to a streams shutdown process, although I don't think so because the UNKNOWN_LEADER_EPOCH happened earlier. It seems like whatever the issue was occurred on the broker side, but as long as the client side didn't restart, things were ok. Just a theory though. There is no IllegalStateException as seen by [~oleksii.boiko]. [^logs-20200311.txt.gz] [^logs-client-20200311.txt.gz] > Stream will not start due to TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Assignee: Sophie Blee-Goldman > Priority: Major > Attachments: logs-20200311.txt.gz, logs-client-20200311.txt.gz, > logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. 
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raman Gupta updated KAFKA-8803: --- Attachment: logs-client-20200311.txt.gz logs-20200311.txt.gz > Stream will not start due to TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Assignee: Sophie Blee-Goldman >Priority: Major > Attachments: logs-20200311.txt.gz, logs-client-20200311.txt.gz, > logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. 
> I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll
[ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057325#comment-17057325 ] Jun Rao commented on KAFKA-9693: [~paolomoriello]: By metadata, I was referring to the metadata in the file system (e.g., file length, timestamp, etc). Basically, I am just wondering what contention is causing the producer spike. The producer is writing to a different file from the one being flushed. So, there is no obvious data contention. > Kafka latency spikes caused by log segment flush on roll > > > Key: KAFKA-9693 > URL: https://issues.apache.org/jira/browse/KAFKA-9693 > Project: Kafka > Issue Type: Improvement > Components: core > Environment: OS: Amazon Linux 2 > Kafka version: 2.2.1 >Reporter: Paolo Moriello >Assignee: Paolo Moriello >Priority: Major > Labels: Performance, latency, performance > Attachments: image-2020-03-10-13-17-34-618.png, > image-2020-03-10-14-36-21-807.png, image-2020-03-10-15-00-23-020.png, > image-2020-03-10-15-00-54-204.png, latency_plot2.png > > > h1. Summary > When a log segment fills up, Kafka rolls over onto a new active segment and > force the flush of the old segment to disk. When this happens, log segment > _append_ duration increase causing important latency spikes on producer(s) > and replica(s). This ticket aims to highlight the problem and propose a > simple mitigation: add a new configuration to enable/disable rolled segment > flush. > h1. 1. Phenomenon > Response time of produce request (99th ~ 99.9th %ile) repeatedly spikes to > ~50x-200x more than usual. For instance, normally 99th %ile is lower than > 5ms, but when this issue occurs, it marks 100ms to 200ms. 99.9th and 99.99th > %iles even jump to 500-700ms. > Latency spikes happen at constant frequency (depending on the input > throughput), for small amounts of time. All the producers experience a > latency increase at the same time. > h1. !image-2020-03-10-13-17-34-618.png|width=942,height=314! 
> {{Example of response time plot observed during on a single producer.}} > URPs rarely appear in correspondence of the latency spikes too. This is > harder to reproduce, but from time to time it is possible to see a few > partitions going out of sync in correspondence of a spike. > h1. 2. Experiment > h2. 2.1 Setup > Kafka cluster hosted on AWS EC2 instances. > h4. Cluster > * 15 Kafka brokers: (EC2 m5.4xlarge) > ** Disk: 1100Gb EBS volumes (4750Mbps) > ** Network: 10 Gbps > ** CPU: 16 Intel Xeon Platinum 8000 > ** Memory: 64Gb > * 3 Zookeeper nodes: m5.large > * 6 producers on 6 EC2 instances in the same region > * 1 topic, 90 partitions - replication factor=3 > h4. Broker config > Relevant configurations: > {quote}num.io.threads=8 > num.replica.fetchers=2 > offsets.topic.replication.factor=3 > num.network.threads=5 > num.recovery.threads.per.data.dir=2 > min.insync.replicas=2 > num.partitions=1 > {quote} > h4. Perf Test > * Throughput ~6000-8000 (~40-70Mb/s input + replication = ~120-210Mb/s per > broker) > * record size = 2 > * Acks = 1, linger.ms = 1, compression.type = none > * Test duration: ~20/30min > h2. 2.2 Analysis > Our analysis showed an high +correlation between log segment flush count/rate > and the latency spikes+. This indicates that the spikes in max latency are > related to Kafka behavior on rolling over new segments. > The other metrics did not show any relevant impact on any hardware component > of the cluster, eg. cpu, memory, network traffic, disk throughput... > > !latency_plot2.png|width=924,height=308! > {{Correlation between latency spikes and log segment flush count. p50, p95, > p99, p999 and p latencies (left axis, ns) and the flush #count (right > axis, stepping blue line in plot).}} > Kafka schedules logs flushing (this includes flushing the file record > containing log entries, the offset index, the timestamp index and the > transaction index) during _roll_ operations. 
A log is rolled over onto a new > empty log when: > * the log segment is full > * the max time has elapsed since the timestamp of the first message in the > segment (or, in its absence, since the create time) > * the index is full > In this case, the increase in latency happens on _append_ of a new message > set to the active segment of the log. This is a synchronous operation which > therefore blocks producer requests, causing the latency increase. > To confirm this, I instrumented Kafka to measure the duration of > the FileRecords.append(MemoryRecords) method, which is responsible for writing > memory records to file. As a result, I observed the same spiky pattern as in > the producer latency, with a one-to-one correspondence with the append > duration. > !image-2020-0
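The mechanism described in this thread — a synchronous flush landing on the append path — can be observed in isolation with plain `java.nio`: time buffered appends to a file versus appends followed by an explicit `FileChannel.force`, the JVM-level analogue of the segment flush scheduled on roll. This is only an illustration of the contention, not Kafka's actual log code; absolute numbers vary by hardware and filesystem.

```java
// Standalone sketch: compares mean append latency with and without a
// synchronous flush (data + metadata) after each write, mirroring why a
// flush on the append path shows up as producer latency spikes.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushLatencySketch {
    static long timeAppends(FileChannel ch, boolean forceEach, int count) throws IOException {
        ByteBuffer payload = ByteBuffer.allocate(1024);
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            payload.rewind();
            ch.write(payload);                       // buffered append (page cache only)
            if (forceEach) ch.force(true);           // synchronous flush: data + file metadata
        }
        return (System.nanoTime() - start) / count;  // mean ns per append
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            long buffered = timeAppends(ch, false, 200);
            long forced = timeAppends(ch, true, 200);
            // the forced path is typically orders of magnitude slower
            System.out.println("buffered append ~" + buffered + " ns, forced ~" + forced + " ns");
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```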
[jira] [Commented] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll
[ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057313#comment-17057313 ] Paolo Moriello commented on KAFKA-9693: --- [~junrao] thank you again for you replay. To clarify, when you say metadata, are you referring to the indexes or also something else? In this case, it is likely due to a file system data contention only. I tried to temporarily disable the data log flush and got the same result in terms of latency improvement. Moreover, I observed the same spikes in the fswrite of the data itself. > Kafka latency spikes caused by log segment flush on roll > > > Key: KAFKA-9693 > URL: https://issues.apache.org/jira/browse/KAFKA-9693 > Project: Kafka > Issue Type: Improvement > Components: core > Environment: OS: Amazon Linux 2 > Kafka version: 2.2.1 >Reporter: Paolo Moriello >Assignee: Paolo Moriello >Priority: Major > Labels: Performance, latency, performance > Attachments: image-2020-03-10-13-17-34-618.png, > image-2020-03-10-14-36-21-807.png, image-2020-03-10-15-00-23-020.png, > image-2020-03-10-15-00-54-204.png, latency_plot2.png > > > h1. Summary > When a log segment fills up, Kafka rolls over onto a new active segment and > force the flush of the old segment to disk. When this happens, log segment > _append_ duration increase causing important latency spikes on producer(s) > and replica(s). This ticket aims to highlight the problem and propose a > simple mitigation: add a new configuration to enable/disable rolled segment > flush. > h1. 1. Phenomenon > Response time of produce request (99th ~ 99.9th %ile) repeatedly spikes to > ~50x-200x more than usual. For instance, normally 99th %ile is lower than > 5ms, but when this issue occurs, it marks 100ms to 200ms. 99.9th and 99.99th > %iles even jump to 500-700ms. > Latency spikes happen at constant frequency (depending on the input > throughput), for small amounts of time. 
> All the producers experience a latency increase at the same time.
> !image-2020-03-10-13-17-34-618.png|width=942,height=314!
> {{Example of response time plot observed on a single producer.}}
> URPs also occasionally appear in correspondence with the latency spikes. This is harder to reproduce, but from time to time it is possible to see a few partitions going out of sync at the same moment as a spike.
> h1. 2. Experiment
> h2. 2.1 Setup
> Kafka cluster hosted on AWS EC2 instances.
> h4. Cluster
> * 15 Kafka brokers (EC2 m5.4xlarge)
> ** Disk: 1100Gb EBS volumes (4750Mbps)
> ** Network: 10 Gbps
> ** CPU: 16 Intel Xeon Platinum 8000
> ** Memory: 64Gb
> * 3 Zookeeper nodes: m5.large
> * 6 producers on 6 EC2 instances in the same region
> * 1 topic, 90 partitions, replication factor=3
> h4. Broker config
> Relevant configurations:
> {quote}num.io.threads=8
> num.replica.fetchers=2
> offsets.topic.replication.factor=3
> num.network.threads=5
> num.recovery.threads.per.data.dir=2
> min.insync.replicas=2
> num.partitions=1
> {quote}
> h4. Perf Test
> * Throughput ~6000-8000 (~40-70Mb/s input + replication = ~120-210Mb/s per broker)
> * record size = 2
> * Acks = 1, linger.ms = 1, compression.type = none
> * Test duration: ~20/30min
> h2. 2.2 Analysis
> Our analysis showed a high +correlation between log segment flush count/rate and the latency spikes+. This indicates that the spikes in max latency are related to Kafka's behavior on rolling over new segments.
> The other metrics did not show any relevant impact on any hardware component of the cluster, e.g. CPU, memory, network traffic, disk throughput...
>
> !latency_plot2.png|width=924,height=308!
> {{Correlation between latency spikes and log segment flush count.
> p50, p95, p99, p999 and p9999 latencies (left axis, ns) and the flush count (right axis, stepping blue line in plot).}}
> Kafka schedules log flushing (this includes flushing the file record containing log entries, the offset index, the timestamp index and the transaction index) during _roll_ operations. A log is rolled over onto a new empty log when:
> * the log segment is full
> * the max time has elapsed since the timestamp of the first message in the segment (or, in its absence, since the create time)
> * the index is full
> In this case, the increase in latency happens on _append_ of a new message set to the active segment of the log. This is a synchronous operation which therefore blocks producer requests, causing the latency increase.
> To confirm this, I instrumented Kafka to measure the duration of the FileRecords.append(MemoryRecords) method, which is responsible for writing memory records to file. As a result, I observed the same spiky pattern as in the producer
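The mechanism discussed above, a synchronous fsync blocking the append path, can be reproduced outside Kafka. The sketch below is illustrative only (it is not Kafka's actual instrumentation; the file name and 64 KiB batch size are arbitrary): it times a plain buffered write, which usually lands in the page cache, against `FileChannel.force(true)`, which blocks until the data is durable, the analogue of the flush scheduled on segment roll.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushLatencyDemo {
    /** Writes one 64 KiB batch, then forces it to disk; returns
     *  {appendNanos, flushNanos}. force(true) blocks until data and
     *  metadata reach the device, which is why a flush on the append
     *  path shows up as a producer latency spike. */
    public static long[] measure() throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ByteBuffer batch = ByteBuffer.allocate(64 * 1024);

            long t0 = System.nanoTime();
            ch.write(batch);                 // usually only hits the page cache
            long appendNanos = System.nanoTime() - t0;

            long t1 = System.nanoTime();
            ch.force(true);                  // blocks until durable on disk
            long flushNanos = System.nanoTime() - t1;

            return new long[] {appendNanos, flushNanos};
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        long[] d = measure();
        System.out.printf("append: %d us, force(fsync): %d us%n",
                d[0] / 1_000, d[1] / 1_000);
    }
}
```

On most systems the force() duration dominates the buffered write by orders of magnitude, which matches the correlation reported in the ticket.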
[jira] [Created] (KAFKA-9705) (Incremental)AlterConfig should be propagated from Controller in bridge release
Boyang Chen created KAFKA-9705:
--
Summary: (Incremental)AlterConfig should be propagated from Controller in bridge release
Key: KAFKA-9705
URL: https://issues.apache.org/jira/browse/KAFKA-9705
Project: Kafka
Issue Type: Sub-task
Reporter: Boyang Chen
Assignee: Boyang Chen

In the bridge release, we need to restrict direct ZK access to the controller only. This means the existing AlterConfig path should be migrated.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057296#comment-17057296 ] Guozhang Wang commented on KAFKA-8803: -- Raman, could you share the broker-side logs (everything: the server logs, controller logs, anything you have) so that we can help investigate? Did you see the IllegalStateException that Oleksii saw?
> Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
>
> Key: KAFKA-8803
> URL: https://issues.apache.org/jira/browse/KAFKA-8803
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: Raman Gupta
> Assignee: Sophie Blee-Goldman
> Priority: Major
> Attachments: logs.txt.gz, screenshot-1.png
>
> One streams app is consistently failing at startup with the following exception:
> {code}
> 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout exception caught when initializing transactions for task 0_36. This might happen if the broker is slow to respond, if the network connection to the broker was interrupted, or if similar circumstances arise. You can increase producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
> {code}
> These same brokers are used by many other streams without any issue, including some in the very same processes as the stream which consistently throws this exception.
> *UPDATE 08/16:*
> The very first instance of this error is August 13th 2019, 17:03:36.754 and it happened for 4 different streams. For 3 of these streams, the error only happened once, and then the stream recovered. For the 4th stream, the error has continued to happen, and continues to happen now.
> I looked up the broker logs for this time, and see that at August 13th 2019, 16:47:43, two of four brokers started reporting messages like this, for multiple partitions:
> [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
> The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped; here is a view of the count of these messages over time:
> !screenshot-1.png!
> However, as noted, the stream task timeout error continues to happen.
> I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 brokers. The broker has a patch for KAFKA-8773.
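The error message above suggests raising the producer's `max.block.ms` (the bound on how long blocking calls such as `send()` or `initTransactions()` may wait). A minimal sketch of that override, using plain `java.util.Properties`; the broker address is a placeholder and the 5-minute value is only an example, not a recommendation:

```java
import java.util.Properties;

public class ProducerTimeoutConfig {
    /** Builds producer properties with a raised max.block.ms.
     *  The default is 60000 ms, which matches the "Timeout expired after
     *  60000milliseconds" in the stack trace above. */
    public static Properties producerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092"); // placeholder
        // Bound on how long send()/initTransactions() may block.
        props.setProperty("max.block.ms", "300000"); // e.g. raise to 5 minutes
        return props;
    }
}
```

As the later comments note, a longer timeout only helps when the underlying stall is transient; it cannot fix a wedged broker-side transaction.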
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057291#comment-17057291 ] Guozhang Wang commented on KAFKA-8803: -- [~rocketraman] Also, about your reported IllegalStateException: I just found that in later versions we have actually turned that illegal-state check into a warning: https://github.com/apache/kafka/pull/7840, as part of 2.5.0.
[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057285#comment-17057285 ] Guozhang Wang edited comment on KAFKA-8803 at 3/11/20, 6:09 PM: [~rocketraman] "Re: I don't think that is going to help. We currently let-it-crash in this situation, and retry on the next startup, and when this error happens, these retries do not work until the brokers are restarted." I guess my previous reply was a bit misleading. What I meant is that, from our investigation, some initPID timeouts are actually transient: they just take longer than 5 minutes, because the partition is unavailable, so the abort record cannot be written and replicated, and hence the transition to abort-prepare cannot be completed. In such cases retries would help. In other cases, we also observe that the timeout is permanent (this is, I believe, what you've encountered), and that we believe is a broker-side issue that we are still investigating, as I tried to explain in 2). was (Author: guozhang): [~rocketraman] I guess my previous reply was a bit misleading. What I meant is that, from our investigation, some initPID timeouts are actually transient: they just take longer than 5 minutes, because the partition is unavailable, so the abort record cannot be written and replicated, and hence the transition to abort-prepare cannot be completed. In such cases retries would help. In other cases, we also observe that the timeout is permanent (this is, I believe, what you've encountered), and that we believe is a broker-side issue that we are still investigating, as I tried to explain in 2).
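The transient-versus-permanent distinction above is why client-side retries only sometimes help. A generic sketch of a bounded retry with doubling backoff (a hypothetical helper, not Kafka client code; `withRetries` and its parameters are made up for illustration):

```java
import java.util.concurrent.Callable;

public class BoundedRetry {
    /** Retries the task up to maxAttempts, doubling the pause between tries.
     *  A transient failure succeeds on a later attempt; a permanent
     *  broker-side failure exhausts the budget and rethrows the last error. */
    public static <T> T withRetries(Callable<T> task, int maxAttempts,
                                    long initialBackoffMs) throws Exception {
        Exception last = null;
        long backoff = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff); // back off before the next attempt
                    backoff *= 2;
                }
            }
        }
        throw last; // budget exhausted: surface the final failure
    }
}
```

Against a wedged broker-side transaction, as Raman reports below, every attempt fails identically, so this loop only delays the crash.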
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057285#comment-17057285 ] Guozhang Wang commented on KAFKA-8803: -- [~rocketraman] "and due to time shift which may indeed be smaller if timer goes backwards. We should have a follow-up PR as in https://github.com/apache/kafka/pull/3286 to release this check. I will send a short PR shortly." I guess my previous reply was a bit misleading. What I meant is that, from our investigation, some initPID timeouts are actually transient: they just take longer than 5 minutes, because the partition is unavailable, so the abort record cannot be written and replicated, and hence the transition to abort-prepare cannot be completed. In such cases retries would help. In other cases, we also observe that the timeout is permanent (this is, I believe, what you've encountered), and that we believe is a broker-side issue that we are still investigating, as I tried to explain in 2).
[jira] [Comment Edited] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057285#comment-17057285 ] Guozhang Wang edited comment on KAFKA-8803 at 3/11/20, 6:08 PM: [~rocketraman] I guess my previous reply was a bit misleading. What I meant is that, from our investigation, some initPID timeouts are actually transient: they just take longer than 5 minutes, because the partition is unavailable, so the abort record cannot be written and replicated, and hence the transition to abort-prepare cannot be completed. In such cases retries would help. In other cases, we also observe that the timeout is permanent (this is, I believe, what you've encountered), and that we believe is a broker-side issue that we are still investigating, as I tried to explain in 2). was (Author: guozhang): [~rocketraman] "and due to time shift which may indeed be smaller if timer goes backwards. We should have a follow-up PR as in https://github.com/apache/kafka/pull/3286 to release this check. I will send a short PR shortly." I guess my previous reply was a bit misleading. What I meant is that, from our investigation, some initPID timeouts are actually transient: they just take longer than 5 minutes, because the partition is unavailable, so the abort record cannot be written and replicated, and hence the transition to abort-prepare cannot be completed. In such cases retries would help. In other cases, we also observe that the timeout is permanent (this is, I believe, what you've encountered), and that we believe is a broker-side issue that we are still investigating, as I tried to explain in 2).
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057281#comment-17057281 ] Raman Gupta commented on KAFKA-8803: [~guozhang] FYI we had this happen again today, and this time I did NOT see any errors similar to "The metadata cache for txn partition 22 has already exist with epoch 567 and 9 entries while trying to add to it; this should not happen", so it may or may not be related (probably not).
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057279#comment-17057279 ] Guozhang Wang commented on KAFKA-8803: -- https://github.com/apache/kafka/pull/8269
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057276#comment-17057276 ] ASF GitHub Bot commented on KAFKA-8803: --- guozhangwang commented on pull request #8278: KAFKA-8803: Remove timestamp check in completeTransitionTo URL: https://github.com/apache/kafka/pull/8278 In `prepareAddPartitions` the txnStartTimestamp can be updated to updateTimestamp, which is assumed to always be larger than the original startTimestamp. However, due to NTP time shift the timer may go backwards, and hence the new start timestamp may be smaller than the original one. Later, in `completeTransitionTo`, the time check then fails with an IllegalStateException, and the txn does not transition to Ongoing. An indirect result of this is that the txn would NEVER be expired anymore, because only Ongoing txns are checked for expiration. We should do the same as in https://github.com/apache/kafka/pull/3286 and remove this check.
### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
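The failure mode the PR describes reduces to a monotonicity check defeated by a non-monotonic clock. The toy state holder below is illustrative only (the class and method names merely echo the PR description; this is not Kafka's TransactionMetadata): it throws exactly when a new start timestamp precedes the stored one, which is what happens after a backwards NTP step and which the PR removes.

```java
public class TxnTimestampCheck {
    private long txnStartTimestamp;

    public TxnTimestampCheck(long initialTimestamp) {
        this.txnStartTimestamp = initialTimestamp;
    }

    /** Mimics the transition check: rejects a start timestamp that moved
     *  backwards. With wall-clock time (e.g. after an NTP step adjustment)
     *  this can throw even though nothing is actually wrong, leaving the
     *  transaction stuck outside the Ongoing state. */
    public void completeTransitionTo(long newStartTimestamp) {
        if (newStartTimestamp < txnStartTimestamp) {
            throw new IllegalStateException("txn start timestamp went backwards");
        }
        this.txnStartTimestamp = newStartTimestamp;
    }
}
```

Since only Ongoing transactions are scanned for expiration, a transaction wedged this way is never cleaned up, which is consistent with the permanent InitProducerId timeouts reported in this ticket.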
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057269#comment-17057269 ] Guozhang Wang commented on KAFKA-8803: -- [~oleksii.boiko] indeed had a great observation about one root cause of it: in `prepareAddPartitions` the txnStartTimestamp is updated as updateTimestamp, and due to time shift which may indeed be smaller if timer goes backwards. We should have a follow-up PR as in https://github.com/apache/kafka/pull/3286 to release this check. I will send a short PR shortly. [~rocketraman] I'm still investigating your observation of "The metadata cache for txn partition 22 has already exist with epoch 567 and 9 entries while trying to add to it; this should not happen", stay tuned. cc [~hachikuji] > Stream will not start due to TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Assignee: Sophie Blee-Goldman >Priority: Major > Attachments: logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 6milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes for the stream which consistently > throws this exception. 
> *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
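The clock-shift hazard Guozhang describes (a start timestamp recorded against an update timestamp that the wall clock has since moved behind) can be sketched in isolation. This is illustrative only: `TxnTimestamps` and `clampedStartTimestamp` are hypothetical names, not Kafka's actual coordinator code, and the real fix relaxes a validation check rather than clamping.

```java
// Illustrative sketch only: shows the invariant a backwards-moving clock can
// violate. The hypothetical helper clamps the recorded start timestamp so it
// can never exceed the last-update timestamp.
final class TxnTimestamps {

    static long clampedStartTimestamp(long startTimestamp, long updateTimestamp) {
        // If the wall clock shifted backwards between the two readings,
        // updateTimestamp < startTimestamp; keep the smaller value.
        return Math.min(startTimestamp, updateTimestamp);
    }

    public static void main(String[] args) {
        // Clock moved backwards between start and update: clamp to the update time.
        System.out.println(clampedStartTimestamp(1_000L, 900L));   // 900
        // Normal case: start precedes update, keep it.
        System.out.println(clampedStartTimestamp(1_000L, 1_500L)); // 1000
    }
}
```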
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057262#comment-17057262 ] Raman Gupta commented on KAFKA-8803: Thanks [~guozhang]. I'm a little confused by your response. Regarding the timeout and timeout retry: as mentioned previously in this issue, I don't think that is going to help. We currently let-it-crash in this situation and retry on the next startup, and when this error happens, these retries do not work until the brokers are restarted. Whatever state the brokers are in is not fixed by retries nor by longer timeouts.
[jira] [Commented] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll
[ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057256#comment-17057256 ] Jun Rao commented on KAFKA-9693: [~paolomoriello]: Thanks for the reply. Do you know if the increased latency is due to contention on the data or the metadata in the file system? If it's the data, it seems that by delaying the flush long enough, the producer would have moved to a different block from the one being flushed. If it's the metadata, we flush other checkpoint files periodically, and I am wondering if they cause producer latency spikes too.
> Kafka latency spikes caused by log segment flush on roll
>
> Key: KAFKA-9693
> URL: https://issues.apache.org/jira/browse/KAFKA-9693
> Project: Kafka
> Issue Type: Improvement
> Components: core
> Environment: OS: Amazon Linux 2
> Kafka version: 2.2.1
> Reporter: Paolo Moriello
> Assignee: Paolo Moriello
> Priority: Major
> Labels: Performance, latency, performance
> Attachments: image-2020-03-10-13-17-34-618.png,
> image-2020-03-10-14-36-21-807.png, image-2020-03-10-15-00-23-020.png,
> image-2020-03-10-15-00-54-204.png, latency_plot2.png
>
> h1. Summary
> When a log segment fills up, Kafka rolls over onto a new active segment and
> forces the flush of the old segment to disk. When this happens, log segment
> _append_ duration increases, causing significant latency spikes on producer(s)
> and replica(s). This ticket aims to highlight the problem and propose a
> simple mitigation: add a new configuration to enable/disable rolled segment
> flush.
> h1. 1. Phenomenon
> Response time of produce requests (99th ~ 99.9th %ile) repeatedly spikes to
> ~50x-200x more than usual. For instance, normally the 99th %ile is lower than
> 5ms, but when this issue occurs, it marks 100ms to 200ms. The 99.9th and
> 99.99th %iles even jump to 500-700ms.
> Latency spikes happen at a constant frequency (depending on the input
> throughput), for small amounts of time. All the producers experience a
> latency increase at the same time.
> h1. !image-2020-03-10-13-17-34-618.png|width=942,height=314!
> {{Example of response time plot observed on a single producer.}}
> URPs rarely appear in correspondence with the latency spikes too. This is
> harder to reproduce, but from time to time it is possible to see a few
> partitions going out of sync in correspondence with a spike.
> h1. 2. Experiment
> h2. 2.1 Setup
> Kafka cluster hosted on AWS EC2 instances.
> h4. Cluster
> * 15 Kafka brokers: (EC2 m5.4xlarge)
> ** Disk: 1100Gb EBS volumes (4750Mbps)
> ** Network: 10 Gbps
> ** CPU: 16 Intel Xeon Platinum 8000
> ** Memory: 64Gb
> * 3 Zookeeper nodes: m5.large
> * 6 producers on 6 EC2 instances in the same region
> * 1 topic, 90 partitions - replication factor=3
> h4. Broker config
> Relevant configurations:
> {quote}num.io.threads=8
> num.replica.fetchers=2
> offsets.topic.replication.factor=3
> num.network.threads=5
> num.recovery.threads.per.data.dir=2
> min.insync.replicas=2
> num.partitions=1
> {quote}
> h4. Perf Test
> * Throughput ~6000-8000 (~40-70Mb/s input + replication = ~120-210Mb/s per
> broker)
> * record size = 2
> * Acks = 1, linger.ms = 1, compression.type = none
> * Test duration: ~20/30min
> h2. 2.2 Analysis
> Our analysis showed a high +correlation between log segment flush count/rate
> and the latency spikes+. This indicates that the spikes in max latency are
> related to Kafka's behavior on rolling over new segments.
> The other metrics did not show any relevant impact on any hardware component
> of the cluster, e.g. cpu, memory, network traffic, disk throughput...
>
> !latency_plot2.png|width=924,height=308!
> {{Correlation between latency spikes and log segment flush count. p50, p95,
> p99, p999 and p latencies (left axis, ns) and the flush #count (right
> axis, stepping blue line in plot).}}
> Kafka schedules log flushing (this includes flushing the file record
> containing log entries, the offset index, the timestamp index and the
> transaction index) during _roll_ operations. A log is rolled over onto a new
> empty log when:
> * the log segment is full
> * the max time has elapsed since the timestamp of the first message in the
> segment (or, in its absence, since the create time)
> * the index is full
> In this case, the increase in latency happens on _append_ of a new message
> set to the active segment of the log. This is a synchronous operation which
> therefore blocks producer requests, causing the latency increase.
> To confirm this, I instrumented Kafka to measure the duration of the
> FileRecords.append(MemoryRecords) method, which is responsible for writing
> memory records to file. As a result, I observed the same spiky pattern as
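The instrumentation step described above (timing `FileRecords.append(MemoryRecords)`) can be approximated outside Kafka with a plain `FileChannel`. The helper below is a hedged stand-in for that measurement, not the actual patch; `AppendTimer` and `timedAppend` are invented names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hedged sketch of the instrumentation idea: wrap an append to a segment file
// in a System.nanoTime() measurement to observe per-append latency spikes.
final class AppendTimer {

    static long timedAppend(FileChannel channel, ByteBuffer records) throws IOException {
        long start = System.nanoTime();
        while (records.hasRemaining()) {
            channel.write(records); // synchronous append, as on Kafka's hot path
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path segment = Files.createTempFile("segment", ".log");
        try (FileChannel ch = FileChannel.open(segment,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            long nanos = timedAppend(ch, ByteBuffer.wrap(new byte[2048]));
            System.out.println("append took " + nanos + " ns");
        }
        Files.deleteIfExists(segment);
    }
}
```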
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057241#comment-17057241 ] Guozhang Wang commented on KAFKA-8803: -- [~rocketraman] Yes, here is the current status:
1) On the streams side, I've merged a PR, https://github.com/apache/kafka/pull/8060, which handles the timeout exception gracefully and retries. It is merged in trunk and hence will be in the 2.6.0 release (you can also try it out by compiling from trunk).
2) The root cause of this long init-pid timeout, as far as we've investigated so far, is broker-side partition availability: the txn record cannot be successfully written (note that since the txn record is critical for determining the state of the txn, brokers require acks=all with min.isr for this append), and hence the current dangling txn cannot be aborted and thus the new PID cannot be granted. We're still working on the broker-side improvement to reduce the likelihood of the scenario that may cause this timeout.
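The graceful-retry behaviour referenced for the streams side (apache/kafka PR 8060) can be illustrated with a generic, standalone helper. The real change lives inside Kafka Streams' task initialization; the names below and the use of `java.util.concurrent.TimeoutException` (rather than Kafka's own unchecked `TimeoutException`) are assumptions of the sketch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Sketch of the pattern: retry an initialization step a bounded number of
// times when it fails with a timeout, instead of letting the thread die on
// the first failure.
final class InitRetry {

    static <T> T withRetries(Callable<T> init, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return init.call();
            } catch (TimeoutException e) {
                last = e; // e.g. broker slow to grant the producer id; try again
            }
        }
        throw last; // retry budget exhausted: surface the last timeout
    }
}
```

With a step that times out twice and then succeeds, `withRetries(step, 5)` returns the result on the third attempt rather than crashing.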
[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057191#comment-17057191 ] Raman Gupta commented on KAFKA-8803: Any progress on this? Still happening regularly for us.
[jira] [Commented] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll
[ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056893#comment-17056893 ] Paolo Moriello commented on KAFKA-9693: --- [~junrao] Thanks for the interest! To answer your questions: (1) the filesystem used in the test is ext4 with jbd2. (2) yes, we could get the same benefit by delaying the flush; however, I could see the same effect (latency spikes mitigated) only with a delay >= 30s (because of the default kernel page cache configuration). I guess we could expose a configuration to control this delay, instead of the on/off only as originally proposed.
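A configurable flush delay, as floated in the comment above, could look roughly like the following. Everything here is invented for illustration (`DelayedFlush` is not a Kafka class and no such config exists): the point is only that the rolled segment's flush moves off the producer-facing append path onto a scheduler.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the mitigation discussed: instead of flushing the old segment
// synchronously on roll, schedule the flush after a configurable delay
// (e.g. >= 30s, by which time the kernel has typically written the dirty
// pages back already, making the explicit flush cheap).
final class DelayedFlush {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long delayMs;

    DelayedFlush(long delayMs) { this.delayMs = delayMs; }

    ScheduledFuture<?> onSegmentRolled(Runnable flushOldSegment) {
        // The append path returns immediately; the flush runs later.
        return scheduler.schedule(flushOldSegment, delayMs, TimeUnit.MILLISECONDS);
    }

    void close() { scheduler.shutdown(); }
}
```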
[jira] [Comment Edited] (KAFKA-9458) Kafka crashed in windows environment
[ https://issues.apache.org/jira/browse/KAFKA-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021999#comment-17021999 ] hirik edited comment on KAFKA-9458 at 3/11/20, 10:13 AM: - [~manme...@gmail.com], I created a git pull request. Can you please review the changes? [https://github.com/apache/kafka/pull/8276] was (Author: hirik): [~manme...@gmail.com], I created a git pull request. Can you please review the changes? [https://github.com/apache/kafka/pull/8276] [|https://github.com/apache/kafka/pull/8002]
> Kafka crashed in windows environment
>
> Key: KAFKA-9458
> URL: https://issues.apache.org/jira/browse/KAFKA-9458
> Project: Kafka
> Issue Type: Bug
> Components: log
> Affects Versions: 2.4.0
> Environment: Windows Server 2019
> Reporter: hirik
> Priority: Critical
> Labels: windows
> Fix For: 2.6.0
> Attachments: logs.zip
>
> Hi,
> while I was trying to validate the Kafka retention policy, the Kafka server
> crashed with the below exception trace.
> [2020-01-21 17:10:40,475] INFO [Log partition=test1-3,
> dir=C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka]
> Rolled new log segment at offset 1 in 52 ms. (kafka.log.Log)
> [2020-01-21 17:10:40,484] ERROR Error while deleting segments for test1-3 in
> dir C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka
> (kafka.server.LogDirFailureChannel)
> java.nio.file.FileSystemException:
> C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex
> ->
> C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex.deleted:
> The process cannot access the file because it is being used by another
> process.
> at > java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92) > at > java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103) > at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:395) > at > java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:292) > at java.base/java.nio.file.Files.move(Files.java:1425) > at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:795) > at kafka.log.AbstractIndex.renameTo(AbstractIndex.scala:209) > at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:497) > at kafka.log.Log.$anonfun$deleteSegmentFiles$1(Log.scala:2206) > at kafka.log.Log.$anonfun$deleteSegmentFiles$1$adapted(Log.scala:2206) > at scala.collection.immutable.List.foreach(List.scala:305) > at kafka.log.Log.deleteSegmentFiles(Log.scala:2206) > at kafka.log.Log.removeAndDeleteSegments(Log.scala:2191) > at kafka.log.Log.$anonfun$deleteSegments$2(Log.scala:1700) > at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.scala:17) > at kafka.log.Log.maybeHandleIOException(Log.scala:2316) > at kafka.log.Log.deleteSegments(Log.scala:1691) > at kafka.log.Log.deleteOldSegments(Log.scala:1686) > at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1763) > at kafka.log.Log.deleteOldSegments(Log.scala:1753) > at kafka.log.LogManager.$anonfun$cleanupLogs$3(LogManager.scala:982) > at kafka.log.LogManager.$anonfun$cleanupLogs$3$adapted(LogManager.scala:979) > at scala.collection.immutable.List.foreach(List.scala:305) > at kafka.log.LogManager.cleanupLogs(LogManager.scala:979) > at kafka.log.LogManager.$anonfun$startup$2(LogManager.scala:403) > at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:116) > at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:65) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) 
> at > java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:830) > Suppressed: java.nio.file.FileSystemException: > C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex > -> > C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex.deleted: > The process cannot access the file because it is being used by another > process. > at > java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92) > at > java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103) > at java.base/sun.nio.fs.Wi
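The `Utils.atomicMoveWithFallback` frame in the trace follows a try-atomic-then-fallback pattern, which also explains the `Suppressed:` exception in the log: when the fallback fails too, the atomic failure is attached to it. The standalone re-implementation below is a simplified sketch of that pattern, not Kafka's exact source.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class Moves {

    // Try an atomic rename first; if the filesystem refuses, fall back to a
    // plain replacing move. If both fail (as on Windows while another handle,
    // e.g. a memory-mapped .timeindex, is still open on the file), the atomic
    // failure is attached as a suppressed exception -- matching the
    // "Suppressed:" block in the stack trace above.
    static void atomicMoveWithFallback(Path source, Path target) throws IOException {
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException atomicFailure) {
            try {
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException fallbackFailure) {
                fallbackFailure.addSuppressed(atomicFailure);
                throw fallbackFailure;
            }
        }
    }
}
```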
[jira] [Comment Edited] (KAFKA-9458) Kafka crashed in windows environment
[ https://issues.apache.org/jira/browse/KAFKA-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021999#comment-17021999 ] hirik edited comment on KAFKA-9458 at 3/11/20, 10:13 AM: - [~manme...@gmail.com], I created a git pull request. Can you please review the changes? [https://github.com/apache/kafka/pull/8276] [|https://github.com/apache/kafka/pull/8002] was (Author: hirik): [~manme...@gmail.com], I created a git pull request. Can you please review the changes? [https://github.com/apache/kafka/pull/8002]
[jira] [Commented] (KAFKA-9458) Kafka crashed in windows environment
[ https://issues.apache.org/jira/browse/KAFKA-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056819#comment-17056819 ] ASF GitHub Bot commented on KAFKA-9458: --- hirik commented on pull request #8002: KAFKA-9458 Fix windows clean log fail caused Runtime.halt URL: https://github.com/apache/kafka/pull/8002 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Kafka crashed in windows environment > > > Key: KAFKA-9458 > URL: https://issues.apache.org/jira/browse/KAFKA-9458 > Project: Kafka > Issue Type: Bug > Components: log >Affects Versions: 2.4.0 > Environment: Windows Server 2019 >Reporter: hirik >Priority: Critical > Labels: windows > Fix For: 2.6.0 > > Attachments: logs.zip > > > Hi, > while I was trying to validate Kafka retention policy, Kafka Server crashed > with below exception trace. > [2020-01-21 17:10:40,475] INFO [Log partition=test1-3, > dir=C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka] > Rolled new log segment at offset 1 in 52 ms. (kafka.log.Log) > [2020-01-21 17:10:40,484] ERROR Error while deleting segments for test1-3 in > dir C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka > (kafka.server.LogDirFailureChannel) > java.nio.file.FileSystemException: > C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex > -> > C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex.deleted: > The process cannot access the file because it is being used by another > process. 
> at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92)
> at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
> at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:395)
> at java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:292)
> at java.base/java.nio.file.Files.move(Files.java:1425)
> at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:795)
> at kafka.log.AbstractIndex.renameTo(AbstractIndex.scala:209)
> at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:497)
> at kafka.log.Log.$anonfun$deleteSegmentFiles$1(Log.scala:2206)
> at kafka.log.Log.$anonfun$deleteSegmentFiles$1$adapted(Log.scala:2206)
> at scala.collection.immutable.List.foreach(List.scala:305)
> at kafka.log.Log.deleteSegmentFiles(Log.scala:2206)
> at kafka.log.Log.removeAndDeleteSegments(Log.scala:2191)
> at kafka.log.Log.$anonfun$deleteSegments$2(Log.scala:1700)
> at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.scala:17)
> at kafka.log.Log.maybeHandleIOException(Log.scala:2316)
> at kafka.log.Log.deleteSegments(Log.scala:1691)
> at kafka.log.Log.deleteOldSegments(Log.scala:1686)
> at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1763)
> at kafka.log.Log.deleteOldSegments(Log.scala:1753)
> at kafka.log.LogManager.$anonfun$cleanupLogs$3(LogManager.scala:982)
> at kafka.log.LogManager.$anonfun$cleanupLogs$3$adapted(LogManager.scala:979)
> at scala.collection.immutable.List.foreach(List.scala:305)
> at kafka.log.LogManager.cleanupLogs(LogManager.scala:979)
> at kafka.log.LogManager.$anonfun$startup$2(LogManager.scala:403)
> at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:116)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:65)
> at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
> at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:830)
> Suppressed: java.nio.file.FileSystemException: C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex -> C:\Users\Administrator\Downloads\kafka\bin\windows\..\..\data\kafka\test1-3\.timeindex.deleted: The process cannot access the file because it is being used by another process.
> at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:92)
> at java.base/sun.n
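The failing frame in the trace above, Utils.atomicMoveWithFallback, follows a common pattern: attempt an atomic rename first, then fall back to a plain replace-existing move if the filesystem refuses. A minimal self-contained sketch of that pattern (illustrative only, not Kafka's actual implementation; class and file names are invented):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicMoveSketch {
    // Try an atomic rename; if the filesystem rejects it, fall back to a
    // plain replace-existing move, keeping the original failure as suppressed.
    static void atomicMoveWithFallback(Path source, Path target) throws IOException {
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException outer) {
            try {
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException inner) {
                inner.addSuppressed(outer);
                throw inner;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("seg");
        Path src = Files.writeString(dir.resolve("00000000.index"), "data");
        // Same ".deleted" suffix rename that fails in the Windows trace above.
        Path dst = dir.resolve("00000000.index.deleted");
        atomicMoveWithFallback(src, dst);
        System.out.println(Files.exists(dst) && !Files.exists(src)); // true
    }
}
```

On Windows, both the atomic rename and the fallback fail if another handle (such as an open memory map) still pins the source file, which is exactly the FileSystemException reported here.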
[jira] [Commented] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056795#comment-17056795 ] Nishant Ranjan commented on KAFKA-9702: --- Hi Alexandre, I attached the hprof dump. What I did this time: I simply polled from a topic which has 500,000 records. No processing. The Leak Suspects report is attached as well. As mentioned earlier, I used MAT.
> Suspected memory leak
> ---------------------
>
> Key: KAFKA-9702
> URL: https://issues.apache.org/jira/browse/KAFKA-9702
> Project: Kafka
> Issue Type: Bug
> Components: consumer
> Affects Versions: 2.4.0
> Reporter: Nishant Ranjan
> Priority: Major
> Attachments: java_pid8020.0001.hprof.7z, java_pid8020.0001_Leak_Suspects.zip
>
> I am using a Kafka consumer to fetch objects from a Kafka topic and then persist them in a DB.
> When I ran Eclipse MAT for memory leaks, it reported the following:
> 53 instances of *"java.util.zip.ZipFile$Source"*, loaded by *"<system class loader>"*, occupy *22,33,048 (31.01%)* bytes.
> Also, my observation is that GC is not collecting objects.
> Please let me know if more information is required.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishant Ranjan updated KAFKA-9702: -- Attachment: java_pid8020.0001_Leak_Suspects.zip -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishant Ranjan updated KAFKA-9702: -- Attachment: java_pid8020.0001.hprof.7z -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9704) z/OS won't let us resize file when mmap
Shuo Zhang created KAFKA-9704: - Summary: z/OS won't let us resize file when mmap Key: KAFKA-9704 URL: https://issues.apache.org/jira/browse/KAFKA-9704 Project: Kafka Issue Type: Task Components: log Affects Versions: 2.4.0 Reporter: Shuo Zhang Fix For: 2.5.0, 2.4.2 z/OS won't let us resize a file while it is memory-mapped, so we need to force an unmap, as is already done for Windows. -- This message was sent by Atlassian Jira (v8.3.4#803005)
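"Force unmap like Windows" refers to releasing a memory mapping explicitly before resizing the underlying file, since some platforms pin the file while a mapping exists. A minimal stand-alone sketch of the idea (not Kafka's actual AbstractIndex code; assumes JDK 9+, where the unofficial sun.misc.Unsafe.invokeCleaner is available):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ForceUnmapSketch {
    // Explicitly release a MappedByteBuffer's mapping (JDK 9+,
    // via the jdk.unsupported sun.misc.Unsafe.invokeCleaner).
    static void forceUnmap(MappedByteBuffer buffer) throws Exception {
        Field f = Class.forName("sun.misc.Unsafe").getDeclaredField("theUnsafe");
        f.setAccessible(true);
        sun.misc.Unsafe unsafe = (sun.misc.Unsafe) f.get(null);
        unsafe.invokeCleaner(buffer);
    }

    // Map a file, unmap it explicitly, then resize it; returns the new length.
    static long demo() throws Exception {
        File file = File.createTempFile("index", ".tmp");
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(4096);
            MappedByteBuffer mmap =
                    raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            mmap.putInt(0, 42);
            mmap.force();
            // On platforms that pin mappings (Windows, and per this report z/OS),
            // the resize below fails while the buffer is still mapped.
            forceUnmap(mmap); // never touch `mmap` after this point
            raf.setLength(1024);
        }
        long len = file.length();
        file.delete();
        return len;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // 1024
    }
}
```

Accessing the buffer after invokeCleaner is undefined behavior (it can crash the JVM), which is why real implementations null out the reference immediately after unmapping.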
[jira] [Created] (KAFKA-9703) ProducerBatch.split takes up too many resources if the bigBatch is huge
jiamei xie created KAFKA-9703: - Summary: ProducerBatch.split takes up too many resources if the bigBatch is huge Key: KAFKA-9703 URL: https://issues.apache.org/jira/browse/KAFKA-9703 Project: Kafka Issue Type: Bug Reporter: jiamei xie ProducerBatch.split takes up too many resources and might cause an OutOfMemoryError if the bigBatch is huge. How I found this issue is described in https://lists.apache.org/list.html?us...@kafka.apache.org:lte=1M:MESSAGE_TOO_LARGE
The following is the code which takes a lot of resources:
{code:java}
for (Record record : recordBatch) {
    assert thunkIter.hasNext();
    Thunk thunk = thunkIter.next();
    if (batch == null)
        batch = createBatchOffAccumulatorForRecord(record, splitBatchSize);

    // A newly created batch can always host the first message.
    if (!batch.tryAppendForSplit(record.timestamp(), record.key(), record.value(), record.headers(), thunk)) {
        batches.add(batch);
        batch = createBatchOffAccumulatorForRecord(record, splitBatchSize);
        batch.tryAppendForSplit(record.timestamp(), record.key(), record.value(), record.headers(), thunk);
    }
}
{code}
Referring to RecordAccumulator#tryAppend, we can call closeForRecordAppends() after a batch is full:
{code:java}
private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers,
                                     Callback callback, Deque<ProducerBatch> deque, long nowMs) {
    ProducerBatch last = deque.peekLast();
    if (last != null) {
        FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, nowMs);
        if (future == null)
            last.closeForRecordAppends();
        else
            return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false, false);
    }
    return null;
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
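The suggestion above, closing each split batch for further appends as soon as it is full so that append-side resources can be released early, can be sketched with a toy batch type (all class and method names here are illustrative stand-ins, not Kafka's actual ProducerBatch API):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Toy stand-in for ProducerBatch: fixed capacity, can be closed for appends.
    static final class Batch {
        private final int capacity;
        private final List<String> records = new ArrayList<>();
        private boolean closedForAppends = false;

        Batch(int capacity) { this.capacity = capacity; }

        boolean tryAppend(String record) {
            if (closedForAppends || records.size() >= capacity) return false;
            records.add(record);
            return true;
        }

        // Release append-side resources early, as the JIRA suggests.
        void closeForRecordAppends() { closedForAppends = true; }

        int size() { return records.size(); }
    }

    // Split a big record list into batches of at most splitBatchSize,
    // closing each full batch before allocating the next one.
    static List<Batch> split(List<String> bigBatch, int splitBatchSize) {
        List<Batch> batches = new ArrayList<>();
        Batch batch = null;
        for (String record : bigBatch) {
            if (batch == null) batch = new Batch(splitBatchSize);
            if (!batch.tryAppend(record)) {
                batch.closeForRecordAppends(); // full: stop holding append buffers
                batches.add(batch);
                batch = new Batch(splitBatchSize);
                batch.tryAppend(record); // a fresh batch always takes one record
            }
        }
        if (batch != null) { batch.closeForRecordAppends(); batches.add(batch); }
        return batches;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add("record-" + i);
        System.out.println(split(records, 3).size()); // 4
    }
}
```

The point of the early close is that a batch which will never be appended to again does not need to keep its append-path state alive while the rest of the big batch is being split.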
[jira] [Closed] (KAFKA-9343) Add ps command for Kafka and zookeeper process on z/OS.
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang closed KAFKA-9343. -
> Add ps command for Kafka and zookeeper process on z/OS.
> -------------------------------------------------------
>
> Key: KAFKA-9343
> URL: https://issues.apache.org/jira/browse/KAFKA-9343
> Project: Kafka
> Issue Type: Task
> Components: tools
> Affects Versions: 2.4.0
> Environment: z/OS, OS/390
> Reporter: Shuo Zhang
> Priority: Major
> Labels: OS/390, z/OS
> Fix For: 2.5.0, 2.4.2
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> +Note: since the final change scope changed, I changed the summary and description.+
> The existing method of checking the Kafka process on other platforms isn't applicable to z/OS; on z/OS, the best keyword we can use is the JOBNAME.
> PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}')
> -->
> PIDS=$(ps -A -o pid,jobname,comm | grep -i $JOBNAME | grep java | grep -v grep | awk '{print $1}')
> The same applies to the zookeeper process.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9343) Add ps command for Kafka and zookeeper process on z/OS.
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang updated KAFKA-9343: -- Description: +Note: since the final change scope changed, I changed the summary and description.+ The existing method of checking the Kafka process on other platforms isn't applicable to z/OS; on z/OS, the best keyword we can use is the JOBNAME.
PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}')
-->
PIDS=$(ps -A -o pid,jobname,comm | grep -i $JOBNAME | grep java | grep -v grep | awk '{print $1}')
The same applies to the zookeeper process. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9343) Add ps command for Kafka and zookeeper process on z/OS.
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang updated KAFKA-9343: -- Description: (was: +Note: since the final change scope changed, I changed the summary and description.+ The existing method to check Kafka process for other platform doesn't applicable for z/OS, on z/OS, the best keyword we can use is the JOBNAME. PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}') --> PIDS=$(ps -A -o pid,jobname,comm | grep -i $JOBNAME | grep java | grep -v grep | awk '{print $1}') So does the zookeeper process.) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9343) Add ps command for Kafka and zookeeper process on z/OS.
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang updated KAFKA-9343: -- Fix Version/s: 2.4.2 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9343) Add ps command for Kafka and zookeeper process on z/OS.
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang updated KAFKA-9343: -- Fix Version/s: (was: 2.6.0) 2.5.0 Description: +Note: since the final change scope changed, I changed the summary and description.+ The existing method to check Kafka process for other platform doesn't applicable for z/OS, on z/OS, the best keyword we can use is the JOBNAME. PIDS=$(ps ax | grep -i 'kafka\.Kafka' | grep java | grep -v grep | awk '{print $1}') --> PIDS=$(ps -A -o pid,jobname,comm | grep -i $JOBNAME | grep java | grep -v grep | awk '{print $1}') So does the zookeeper process. was: To make Kafka runnable on z/OS, I need some changes to the shell script and java/scala files, to Summary: Add ps command for Kafka and zookeeper process on z/OS. (was: Make Kafka shell runnable on z/OS) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9698) Wrong default max.message.bytes in document
[ https://issues.apache.org/jira/browse/KAFKA-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056751#comment-17056751 ] Tom Bentley commented on KAFKA-9698: --- KAFKA-4203 has been fixed for Kafka 2.5, which is not released yet, and is therefore not reflected in the documentation currently at http://kafka.apache.org/documentation/. When 2.5 is released, the documentation will be updated.
> Wrong default max.message.bytes in document
> -------------------------------------------
>
> Key: KAFKA-9698
> URL: https://issues.apache.org/jira/browse/KAFKA-9698
> Project: Kafka
> Issue Type: Bug
> Components: documentation
> Affects Versions: 2.4.0
> Reporter: jiamei xie
> Priority: Major
>
> The broker default for max.message.bytes has been changed to 1048588 in https://issues.apache.org/jira/browse/KAFKA-4203, but the default value in http://kafka.apache.org/documentation/ is still 112.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9343) Make Kafka shell runnable on z/OS
[ https://issues.apache.org/jira/browse/KAFKA-9343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Zhang updated KAFKA-9343: -- Description: To make Kafka runnable on z/OS, I need some changes to the shell script and java/scala files, to (was: To make Kafka runnable on z/OS, I need to change the following 4 shell scripts: kafka-run-class.sh, kafka-server-start/sh, kafka-server-stop.sh, zookeeper-server-stop.sh ) > Make Kafka shell runnable on z/OS > - > > Key: KAFKA-9343 > URL: https://issues.apache.org/jira/browse/KAFKA-9343 > Project: Kafka > Issue Type: Task > Components: tools >Affects Versions: 2.4.0 > Environment: z/OS, OS/390 >Reporter: Shuo Zhang >Priority: Major > Labels: OS/390, z/OS > Fix For: 2.6.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > To make Kafka runnable on z/OS, I need some changes to the shell script and > java/scala files, to -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056739#comment-17056739 ] Alexandre Dupriez commented on KAFKA-9702: -- Could you please provide a link to a stored heap dump? Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056738#comment-17056738 ] Nishant Ranjan commented on KAFKA-9702: --- When I run the proper code on a Linux server with approx. 500,000/600,000 records, memory shoots to 6 GB and then I get an OOM (memory constraints). The above was a sample from a laptop on Windows (a kind of "Hello World" of the problem). To identify the memory issue, I wrote just 3 lines:
while (true) {
    consumerRecords = consumer.poll(1000);
    consumerRecords = null;
}
Even with the above, MAT is showing a probable leak. Maybe I am missing some Kafka option, but there is some memory issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9702) Suspected memory leak
[ https://issues.apache.org/jira/browse/KAFKA-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056725#comment-17056725 ] Alexandre Dupriez commented on KAFKA-9702: -- 22,33,048 bytes - do you mean 22 MB? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9702) Suspected memory leak
Nishant Ranjan created KAFKA-9702: - Summary: Suspected memory leak Key: KAFKA-9702 URL: https://issues.apache.org/jira/browse/KAFKA-9702 Project: Kafka Issue Type: Bug Components: consumer Affects Versions: 2.4.0 Reporter: Nishant Ranjan I am using a Kafka consumer to fetch objects from a Kafka topic and then persist them in a DB. When I ran Eclipse MAT for memory leaks, it reported the following: 53 instances of *"java.util.zip.ZipFile$Source"*, loaded by *"<system class loader>"*, occupy *22,33,048 (31.01%)* bytes. Also, my observation is that GC is not collecting objects. Please let me know if more information is required. -- This message was sent by Atlassian Jira (v8.3.4#803005)