[jira] [Created] (KAFKA-12713) Report broker/consumer fetch latency more accurately
Ming Liu created KAFKA-12713:

Summary: Report broker/consumer fetch latency more accurately
Key: KAFKA-12713
URL: https://issues.apache.org/jira/browse/KAFKA-12713
Project: Kafka
Issue Type: Bug
Reporter: Ming Liu

Fetch latency is an important metric for monitoring cluster performance. With acks=all, produce latency is dominated by broker fetch latency. However, the currently reported fetch latency does not reflect the true fetch latency, because a fetch request sometimes has to sit in purgatory and wait up to replica.fetch.wait.max.ms when data is not available. This significantly skews the real P50, P99, etc. I would like to propose a KIP to track the real fetch latency for both broker followers and consumers.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
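To see how purgatory waits distort the reported percentiles, here is a small illustration (the latency values are made up, not from the issue): a handful of fetches that waited the full replica.fetch.wait.max.ms (500 ms by default) can pull the tail percentiles far away from the typical fetch time.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a list of latencies."""
    ordered = sorted(values)
    return ordered[math.ceil(p / 100 * len(ordered)) - 1]

# 90 fetches that returned data quickly (~5 ms) and 10 that sat in
# purgatory for the full replica.fetch.wait.max.ms (500 ms) because
# no data was available to return.
latencies = [5] * 90 + [500] * 10

print(percentile(latencies, 50))   # 5   -- the typical fetch is fast
print(percentile(latencies, 99))   # 500 -- purgatory waits dominate the tail
```

The P99 here says nothing about how fast the broker actually serves data; it mostly measures how often the fetcher had to wait for new data, which is what this issue proposes to separate out.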
[jira] [Created] (KAFKA-10398) Intra-broker disk move failed with onPartitionFenced()
Ming Liu created KAFKA-10398:

Summary: Intra-broker disk move failed with onPartitionFenced()
Key: KAFKA-10398
URL: https://issues.apache.org/jira/browse/KAFKA-10398
Project: Kafka
Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Ming Liu

When I tried an intra-broker disk move on 2.5.0, it always failed quickly with an onPartitionFenced() failure. This is the entire log from the ReplicaAlterLogDirsManager:

[2020-06-03 04:52:17,541] INFO [ReplicaAlterLogDirsManager on broker 5] Added fetcher to broker BrokerEndPoint(id=5, host=localhost:-1) for partitions Map(author_id_enrichment_changelog_staging-302 -> (offset=0, leaderEpoch=45)) (kafka.server.ReplicaAlterLogDirsManager)
[2020-06-03 04:52:17,546] INFO [ReplicaAlterLogDirsThread-5]: Starting (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,546] INFO [ReplicaAlterLogDirsThread-5]: Truncating partition author_id_enrichment_changelog_staging-302 to local high watermark 0 (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,547] INFO [ReplicaAlterLogDirsThread-5]: Beginning/resuming copy of partition author_id_enrichment_changelog_staging-302 from offset 0. Including this partition, there are 1 remaining partitions to copy by this thread. (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,547] WARN [ReplicaAlterLogDirsThread-5]: Reset fetch offset for partition author_id_enrichment_changelog_staging-302 from 0 to current leader's start offset 1656927679 (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,550] INFO [ReplicaAlterLogDirsThread-5]: Current offset 0 for partition author_id_enrichment_changelog_staging-302 is out of range, which typically implies a leader change. Reset fetch offset to 1656927679 (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,653] INFO [ReplicaAlterLogDirsThread-5]: Partition author_id_enrichment_changelog_staging-302 has an older epoch (45) than the current leader. Will await the new LeaderAndIsr state before resuming fetching. (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,653] WARN [ReplicaAlterLogDirsThread-5]: Partition author_id_enrichment_changelog_staging-302 marked as failed (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,657] INFO [ReplicaAlterLogDirsThread-5]: Shutting down (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,661] INFO [ReplicaAlterLogDirsThread-5]: Stopped (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,661] INFO [ReplicaAlterLogDirsThread-5]: Shutdown completed (kafka.server.ReplicaAlterLogDirsThread)

Only after restarting the broker did the disk move succeed. The offset and epoch numbers then look correct:

[2020-06-03 05:20:12,597] INFO [ReplicaAlterLogDirsManager on broker 5] Added fetcher to broker BrokerEndPoint(id=5, host=localhost:-1) for partitions Map(author_id_enrichment_changelog_staging-302 -> (offset=166346, leaderEpoch=47)) (kafka.server.ReplicaAlterLogDirsManager)
[2020-06-03 05:20:12,606] INFO [ReplicaAlterLogDirsThread-5]: Starting (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 05:20:12,618] INFO [ReplicaAlterLogDirsThread-5]: Beginning/resuming copy of partition author_id_enrichment_changelog_staging-302 from offset 1657605964. Including this partition, there are 1 remaining partitions to copy by this thread. (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 05:20:20,992] INFO [ReplicaAlterLogDirsThread-5]: Shutting down (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 05:20:20,994] INFO [ReplicaAlterLogDirsThread-5]: Shutdown completed (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 05:20:20,994] INFO [ReplicaAlterLogDirsThread-5]: Stopped (kafka.server.ReplicaAlterLogDirsThread)
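For context, an intra-broker disk move of the affected replica is triggered with kafka-reassign-partitions.sh and a reassignment JSON whose log_dirs entry pins each replica to a specific log directory (per KIP-113). The destination directory below is illustrative, not taken from the issue:

```json
{
  "version": 1,
  "partitions": [
    {
      "topic": "author_id_enrichment_changelog_staging",
      "partition": 302,
      "replicas": [5],
      "log_dirs": ["/data/disk2"]
    }
  ]
}
```

This file is passed via --reassignment-json-file with --execute; the log_dirs array is parallel to replicas, and "any" can be used for replicas that should stay where they are.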
[jira] [Created] (KAFKA-10242) Adding metrics to track the total count of idempotent producers that Broker need to track
Ming Liu created KAFKA-10242:

Summary: Adding metrics to track the total count of idempotent producers that Broker need to track
Key: KAFKA-10242
URL: https://issues.apache.org/jira/browse/KAFKA-10242
Project: Kafka
Issue Type: Improvement
Components: producer
Affects Versions: 2.5.0
Reporter: Ming Liu
Fix For: 2.6.0

We have found it very useful to track the total number of idempotent producers that a broker is tracking. In our production environment, a cluster has many idempotent producers, and sometimes that number grows high enough to require attention to mitigate. This is especially true for older clients (< 2.4), where client retries can generate many distinct idempotent producers, which can trigger broker GC.
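A toy model of what such a metric would observe (this is a simplified sketch, not Kafka's actual ProducerStateManager code): the broker keeps per-producer-id state until it expires, and the proposed gauge is simply the size of that map.

```python
class ProducerStateTracker:
    """Simplified model of a broker tracking idempotent producer ids.
    Entries expire after `expiration_ms` of inactivity, loosely analogous
    to Kafka's producer-id expiration."""

    def __init__(self, expiration_ms):
        self.expiration_ms = expiration_ms
        self.last_seen = {}  # producer_id -> timestamp (ms) of last append

    def record_append(self, producer_id, now_ms):
        self.last_seen[producer_id] = now_ms

    def expire(self, now_ms):
        cutoff = now_ms - self.expiration_ms
        self.last_seen = {pid: ts for pid, ts in self.last_seen.items()
                          if ts >= cutoff}

    def tracked_producer_count(self):
        # The value the proposed metric would report.
        return len(self.last_seen)

tracker = ProducerStateTracker(expiration_ms=1000)
tracker.record_append(1, now_ms=0)
tracker.record_append(2, now_ms=500)
print(tracker.tracked_producer_count())  # 2
tracker.expire(now_ms=1500)
print(tracker.tracked_producer_count())  # 1 (producer 1 expired)
```

Retrying pre-2.4 clients that keep acquiring fresh producer ids inflate this count, so a gauge on it gives early warning before memory pressure and GC become a problem.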
[jira] [Created] (KAFKA-7698) Kafka Broker fail to start: ProducerFencedException thrown from producerstatemanager.scala!checkProducerEpoch
Ming Liu created KAFKA-7698:

Summary: Kafka Broker fail to start: ProducerFencedException thrown from producerstatemanager.scala!checkProducerEpoch
Key: KAFKA-7698
URL: https://issues.apache.org/jira/browse/KAFKA-7698
Project: Kafka
Issue Type: Bug
Reporter: Ming Liu

During our operation of Kafka, we frequently saw this failure during broker startup: "There was an error in one of the threads during logs loading: org.apache.kafka.common.errors.ProducerFencedException".

{code:java}
[06:57:09,697] INFO [ProducerStateManager partition=interaction_events-127] Loading producer state from snapshot file '/data/disk5/kafka/interaction_events-127/092130764817.snapshot' (kafka.log.ProducerStateManager)
[06:57:09,698] INFO [Log partition=interaction_events-127, dir=/data/disk5/kafka] Completed load of log with 11 segments, log start offset 91975003024 and log end offset 92130764817 in 12701 ms (kafka.log.Log)
[06:57:09,701] ERROR There was an error in one of the threads during logs loading: org.apache.kafka.common.errors.ProducerFencedException: Producer's epoch is no longer valid. There is probably another producer with a newer epoch. 63 (request epoch), 66 (server epoch) (kafka.log.LogManager)
[06:57:09,705] INFO [ProducerStateManager partition=client_interaction_events_authorid_enrichment-20] Writing producer snapshot at offset 92418754384 (kafka.log.ProducerStateManager)
[06:57:09,707] ERROR [KafkaServer id=2] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.kafka.common.errors.ProducerFencedException: Producer's epoch is no longer valid. There is probably another producer with a newer epoch. 63 (request epoch), 66 (server epoch)
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7692) updateFirstUnstableOffset NPE due to sequenceId overflow in ProducerStateManager.append
Ming Liu created KAFKA-7692:

Summary: updateFirstUnstableOffset NPE due to sequenceId overflow in ProducerStateManager.append
Key: KAFKA-7692
URL: https://issues.apache.org/jira/browse/KAFKA-7692
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 2.0.0
Reporter: Ming Liu
Fix For: 2.2.0

When operating Kafka, we frequently saw this exception in the Kafka server log:

[2018-06-04 20:44:53,789] ERROR [KafkaServer id=19] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.lang.NullPointerException
    at kafka.log.Log.kafka$log$Log$$updateFirstUnstableOffset(Log.scala:792)
    at kafka.log.Log.kafka$log$Log$$loadProducerState(Log.scala:518)
    at kafka.log.Log.<init>(Log.scala:228)
    at kafka.log.Log$.apply(Log.scala:1747)
    at kafka.log.LogManager.kafka$log$LogManager$$loadLog(LogManager.scala:255)
    at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$11$$anonfun$apply$15$$anonfun$apply$2.apply$mcV$sp(LogManager.scala:335)
    at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
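The "sequenceId overflow" in the title comes from sequence numbers being Java 32-bit ints: once a producer's sequence passes Integer.MAX_VALUE, naive addition wraps to a negative value. A quick illustration of that wraparound (simulated in Python, since Python ints do not overflow):

```python
def to_int32(n):
    """Interpret n modulo 2**32 as a signed 32-bit integer,
    the way Java `int` arithmetic wraps."""
    n &= 0xFFFFFFFF
    return n - 0x1_0000_0000 if n >= 0x8000_0000 else n

INT32_MAX = 2**31 - 1  # Integer.MAX_VALUE

last_seq = INT32_MAX          # producer has exhausted the sequence space
next_seq = to_int32(last_seq + 1)
print(next_seq)               # -2147483648: a negative "sequence number"
```

Downstream code that assumes sequence numbers are non-negative then misbehaves; in this issue the corrupted producer state ends up surfacing as the NullPointerException in updateFirstUnstableOffset during log loading.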
[jira] [Created] (KAFKA-7693) "IllegalArgumentException: Invalid negative sequence number used" in Kafka Client
Ming Liu created KAFKA-7693:

Summary: "IllegalArgumentException: Invalid negative sequence number used" in Kafka Client
Key: KAFKA-7693
URL: https://issues.apache.org/jira/browse/KAFKA-7693
Project: Kafka
Issue Type: Bug
Components: clients
Reporter: Ming Liu

When operating Kafka, we have seen the Kafka client logging this kind of exception:

org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=interaction-counter-service-2-dev-by-tweet-id-counting-dest-producer, transactionalId=interaction-counter-service-2-dev-by-tweet-id-counting-81-transactional-id-gen-1] Uncaught error in kafka producer I/O thread:
java.lang.IllegalArgumentException: Invalid negative sequence number used
    at org.apache.kafka.common.record.MemoryRecordsBuilder.validateProducerState(MemoryRecordsBuilder.java:331)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.close(MemoryRecordsBuilder.java:302)
    at org.apache.kafka.clients.producer.internals.ProducerBatch.close(ProducerBatch.java:407)
    at org.apache.kafka.clients.producer.internals.RecordAccumulator.drain(RecordAccumulator.java:572)
    at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:270)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
    at java.lang.Thread.run(Thread.java:748)
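The negative sequence that validateProducerState rejects is another symptom of 32-bit sequence overflow on the client side. A sketch of overflow-safe sequence incrementing, in the spirit of the explicit wraparound later adopted in Kafka's record-batch code (this is an illustration, not the actual Java implementation):

```python
INT32_MAX = 2**31 - 1  # Integer.MAX_VALUE

def increment_sequence(sequence, increment):
    """Advance a producer sequence number, wrapping back through 0 instead
    of overflowing into negative values. Sketch only."""
    if sequence > INT32_MAX - increment:
        # Would overflow a Java int: wrap around to the start of the range.
        return increment - (INT32_MAX - sequence) - 1
    return sequence + increment

print(increment_sequence(5, 3))           # 8: the common case
print(increment_sequence(INT32_MAX, 1))   # 0: wraps instead of going negative
```

Without such wraparound handling, a long-lived idempotent producer that exhausts the int range hands MemoryRecordsBuilder a negative base sequence, producing exactly the IllegalArgumentException in the stack trace above.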