[jira] [Created] (KAFKA-12713) Report broker/consumer fetch latency more accurately

2021-04-24 Thread Ming Liu (Jira)
Ming Liu created KAFKA-12713:


 Summary: Report broker/consumer fetch latency more accurately
 Key: KAFKA-12713
 URL: https://issues.apache.org/jira/browse/KAFKA-12713
 Project: Kafka
  Issue Type: Bug
Reporter: Ming Liu


The fetch latency is an important metrics to monitor for the cluster 
performance. With ACK=ALL, the produce latency is affected primarily by broker 
fetch latency.

However, currently the reported fetch latency didn't reflect the true fetch 
latency because it sometimes need to stay in purgatory and wait for 
replica.fetch.wait.max.ms when data is not available. This greatly affect the 
real P50, P99 etc. 



I like to propose a KIP to be able track the real fetch latency for both broker 
follower and consumer. 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10398) Intra-broker disk move failed with onPartitionFenced()

2020-08-13 Thread Ming Liu (Jira)
Ming Liu created KAFKA-10398:


 Summary: Intra-broker disk move failed with onPartitionFenced()
 Key: KAFKA-10398
 URL: https://issues.apache.org/jira/browse/KAFKA-10398
 Project: Kafka
  Issue Type: Bug
Affects Versions: 2.5.0
Reporter: Ming Liu


When I tried the intra-broker disk move on 2.5.0, it always failed quickly in 
onPartitionFenced() failure. That is all the log for ReplicaAlterLogManager:
[2020-06-03 04:52:17,541] INFO [ReplicaAlterLogDirsManager on broker 5] Added 
fetcher to broker BrokerEndPoint(id=5, host=localhost:-1) for partitions 
Map(author_id_enrichment_changelog_staging-302 -> (offset=0, leaderEpoch=45)) 
(kafka.server.ReplicaAlterLogDirsManager)
[2020-06-03 04:52:17,546] INFO [ReplicaAlterLogDirsThread-5]: Starting 
(kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,546] INFO [ReplicaAlterLogDirsThread-5]: Truncating 
partition author_id_enrichment_changelog_staging-302 to local high watermark 0 
(kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,547] INFO [ReplicaAlterLogDirsThread-5]: 
Beginning/resuming copy of partition author_id_enrichment_changelog_staging-302 
from offset 0. Including this partition, there are 1 remaining partitions to 
copy by this thread. (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,547] WARN [ReplicaAlterLogDirsThread-5]: Reset fetch 
offset for partition author_id_enrichment_changelog_staging-302 from 0 to 
current leader's start offset 1656927679 
(kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,550] INFO [ReplicaAlterLogDirsThread-5]: Current offset 0 
for partition author_id_enrichment_changelog_staging-302 is out of range, which 
typically implies a leader change. Reset fetch offset to 1656927679 
(kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,653] INFO [ReplicaAlterLogDirsThread-5]: Partition 
author_id_enrichment_changelog_staging-302 has an older epoch (45) than the 
current leader. Will await the new LeaderAndIsr state before resuming fetching. 
(kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,653] WARN [ReplicaAlterLogDirsThread-5]: Partition 
author_id_enrichment_changelog_staging-302 marked as failed 
(kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,657] INFO [ReplicaAlterLogDirsThread-5]: Shutting down 
(kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,661] INFO [ReplicaAlterLogDirsThread-5]: Stopped 
(kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 04:52:17,661] INFO [ReplicaAlterLogDirsThread-5]: Shutdown 
completed (kafka.server.ReplicaAlterLogDirsThread)
Only after restart the broker, the disk move succeed. The offset and epoch 
number looks better.
[2020-06-03 05:20:12,597] INFO [ReplicaAlterLogDirsManager on broker 5] Added 
fetcher to broker BrokerEndPoint(id=5, host=localhost:-1) for partitions 
Map(author_id_enrichment_changelog_staging-302 -> (offset=166346, 
leaderEpoch=47)) (kafka.server.ReplicaAlterLogDirsManager)
[2020-06-03 05:20:12,606] INFO [ReplicaAlterLogDirsThread-5]: Starting 
(kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 05:20:12,618] INFO [ReplicaAlterLogDirsThread-5]: 
Beginning/resuming copy of partition author_id_enrichment_changelog_staging-302 
from offset 1657605964. Including this partition, there are 1 remaining 
partitions to copy by this thread. (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 05:20:20,992] INFO [ReplicaAlterLogDirsThread-5]: Shutting down 
(kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 05:20:20,994] INFO [ReplicaAlterLogDirsThread-5]: Shutdown 
completed (kafka.server.ReplicaAlterLogDirsThread)
[2020-06-03 05:20:20,994] INFO [ReplicaAlterLogDirsThread-5]: Stopped 
(kafka.server.ReplicaAlterLogDirsThread)
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10242) Adding metrics to track the total count of idempotent producers that Broker need to track

2020-07-06 Thread Ming Liu (Jira)
Ming Liu created KAFKA-10242:


 Summary: Adding metrics to track the total count of idempotent 
producers that Broker need to track
 Key: KAFKA-10242
 URL: https://issues.apache.org/jira/browse/KAFKA-10242
 Project: Kafka
  Issue Type: Improvement
  Components: producer 
Affects Versions: 2.5.0
Reporter: Ming Liu
 Fix For: 2.6.0


We found it is very useful to track the total number of idempotent producers 
that broker is tracking.
In our production environment, we have many idempotent producers for a cluster 
and sometimes that number increased to very high number which requires some 
attention to mitigate.
This is especially true for client (< 2.4) where the client retry might 
generate too many different idempotent producers which can trigger broker GC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-7698) Kafka Broker fail to start: ProducerFencedException thrown from producerstatemanager.scala!checkProducerEpoch

2018-12-03 Thread Ming Liu (JIRA)
Ming Liu created KAFKA-7698:
---

 Summary: Kafka Broker fail to start: ProducerFencedException 
thrown from producerstatemanager.scala!checkProducerEpoch 
 Key: KAFKA-7698
 URL: https://issues.apache.org/jira/browse/KAFKA-7698
 Project: Kafka
  Issue Type: Bug
Reporter: Ming Liu


During our operation of Kafka, we frequently saw this failure: 

   There was an error in one of the threads during logs loading: 
org.apache.kafka.common.errors.ProducerFencedException:

{code:java}

[06:57:09,697] INFO [ProducerStateManager partition=interaction_events-127] 
Loading producer state from snapshot file 
'/data/disk5/kafka/interaction_events-127/092130764817.snapshot' 
(kafka.log.ProducerStateManager)
[06:57:09,698] INFO [Log partition=interaction_events-127, 
dir=/data/disk5/kafka] Completed load of log with 11 segments, log start offset 
91975003024 and log end offset 92130764817 in 12701 ms (kafka.log.Log)
[06:57:09,701] ERROR There was an error in one of the threads during logs 
loading: org.apache.kafka.common.errors.ProducerFencedException: Producer's 
epoch is no longer valid. There is probably another producer with a newer 
epoch. 63 (request epoch), 66 (server epoch) (kafka.log.LogManager)
[06:57:09,705] INFO [ProducerStateManager 
partition=client_interaction_events_authorid_enrichment-20] Writing producer 
snapshot at offset 92418754384 (kafka.log.ProducerStateManager)
[06:57:09,707] ERROR [KafkaServer id=2] Fatal error during KafkaServer startup. 
Prepare to shutdown (kafka.server.KafkaServer)
org.apache.kafka.common.errors.ProducerFencedException: Producer's epoch is no 
longer valid. There is probably another producer with a newer epoch. 63 
(request epoch), 66 (server epoch)
{code:java}
 {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7692) updateFirstUnstableOffset NPE due to sequenceId overflow in ProducerStateManager.append

2018-12-02 Thread Ming Liu (JIRA)
Ming Liu created KAFKA-7692:
---

 Summary: updateFirstUnstableOffset NPE due to sequenceId overflow 
in ProducerStateManager.append
 Key: KAFKA-7692
 URL: https://issues.apache.org/jira/browse/KAFKA-7692
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.0.0
Reporter: Ming Liu
 Fix For: 2.2.0


When operating Kafka, we frequently saw this exception on Kafka server log, 

Exception: 

[2018-06-04 20:44:53,789] ERROR [KafkaServer id=19] Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.lang.NullPointerException
    at kafka.log.Log.kafka$log$Log$$updateFirstUnstableOffset(Log.scala:792)
    at kafka.log.Log.kafka$log$Log$$loadProducerState(Log.scala:518)
    at kafka.log.Log.(Log.scala:228)
    at kafka.log.Log$.apply(Log.scala:1747)
    at kafka.log.LogManager.kafka$log$LogManager$$loadLog(LogManager.scala:255)
    at 
kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$11$$anonfun$apply$15$$anonfun$apply$2.apply$mcV$sp(LogManager.scala:335)
    at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7693) "IllegalArgumentException: Invalid negative sequence number used" in Kafka Client

2018-12-02 Thread Ming Liu (JIRA)
Ming Liu created KAFKA-7693:
---

 Summary: "IllegalArgumentException: Invalid negative sequence 
number used" in Kafka Client
 Key: KAFKA-7693
 URL: https://issues.apache.org/jira/browse/KAFKA-7693
 Project: Kafka
  Issue Type: Bug
  Components: clients
Reporter: Ming Liu


When operating Kafka, we have seen Kafka client logging this kind of exception:

org.apache.kafka.clients.producer.internals.Sender  -   
   [Producer 
clientId=interaction-counter-service-2-dev-by-tweet-id-counting-dest-producer, 
transactionalId=interaction-counter-service-2-dev-by-tweet-id-counting-81-transactional-id-gen-1]
 Uncaught error in kafka producer I/O thread:

java.lang.IllegalArgumentException: Invalid negative sequence number used

    at 
org.apache.kafka.common.record.MemoryRecordsBuilder.validateProducerState(MemoryRecordsBuilder.java:331)

    at 
org.apache.kafka.common.record.MemoryRecordsBuilder.close(MemoryRecordsBuilder.java:302)

    at 
org.apache.kafka.clients.producer.internals.ProducerBatch.close(ProducerBatch.java:407)

    at 
org.apache.kafka.clients.producer.internals.RecordAccumulator.drain(RecordAccumulator.java:572)

    at 
org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:270)

    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)

    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)

    at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)