Problem Description

When a consumer created with librdkafka receives messages from Kafka, 
intermittent latency spikes are observed. Most messages are received within 
about 10 ms, but occasionally the difference between the receive time and the 
timestamp in the message body exceeds 1 second.

Environment Information

Software Versions

librdkafka version: 2.11.0
Operating System: CentOS 7.6
Kafka version: 3.6.2 (ZooKeeper mode deployment)

Kafka Cluster

Number of nodes: 3
Server configuration: 64 vCPU, 128 GB RAM
Network: Gigabit network, all nodes connected to the same switch, low latency
Disk: HDD RAID1

Topic Configuration

Test Topic (test):

Partitions: 1
Replicas: 2
message.timestamp.type=LogAppendTime
min.insync.replicas=1

Load Topics (testA, testB, testC, testD):

Each topic: 128 partitions, 2 replicas
Total message rate: 80,000 messages/second (20,000 messages/second per topic)
Message size: 500 bytes per message
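
For scale, the background load implied by these figures can be worked out with a small C sketch. The function names are hypothetical, and the calculation counts message payload only; protocol overhead, batching, and compression are ignored.

```c
#include <stdint.h>

/* Aggregate produced payload in bytes per second across all load topics. */
int64_t payload_bytes_per_sec(int64_t topics, int64_t msgs_per_sec_per_topic,
                              int64_t msg_bytes) {
    return topics * msgs_per_sec_per_topic * msg_bytes;
}

/* Bytes written cluster-wide when each message is stored on `replicas` brokers. */
int64_t replicated_bytes_per_sec(int64_t produced_bytes_per_sec, int64_t replicas) {
    return produced_bytes_per_sec * replicas;
}

/* Convert bytes per second to megabits per second (decimal units). */
int64_t to_mbit_per_sec(int64_t bytes_per_sec) {
    return bytes_per_sec * 8 / 1000000;
}
```

With 4 topics, 20,000 messages/second per topic, and 500 bytes per message, this gives 40,000,000 bytes/second (about 320 Mbit/s) of produced payload, and 80,000,000 bytes/second written across the cluster with 2 replicas.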

Consumer Configuration (librdkafka)

fetch.wait.max.ms: 10 (the issue also occurs with 500, so I lowered it to 10)
All other settings are librdkafka defaults

Reproduction Steps

1. Create four load topics (testA, testB, testC, testD), each with 128 
   partitions and 2 replicas.
2. Run test programs that send 20,000 messages per second (500 bytes each) to 
   each of the four load topics.
3. Create the test topic (test) with the configuration described above.
4. Use a test program to send 1 message per second (100 bytes) to the test 
   topic.
5. Create a consumer that subscribes to the test topic.
6. Have the consumer print the receive time of each message and the timestamp 
   in the message.
7. Observe that most messages are received within about 10 ms, but 
   occasionally a message is delayed by more than 1 second.
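
The latency check described in the steps above can be sketched in C roughly as follows. This is a minimal sketch, not the actual test program: the function names and the threshold parameter are hypothetical, and the code uses only the standard library. In a real librdkafka consumer the record timestamp would typically be read with rd_kafka_message_timestamp(), which is not called here so that the sketch has no library dependency. Because the test topic uses message.timestamp.type=LogAppendTime, that record timestamp would be the broker's append time.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Current wall-clock time in milliseconds since the Unix epoch. */
int64_t now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Delay between the message timestamp and the receive time, in ms.
 * Both values must come from the same clock to be comparable, which is
 * why the test consumer runs on the partition leader node. */
int64_t delivery_delay_ms(int64_t msg_ts_ms, int64_t recv_ms) {
    return recv_ms - msg_ts_ms;
}

/* Returns 1 if the delay exceeds the given threshold, 0 otherwise.
 * The threshold is an assumption of this sketch (e.g. 1000 ms). */
int is_delayed(int64_t delay_ms, int64_t threshold_ms) {
    return delay_ms > threshold_ms;
}
```

A consumer loop would call now_ms() as soon as a message is returned, pass it with the message timestamp to delivery_delay_ms(), and log any message for which is_delayed() returns 1.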

Key Observations

The test consumer program runs on the partition leader node of the test topic (eliminating clock differences between nodes)
Intermittent latency occurs under high load (80k msg/s)
The low-throughput topic (1 msg/s) experiences delays while the high-throughput load topics are active
Latency is intermittent, not continuous
Tested with librdkafka version 2.11.0 on CentOS 7.6

I used tcpdump to capture network packets and observed that the consumer 
issues fetch requests frequently and that Kafka's fetch responses also come 
back quickly. However, several request/response round trips take place before 
the message is received, which is the main source of the delay.

Perhaps this is not an issue with librdkafka. I set log.cleaner.threads=4 and 
num.replica.fetchers=4 on Kafka, but the problem persists. After upgrading 
Kafka to version 4.0 with a KRaft deployment and following the same test 
steps, the delay still occurs, but much less frequently, and the delays are 
around 500 ms.

Does anyone have new directions to suggest for further troubleshooting this 
issue?





杜杰