Hi,
I observed some unexpected message loss in a Kafka fault-tolerance test. In the 
test, a topic with 3 replicas is created. A sync producer with acks=2 publishes 
to the topic. A consumer consumes from the topic and tracks message ids. During 
the test, the leader is killed. Both producer and consumer continue to run for 
a while. After the producer stops, the consumer reports whether all messages 
were received.
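
To make the check concrete, here is a minimal sketch of the comparison done at 
the end of a run. It is an illustration only, not the actual test code: the 
function name find_missing_ids and the sample ids are hypothetical, and it 
assumes each message carries a unique integer id.

```python
def find_missing_ids(sent_ids, received_ids):
    """Return, in sorted order, the ids that were published but never consumed.

    sent_ids:     ids for which the sync producer's send call returned successfully
    received_ids: ids seen by the consumer (duplicates are tolerated)
    """
    return sorted(set(sent_ids) - set(received_ids))


# Hypothetical run: ids 0..9 were acknowledged, the consumer never saw 4 and 5.
sent = range(10)
received = [0, 1, 2, 3, 6, 7, 8, 9, 9]  # the repeated 9 is a harmless duplicate
missing = find_missing_ids(sent, received)
print("lost messages:", missing)
```

Only ids whose send was acknowledged are counted as sent, so a send that 
failed during the failover is not reported as a loss.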

The test was repeated for multiple rounds; message loss occurred in about 10% of 
the runs. A typical scenario is as follows: before the leader is killed, all 3 
replicas are in ISR. After the leader is killed, one follower becomes the 
leader, and 2 replicas (including the new leader) are in ISR. Both the producer 
and consumer pause for several seconds during that time, and then continue. 
Message loss happens after the leader is killed.

Because the new leader is in ISR before the old leader is killed, unclean 
leader election doesn't explain the message loss.

Has anyone else observed this kind of message loss? Is there a known issue 
that could cause message loss in the above scenario?

Thanks,
Jiang
