Re: message loss for sync producer, acks=2, topic replicas=3

Jiang Wu (Pricehistory) (BLOOMBERG/ 731 LEX -) Wed, 16 Jul 2014 05:45:13 -0700

Guozhang,

So this is the cause of message loss in my test where acks=2 and replicas=3:
At one moment all 3 replicas (leader L and followers F1 and F2) are in the ISR.
A publisher sends a message m to L, and F1 fetches m. Both L and F1 acknowledge
m, so the send() is successful. Before F2 fetches m, L is killed and a leader
election takes place, in which F2 is selected as the new leader. After F2
becomes the leader, it doesn't replicate m from F1, so consumers never receive m.


It seems to me that the election here is an unclean leader election that can be 
avoided. For example, instead of just choosing the first live broker in the ISR 
as the new leader, choosing the one that has fetched the most messages may 
avoid the message loss in the above scenario. Is this a feasible fix?
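
To make the scenario concrete, here is a toy model of the two election rules. It is only a sketch, not Kafka code: the class and method names, the broker ids, and the offsets are all made up for illustration, and it ignores everything else that happens during a real election.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only; every name and number here is hypothetical.
class LeaderElectionSketch {

    // Rule described above: pick the first live broker in the ISR list.
    static int firstLiveInIsr(List<Integer> isr, Set<Integer> live) {
        for (int broker : isr) {
            if (live.contains(broker)) {
                return broker;
            }
        }
        throw new IllegalStateException("no live replica in ISR");
    }

    // Proposed rule: pick the live ISR member with the highest log end offset.
    static int mostCaughtUp(List<Integer> isr, Set<Integer> live,
                            Map<Integer, Long> logEndOffset) {
        int best = -1;
        long bestOffset = -1L;
        for (int broker : isr) {
            if (live.contains(broker) && logEndOffset.get(broker) > bestOffset) {
                best = broker;
                bestOffset = logEndOffset.get(broker);
            }
        }
        if (best < 0) {
            throw new IllegalStateException("no live replica in ISR");
        }
        return best;
    }

    public static void main(String[] args) {
        // Broker 1 = L, broker 2 = F1, broker 3 = F2 (assumed ids).
        List<Integer> isr = Arrays.asList(1, 3, 2);
        // L has been killed, so only F1 and F2 are live.
        Set<Integer> live = new HashSet<>(Arrays.asList(2, 3));

        // Message m sits at offset 100: L and F1 have it, F2 does not yet.
        Map<Integer, Long> logEndOffset = new HashMap<>();
        logEndOffset.put(1, 101L);
        logEndOffset.put(2, 101L);
        logEndOffset.put(3, 100L);

        System.out.println("first live in ISR: broker " + firstLiveInIsr(isr, live));
        System.out.println("most caught up:    broker " + mostCaughtUp(isr, live, logEndOffset));
    }
}
```

With the ISR ordered as [L, F2, F1], the first rule picks F2, whose log ends before m, while the second rule picks F1, which already has m.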

Thanks,
Jiang 

----- Original Message -----
From: [email protected]
To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -)
At: Jul 15 2014 16:30:56

That is true: when a broker becomes the new leader, it stops replicating
data from others. However, what you may want to do is tune the following
configs so that replicas do not drop out of the ISR so easily under high
produce load:

replica.lag.max.messages

replica.lag.time.max.ms

You can get their description here:

http://kafka.apache.org/documentation.html#brokerconfigs
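
As a rough illustration of what these two settings control, the sketch below models the rule as the broker config descriptions state it: a follower is dropped from the ISR when it falls more than replica.lag.max.messages behind the leader, or when it has not fetched for longer than replica.lag.time.max.ms. The class, the method, and the numbers are hypothetical; this is not the broker's actual code.

```java
// Toy model of the ISR lag check; names and values are assumptions.
class IsrLagSketch {

    // Returns true if the follower would stay in the ISR under the two limits.
    static boolean staysInIsr(long leaderLogEndOffset, long followerLogEndOffset,
                              long msSinceLastFetch,
                              long lagMaxMessages, long lagTimeMaxMs) {
        boolean tooFarBehind = leaderLogEndOffset - followerLogEndOffset > lagMaxMessages;
        boolean stuck = msSinceLastFetch > lagTimeMaxMs;
        return !tooFarBehind && !stuck;
    }

    public static void main(String[] args) {
        long lagMaxMessages = 4000L;  // illustrative replica.lag.max.messages
        long lagTimeMaxMs = 10000L;   // illustrative replica.lag.time.max.ms

        // Follower is 5000 messages behind under a burst of produce load.
        System.out.println(staysInIsr(105000L, 100000L, 50L, lagMaxMessages, lagTimeMaxMs));

        // Same lag, but with the message limit raised to 20000.
        System.out.println(staysInIsr(105000L, 100000L, 50L, 20000L, lagTimeMaxMs));
    }
}
```

Raising the message limit keeps a briefly lagging follower in the ISR, at the cost of letting ISR members trail the leader by more messages.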

Guozhang


On Tue, Jul 15, 2014 at 1:25 PM, Jiang Wu (Pricehistory) (BLOOMBERG/ 731
LEX -) <[email protected]> wrote:

> When acks=-1 and the number of publisher threads is high, it always happens
> that only the leader remains in the ISR, and shutting down the leader then
> causes message loss.
>
> The leader election code shows that the new leader will be the first alive
> broker in the ISR list. So it's possible the new leader will be behind the
> followers.
>
> It seems that after a broker becomes a leader, it stops replicating from
> others even when it hasn't received all available messages?
>
> Regards,
> Jiang
>
> ----- Original Message -----
> From: [email protected]
> To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -), [email protected]
> At: Jul 15 2014 16:11:17
>
> That could be the cause, and it can be verified by changing the acks to -1
> and checking the data loss ratio then.
>
> Guozhang
>
>
> On Tue, Jul 15, 2014 at 12:49 PM, Jiang Wu (Pricehistory) (BLOOMBERG/ 731
> LEX -) <[email protected]> wrote:
>
> > Guozhang,
> > My coworker came up with an explanation: at one moment the
> > leader L and two followers F1, F2 are all in ISR. The producer sends a
> > message m1 and receives acks from L and F1. Before the message is replicated
> > to F2, L goes down. In the following leader election, F2, instead of F1,
> > becomes the leader, and loses m1 somehow.
> > Could that be the root cause?
> > Thanks,
> > Jiang
> >
> > From: [email protected] At: Jul 15 2014 15:05:25
> > To: [email protected]
> > Subject: Re: message loss for sync producer, acks=2, topic replicas=3
> >
> > Guozhang,
> >
> > Please find the config below:
> >
> > Producer:
> >
> >    props.put("producer.type", "sync");
> >
> >    props.put("request.required.acks", 2);
> >
> >    props.put("serializer.class", "kafka.serializer.StringEncoder");
> >
> >    props.put("partitioner.class", "kafka.producer.DefaultPartitioner");
> >
> >    props.put("message.send.max.retries", "60");
> >
> >    props.put("retry.backoff.ms", "300");
> >
> > Consumer:
> >
> >    props.put("zookeeper.session.timeout.ms", "400");
> >
> >    props.put("zookeeper.sync.time.ms", "200");
> >
> >    props.put("auto.commit.interval.ms", "1000");
> >
> > Broker:
> > num.network.threads=2
> > num.io.threads=8
> > socket.send.buffer.bytes=1048576
> > socket.receive.buffer.bytes=1048576
> > socket.request.max.bytes=104857600
> > num.partitions=2
> > log.retention.hours=168
> > log.retention.bytes=20000000
> > log.segment.bytes=536870912
> > log.retention.check.interval.ms=60000
> > log.cleaner.enable=false
> > zookeeper.connection.timeout.ms=1000000
> >
> > Topic:
> > Topic:p1r3      PartitionCount:1        ReplicationFactor:3
> > Configs:retention.bytes=10000000000
> >
> > Thanks,
> > Jiang
> >
> > From: [email protected] At: Jul 15 2014 13:59:03
> > To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -), [email protected]
> > Subject: Re: message loss for sync producer, acks=2, topic replicas=3
> >
> > What config property values did you use on producer/consumer/broker?
> >
> > Guozhang
> >
> >
> > On Tue, Jul 15, 2014 at 10:32 AM, Jiang Wu (Pricehistory) (BLOOMBERG/ 731
> > LEX -) <[email protected]> wrote:
> >
> > > Guozhang,
> > > I'm testing on 0.8.1.1; just kill pid, no -9.
> > > Regards,
> > > Jiang
> > >
> > > From: [email protected] At: Jul 15 2014 13:27:50
> > > To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -), [email protected]
> > > Subject: Re: message loss for sync producer, acks=2, topic replicas=3
> > >
> > > Hello Jiang,
> > >
> > > Which version of Kafka are you using, and did you kill the broker with
> > > -9?
> > >
> > > Guozhang
> > >
> > >
> > > On Tue, Jul 15, 2014 at 9:23 AM, Jiang Wu (Pricehistory) (BLOOMBERG/ 731
> > > LEX -) <[email protected]> wrote:
> > >
> > > > Hi,
> > > > I observed some unexpected message loss in a Kafka fault-tolerance test.
> > > > In the test, a topic with 3 replicas is created. A sync producer with
> > > > acks=2 publishes to the topic. A consumer consumes from the topic and
> > > > tracks message ids. During the test, the leader is killed. Both producer
> > > > and consumer continue to run for a while. After the producer stops, the
> > > > consumer reports whether all messages were received.
> > > >
> > > > The test was repeated for multiple rounds; message loss happened in about
> > > > 10% of the tests. A typical scenario is as follows: before the leader is
> > > > killed, all 3 replicas are in ISR. After the leader is killed, one
> > > > follower becomes the leader, and 2 replicas (including the new leader)
> > > > are in ISR. Both the producer and consumer pause for several seconds
> > > > during that time, and then continue. Message loss happens after the
> > > > leader is killed.
> > > >
> > > > Because the new leader is in ISR before the old leader is killed, unclean
> > > > leader election doesn't explain the message loss.
> > > >
> > > > I'm wondering if anyone else has also observed such message loss? Is there
> > > > any known issue that may cause the message loss in the above scenario?
> > > >
> > > > Thanks,
> > > > Jiang
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
> >
> >
>
>
> --
> -- Guozhang
>
>


-- 
-- Guozhang
