Jun,
There're still other concerns regarding ack=-1. A single disk failure may cause 
data loss for ack=-1. When 2 out 3 brokers fail out of ISR, acknowledged 
messages may be stored in the leader only. If the leader disk failure happens, 
then these messages are lost. In a less severe situtation where the leader is 
temporarily unavailable, message loss or service blocking may happen, depending 
on whether unclean leader election is enabled.

I didn't see message loss in my tests in case 3.1 where 2 brokers are restarted 
in sequence. I had though it would happen because lagged broker could join in 
ISR, but as you explained it's not the case. Thanks for your clarification.

I filed a jira to record the discussion on this topic 
https://issues.apache.org/jira/i#browse/KAFKA-1555.

Regards,
Jiang

----- Original Message -----
From: jun...@gmail.com
To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -), users@kafka.apache.org
At: Jul 22 2014 17:59:55

Jiang,

In case 3.1, when C restarts, the protocol is that C can only join ISR if
it has received all messages up to the current high watermark.

For example, let's assume that M is 10. Let's say A, B, C all have messages
at offset 100 and all those messages are committed (therefore high
watermark is at 100). Then C dies. After that, we commit 5 more messages
with both A and B (high watermark is at 105). Now, C is restarted. C is
actually not allowed to rejoin ISR until its log end offset has passed 105.
This means that C must first fetch the 5 newly committed messages before
being added to ISR.

Are you observing data loss in this case?

Thanks,

Jun


On Tue, Jul 22, 2014 at 6:21 AM, Jiang Wu (Pricehistory) (BLOOMBERG/ 731
LEX -) <jwu...@bloomberg.net> wrote:

> kafka-1028 addressed another unclean leader election problem. It prevents
> a broker not in ISR from becoming a leader. The problem we are facing is
> that a broker in ISR but without complete messages may become a leader.
> It's also a kind of unclean leader election, but not the one that
> kafka-1028 addressed.
>
> Here I'm trying to give a proof that current kafka doesn't achieve the
> requirement (no message loss, no blocking when 1 broker down) due to its
> two behaviors:
> 1. when choosing a new leader from 2 followers in ISR, the one with less
> messages may be chosen as the leader
> 2. even when replica.lag.max.messages=0, a follower can stay in ISR when
> it has less messages than the leader.
>
> We consider a cluster with 3 brokers and a topic with 3 replicas. We
> analyze different cases according to the value of request.required.acks
> (acks for short). For each case and it subcases, we find situations that
> either message loss or service blocking happens. We assume that at the
> beginning, all 3 replicas, leader A, followers B and C, are in sync, i.e.,
> they have the same messages and are all in ISR.
>
> 1. acks=0, 1, 3. Obviously these settings do not satisfy the requirement.
> 2. acks=2. Producer sends a message m. It's acknowledged by A and B. At
> this time, although C hasn't received m, it's still in ISR. If A is killed,
> C can be elected as the new leader, and consumers will miss m.
> 3. acks=-1. Suppose replica.lag.max.messages=M. There are two sub-cases:
> 3.1 M>0. Suppose C be killed. C will be out of ISR after
> replica.lag.time.max.ms. Then the producer publishes M messages to A and
> B. C restarts. C will join in ISR since it is M messages behind A and B.
> Before C replicates all messages, A is killed, and C becomes leader, then
> message loss happens.
> 3.2 M=0. In this case, when the producer publishes at a high speed, B and
> C will fail out of ISR, only A keeps receiving messages. Then A is killed.
> Either message loss or service blocking will happen, depending on whether
> unclean leader election is enabled.
>
>
> From: users@kafka.apache.org At: Jul 21 2014 22:28:18
> To: JIANG WU (PRICEHISTORY) (BLOOMBERG/ 731 LEX -), users@kafka.apache.org
> Subject: Re: how to ensure strong consistency with reasonable availability
>
> You will probably need 0.8.2  which gives
> https://issues.apache.org/jira/browse/KAFKA-1028
>
>
> On Mon, Jul 21, 2014 at 6:37 PM, Jiang Wu (Pricehistory) (BLOOMBERG/ 731
> LEX -) <jwu...@bloomberg.net> wrote:
>
> > Hi everyone,
> >
> > With a cluster of 3 brokers and a topic of 3 replicas, we want to achieve
> > the following two properties:
> > 1. when only one broker is down, there's no message loss, and
> > procuders/consumers are not blocked.
> > 2. in other more serious problems, for example, one broker is restarted
> > twice in a short period or two brokers are down at the same time,
> > producers/consumers can be blocked, but no message loss is allowed.
> >
> > We haven't found any producer/broker paramter combinations that achieve
> > this. If you know or think some configurations will work, please post
> > details. We have a test bed to verify any given configurations.
> >
> > In addition, I'm wondering if it's necessary to open a jira to require
> the
> > above feature?
> >
> > Thanks,
> > Jiang
>
>
>

Reply via email to