Jun,
Yeah, I probably have an off-by-one issue in the HW description. I think you could pick any number here and the problem remains -- could you read through the steps I posted and see if they logically make sense, numbers aside?

We definitely lost data in 4 partitions of the 8,000 that were elected, and there was only a single election for each partition. We had done a rolling restart hours before, but that had finished a long time earlier and everything was stable. We do not allow automatic election; it's a manual process that we initiate after the cluster has stabilized.

So in this case, I still don't think any discussion about multiple failovers is germane to the problem we saw. Each of our partitions only had a single failover, and yet 4 of them still truncated committed data.

--
Mark Smith
m...@qq.is

On Mon, Nov 21, 2016, at 05:12 PM, Jun Rao wrote:
> Hi, Mark,
>
> Hmm, the committing of a message at offset X is the equivalent of saying that the HW is at offset X + 1. So, in your example, if the producer publishes a new message at offset 37, this message won't be committed (i.e., HW moves to offset 38) until the leader sees the follower fetch from offset 38 (not offset 37). At that point, the follower would have received the message at offset 37 in the fetch response and appended that message to its local log. If the follower now becomes the new leader, the message at offset 37 is preserved.
>
> The problem that I described regarding data loss can happen during a rolling restart. Suppose that you have 3 replicas A, B, and C. Let's say A is the preferred leader, but during the deployment the leader gets moved to replica B at some point and all 3 replicas are in sync. A new message is produced at offset 37 and is committed (leader's HW = 38). However, the HW in replica A is still at 37. Now, we try to shut down broker B and the leader gets moved to replica C. Replica A starts to follow replica C and it first truncates to its HW of 37, which removes the message at offset 37. Now the preferred leader logic kicks in and the leadership switches again to replica A. Since A doesn't have the message at offset 37 any more and all followers copy messages from replica A, the message at offset 37 is lost.
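> To make that sequence concrete, here is a toy model of it in plain Java. This is only an illustration of the ordering, not the actual broker code; the replica names and offsets are just the ones from the example above.
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     // Toy model of losing a committed message to a stale HW (not broker code).
>     public class StaleHwLoss {
>         static class Replica {
>             final String name;
>             final List<String> log = new ArrayList<>();
>             long hw; // high watermark as known locally; may lag the leader's
>             Replica(String name) { this.name = name; }
>             void truncateToLocalHw() {
>                 // On become-follower, a replica first truncates to its local HW.
>                 while (log.size() > hw) log.remove(log.size() - 1);
>             }
>         }
>
>         public static void main(String[] args) {
>             Replica a = new Replica("A"), b = new Replica("B"), c = new Replica("C");
>             for (Replica r : new Replica[]{a, b, c}) {
>                 for (int i = 0; i <= 36; i++) r.log.add("m" + i); // offsets 0..36 committed
>                 r.hw = 37;
>             }
>             // The message at offset 37 is produced and fully replicated; leader B
>             // advances its HW to 38, but the HW only propagates to followers
>             // asynchronously, so replica A still thinks the HW is 37.
>             for (Replica r : new Replica[]{a, b, c}) r.log.add("m37");
>             b.hw = 38;
>             c.hw = 38; // a.hw is still 37
>
>             // B shuts down and C becomes leader; A becomes a follower of C and
>             // truncates to its own stale HW, dropping the committed offset 37.
>             a.truncateToLocalHw();
>             // Preferred leader logic then moves leadership back to A, and every
>             // follower re-syncs from A's (shorter) log.
>             System.out.println(a.name + "'s last offset: " + (a.log.size() - 1)); // prints 36
>         }
>     }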
> With KAFKA-3670, in the above example, when shutting down broker B, the leader will be moved directly to replica A since it's a preferred replica. So the above scenario won't happen.
>
> The more complete fix is in KAFKA-1211. The logic for getting the latest generation snapshot is just a proposal and is not in the code base yet.
>
> Thanks,
>
> Jun
>
> On Mon, Nov 21, 2016 at 3:20 PM, Mark Smith <m...@qq.is> wrote:
>> Jun,
>>
>> Thanks for the reply!
>>
>> I am aware the HW won't advance until the in-sync replicas have _requested_ the messages. However, I believe the issue is that the leader has no guarantee the replicas have _received_ the fetch response. There is no second phase to the commit.
>>
>> So, in the particular case where a leader transition happens, I believe this race condition exists (and I'm happy to be wrong here, but it looks feasible and explains the data loss I saw):
>>
>> 1. Starting point: Leader and Replica both only have up to message #36
>> 2. Client produces new message with required.acks=all
>> 3. Leader commits message #37, but HW is still #36, so the produce request is blocked
>> 4. Replica fetches messages (leader has RECEIVED the fetch request)
>> 5. Leader then advances HW to #37 and unblocks the produce request; client believes it's durable
>> 6. PREFERRED REPLICA ELECTION BEGINS
>> 7. Replica starts the become-leader process
>> 8. Leader finishes sending the fetch response; the replica is only now seeing message #37
>> 9. Replica throws away the fetch response from step 4 because it is now becoming leader (the partition has been removed from partitionMap, so it looks like the data is ignored)
>> 10. Leader starts become-follower
>> 11. Leader truncates to the replica's HW offset of #36
>> 12. Message #37 was durably committed but is now lost
>>
>> For the tickets you linked:
>>
>> https://issues.apache.org/jira/browse/KAFKA-3670
>> * There was no shutdown involved in this case, so this shouldn't be a factor here.
>>
>> https://issues.apache.org/jira/browse/KAFKA-1211
>> * I've read through this but I'm not entirely sure it addresses the above. I don't think it does, though. I don't see a step in the ticket about become-leader making a call to the old leader to get the latest generation snapshot?
>>
>> --
>> Mark Smith
>> m...@qq.is
>>
>> On Fri, Nov 18, 2016, at 10:52 AM, Jun Rao wrote:
>> > Mark,
>> >
>> > Thanks for reporting this. First, a clarification. The HW is actually never advanced until all in-sync followers have fetched the corresponding message. For example, in step 2, if all follower replicas issue a fetch request at offset 10, it serves as an indication that all replicas have received messages up to offset 9. So, only then is the HW advanced to offset 10 (which is not inclusive).
>> >
>> > I think the problem that you are seeing is probably caused by two known issues. The first one is https://issues.apache.org/jira/browse/KAFKA-1211. The issue is that the HW is propagated asynchronously from the leader to the followers. If the leadership changes multiple times very quickly, what can happen is that a follower first truncates its data up to the HW and then immediately becomes the new leader. Since the follower's HW may not be up to date, some previously committed messages could be lost. The second one is https://issues.apache.org/jira/browse/KAFKA-3670. The issue is that controlled shutdown and leader balancing can cause leadership to change more than once quickly, which could expose the data loss problem in the first issue.
>> >
>> > The second issue has been fixed in 0.10.0. So, if you upgrade to that version or above, it should reduce the chance of hitting the first issue significantly. We are actively working on the first issue and hopefully it will be addressed in the next release.
>> >
>> > Jun
>> >
>> > On Thu, Nov 17, 2016 at 5:39 PM, Mark Smith <m...@qq.is> wrote:
>> > >
>> > > Hey folks,
>> > >
>> > > I work at Dropbox, and I was doing some maintenance yesterday and it looks like we lost some committed data during a preferred replica election. As far as I understand this shouldn't happen, but I have a theory and want to run it by y'all.
>> > >
>> > > Preamble:
>> > > * Kafka 0.9.0.1
>> > > * required.acks = -1 (All)
>> > > * min.insync.replicas = 2 (only 2 replicas for the partition, so we require both to have the data)
>> > > * consumer is Kafka Connect
>> > > * 1400 topics, total of about 15,000 partitions
>> > > * 30 brokers
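>> > >
>> > > (For clarity, the acks side of that looks roughly like the following on the producer -- purely illustrative, the broker address is made up and this isn't our actual client code. acks=all is the same thing as required.acks=-1.)
>> > >
>> > >     import java.util.Properties;
>> > >     import org.apache.kafka.clients.producer.KafkaProducer;
>> > >     import org.apache.kafka.clients.producer.ProducerRecord;
>> > >     import org.apache.kafka.common.serialization.StringSerializer;
>> > >
>> > >     // Illustrative only. With acks=all and the topic set to
>> > >     // min.insync.replicas=2 on a 2-replica topic, send() does not complete
>> > >     // until the leader considers the message committed.
>> > >     public class DurableProduce {
>> > >         public static void main(String[] args) throws Exception {
>> > >             Properties props = new Properties();
>> > >             props.put("bootstrap.servers", "broker1:9092");
>> > >             props.put("acks", "all");
>> > >             props.put("key.serializer", StringSerializer.class.getName());
>> > >             props.put("value.serializer", StringSerializer.class.getName());
>> > >             try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>> > >                 producer.send(new ProducerRecord<>("goscribe.client-host_activity", "payload"))
>> > >                         .get(); // blocks until the leader advances the HW past this offset
>> > >             }
>> > >         }
>> > >     }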
>> > >
>> > > I was performing some rolling restarts of brokers yesterday as part of our regular DRT (disaster testing) process and at the end that always leaves many partitions that need to be failed back to the preferred replica. There were about 8,000 partitions that needed moving. I started the election in Kafka Manager and it worked, but it looks like 4 of those 8,000 partitions experienced some relatively small amount of data loss at the tail.
>> > >
>> > > From the Kafka Connect point of view, we saw a handful of these:
>> > >
>> > > [2016-11-17 02:55:26,513] [WorkerSinkTask-clogger-analytics-staging-8-5] INFO Fetch offset 67614479952 is out of range, resetting offset (o.a.k.c.c.i.Fetcher:595)
>> > >
>> > > I believe that was because it asked the new leader for data and the new leader had less data than the old leader. Indeed, the old leader became a follower and immediately truncated:
>> > >
>> > > 2016-11-17 02:55:27,237 INFO log.Log: Truncating log goscribe.client-host_activity-21 to offset 67614479601.
>> > >
>> > > Given the above production settings I don't know why KC would ever see an OffsetOutOfRange error but this caused KC to reset to the beginning of the partition. Various broker logs for the failover paint the following timeline: https://gist.github.com/zorkian/d80a4eb288d40c1ee7fb5d2d340986d6
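>> > >
>> > > (Side note: I assume the reset-to-beginning behaviour is the consumer's auto.offset.reset policy kicking in. Something like the following -- illustrative only, broker and group id are made up and I haven't checked what Connect actually sets for its consumers -- would surface the problem as an exception instead of silently rewinding:)
>> > >
>> > >     import java.util.Properties;
>> > >     import org.apache.kafka.clients.consumer.KafkaConsumer;
>> > >     import org.apache.kafka.common.serialization.StringDeserializer;
>> > >
>> > >     // Illustrative only. With auto.offset.reset=none, a fetch offset that is
>> > >     // out of range comes back from poll() as an OffsetOutOfRangeException
>> > >     // rather than being silently reset to the beginning (earliest) or end
>> > >     // (latest) of the partition.
>> > >     public class StrictOffsetConsumer {
>> > >         public static void main(String[] args) {
>> > >             Properties props = new Properties();
>> > >             props.put("bootstrap.servers", "broker1:9092");
>> > >             props.put("group.id", "example-group");
>> > >             props.put("auto.offset.reset", "none");
>> > >             props.put("key.deserializer", StringDeserializer.class.getName());
>> > >             props.put("value.deserializer", StringDeserializer.class.getName());
>> > >             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
>> > >             consumer.close();
>> > >         }
>> > >     }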
>> > >
>> > > My current working theory that I'd love eyes on:
>> > >
>> > > 1. Leader receives a produce request and appends to the log, incrementing the LEO, but given the durability requirements the HW is not incremented and the produce response is delayed (normal)
>> > > 2. Replica sends a Fetch request to the leader as part of the normal replication flow
>> > > 3. Leader increments the HW when it STARTS to respond to the Fetch request (per fetchMessages in ReplicaManager.scala), so the HW is updated as soon as we've prepared messages for the response -- importantly, the HW is updated even though the replica has not yet actually seen the messages, even given the durability settings we've got
>> > > 4. Meanwhile, Kafka Connect sends a Fetch request to the leader and receives the messages below the new HW, but the messages have still not actually been received by the replica
>> > > 5. Preferred replica election begins (oh the travesty!)
>> > > 6. Replica starts the become-leader process, and makeLeader removes this partition from partitionMap, which means that when the response finally comes in, we ignore it (we discard the old-leader committed messages)
>> > > 7. Old-leader starts the become-follower process and truncates to the HW of the new-leader, i.e. the old-leader has now thrown away data it had committed and given out moments ago
>> > > 8. Kafka Connect sends a Fetch request to the new-leader, but its offset is now greater than the HW of the new-leader, so we get the OffsetOutOfRange error and restart
>> > >
>> > > Can someone tell me whether or not this is plausible? If it is, is there a known issue/bug filed for it? I'm not exactly sure what the solution is, but it does seem unfortunate that a normal operation (leader election with both brokers alive and well) can result in the loss of committed messages.
>> > >
>> > > And, if my theory doesn't hold, can anybody explain what happened? I'm happy to provide more logs or whatever.
>> > >
>> > > Thanks!
>> > >
>> > > --
>> > > Mark Smith
>> > > m...@qq.is
>> >
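
P.S. In case it helps to see the single-failover ordering in one place, here is a toy, single-threaded model of the sequence from my original mail (plain Java, not broker code -- the HW bump on response construction and the partitionMap removal are my reading of the 0.9 code, so treat this as a sketch of my theory, not a statement of fact):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Toy, single-threaded model of the single-failover race described above.
    // "leader" is the old leader, "replica" is the follower that wins the
    // preferred replica election. Offsets follow the #36/#37 example.
    public class SingleFailoverRace {
        public static void main(String[] args) {
            String partition = "goscribe.client-host_activity-21";
            List<String> leaderLog = new ArrayList<>();
            List<String> replicaLog = new ArrayList<>();
            for (int i = 0; i <= 36; i++) { leaderLog.add("m" + i); replicaLog.add("m" + i); }
            long leaderHw = 36;
            // Stand-in for the follower's partitionMap: fetch responses for
            // partitions no longer in this set are dropped.
            Set<String> partitionMap = new HashSet<>();
            partitionMap.add(partition);

            // Steps 1-3: produce is appended on the leader; HW stays at #36 and
            // the produce response is held back.
            leaderLog.add("m37");

            // Steps 4-5: the replica's fetch arrives; the leader builds the
            // response and advances the HW before the replica has appended
            // anything, so the producer (and Kafka Connect) now see #37 as committed.
            List<String> inFlightResponse = new ArrayList<>(leaderLog.subList(37, 38));
            leaderHw = 37;

            // Steps 6-9: preferred replica election; become-leader removes the
            // partition from partitionMap, so the in-flight response is discarded.
            partitionMap.remove(partition);
            if (partitionMap.contains(partition)) {
                replicaLog.addAll(inFlightResponse); // never runs: response dropped
            }
            long replicaHw = replicaLog.size() - 1; // new leader's HW = #36

            // Steps 10-11: the old leader becomes a follower and truncates to the
            // new leader's HW, throwing away the message it just acknowledged.
            while (leaderLog.size() - 1 > replicaHw) leaderLog.remove(leaderLog.size() - 1);

            System.out.println("old leader HW before failover: " + leaderHw);                        // 37
            System.out.println("old leader last offset after truncation: " + (leaderLog.size() - 1)); // 36
            System.out.println("new leader last offset: " + (replicaLog.size() - 1));                 // 36
        }
    }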