I'm preparing a longer post here, but we recently ran into a similar
scenario.  Not sure yet if it's the same thing you saw (but it feels
similar).  We were also doing a rolling upgrade from 0.8.1.1 to 0.8.2.1,
and during the controlled shutdown of the first node (of a 4 node cluster),
the shutdown took longer than normal: it timed out and retried several
times, and unfortunately our deployment system decided to kill the broker
hard while it was in the middle of its 4th controlled shutdown retry.
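
In case it helps anyone else double-check this, the controlled shutdown
retry behaviour is driven by a handful of broker settings in
server.properties (names per the 0.8.x docs; the values below are only an
illustration, not a recommendation):

    # server.properties - controlled shutdown behaviour (0.8.x)
    # illustrative values only, not what we actually run
    controlled.shutdown.enable=true
    # how many times the broker retries a failed controlled shutdown
    controlled.shutdown.max.retries=3
    # backoff between retries, in milliseconds
    controlled.shutdown.retry.backoff.ms=5000

Whatever hard-kill timeout your deployment system uses should presumably be
comfortably larger than max.retries * backoff plus the time leadership
migration actually takes, otherwise you end up exactly where we did.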

Anyway, when the node came back, it naturally decided to 'restore' most of
its partitions, which took some time (but only about 5 minutes).  What's
weird is that it didn't resync data from other replicas; it just restored
the partitions locally.  During this time, the rest of the cluster failed
to elect any new leaders, so for those 5 minutes the partitions were
unavailable (and we saw a flood of failed FetcherManager exceptions from
the other nodes in the cluster).  Most of the partitions were empty (i.e.
there's no way the other replicas were legitimately behind and out of the
ISR).  During those 5 minutes, producers were unable to send messages due
to NotLeaderForPartition exceptions; apparently the controller was still
pointing them at the unavailable broker.
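
For what it's worth, a check like the following would have shown the
leaderless partitions directly (this assumes the kafka-topics.sh that ships
with 0.8.x; zk1:2181 is just a placeholder for your ZooKeeper connect
string):

    # list partitions that currently have no leader at all
    # zk1:2181 is a placeholder connect string
    bin/kafka-topics.sh --zookeeper zk1:2181 --describe --unavailable-partitions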

When the first node finally came back up, the other nodes were somewhat
happy again (though a few partitions remained under-replicated
indefinitely).  Because of this, we decided to pause the rolling restart
and wait for the under-replicated partitions to get back in sync.
Unfortunately, about an hour later, the whole cluster went foobar (e.g.
partitions became unavailable, brokers logged a flood of Fetcher errors,
producers couldn't find a valid leader, metadata requests timed out, etc.).
In a panic, we reverted that first node back to 0.8.1.1.  That did not
help, unfortunately, so, figuring we'd probably already lost data at this
point (producers could not send data due to NotLeaderForPartition
exceptions), we decided to just forcibly complete the upgrade to 0.8.2.1.
This was all a bad situation, of course.
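
(For reference, the set of under-replicated partitions we were waiting on
can be watched from the command line with the same tool; again, zk1:2181
below is a placeholder connect string:)

    # partitions whose ISR is currently smaller than the assigned replica list
    # zk1:2181 is a placeholder connect string
    bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions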

So, now we have the cluster stable at 0.8.2.1, but like you, we are very,
very nervous about doing any kind of restart on any of our nodes.  We lost
data, primarily in the form of producers failing to send during the periods
of unavailability.

It looks like the root cause, in our case, was a flood of created topics
(long since unused and empty).  This appears to have caused the
longer-than-normal controlled shutdown, which in turn led to the follow-on
problems.  However, in the past we've seen a controlled shutdown failure
result in an unclean shutdown, and usually the cluster recovers (it elects
new leaders, and when the node comes back, it recovers the partitions that
were uncleanly shut down).  That did not happen this time: the rest of the
cluster got into an apparent infinite loop where it tried repeatedly (on
the order of 500K times a minute) to fetch partitions that were
unavailable.
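
If you want to check whether you're in a similar boat (i.e. a large number
of mostly-unused topics), counting topics and partitions is cheap (same
placeholder connect string as above):

    # rough count of topics in the cluster
    bin/kafka-topics.sh --zookeeper zk1:2181 --list | wc -l

    # rough count of partitions across all topics
    bin/kafka-topics.sh --zookeeper zk1:2181 --describe | grep -c 'Partition:'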

I'm preparing a longer post with more detail (will take a bit of time).

Jason

On Thu, Apr 2, 2015 at 10:19 PM, Gwen Shapira <gshap...@cloudera.com> wrote:

> Wow, that's scary for sure.
>
> Just to be clear - all you did was restart *one* broker in the cluster?
> Everything else was ok before the restart? And that was a controlled
> shutdown?
>
> Gwen
>
> On Wed, Apr 1, 2015 at 11:54 AM, Thunder Stumpges <tstump...@ntent.com>
> wrote:
>
> > Well, it appears we lost all the data on the one node again. It looks
> > like all or part of KAFKA-1647
> > (https://issues.apache.org/jira/browse/KAFKA-1647), as we saw this in our
> > logs (for all topics):
> >
> > [2015-04-01 10:46:58,901] WARN Partition [logactivity-redirect,3] on broker 6: No checkpointed highwatermark is found for partition [logactivity-redirect,3] (kafka.cluster.Partition)
> > [2015-04-01 10:46:58,902] WARN Partition [pageimpression,1] on broker 6: No checkpointed highwatermark is found for partition [pageimpression,1] (kafka.cluster.Partition)
> > [2015-04-01 10:46:58,904] WARN Partition [campaignplatformtarget,6] on broker 6: No checkpointed highwatermark is found for partition [campaignplatformtarget,6] (kafka.cluster.Partition)
> > [2015-04-01 10:46:58,905] WARN Partition [trackingtags-c2,1] on broker 6: No checkpointed highwatermark is found for partition [trackingtags-c2,1] (kafka.cluster.Partition)
> >
> > Followed by:
> >
> > [2015-04-01 10:46:58,911] INFO Truncating log trafficshaperlog-3 to offset 0. (kafka.log.Log)
> > [2015-04-01 10:46:58,928] INFO Truncating log videorecrequest-0 to offset 0. (kafka.log.Log)
> > [2015-04-01 10:46:58,928] INFO Truncating log filteredredirect-2 to offset 0. (kafka.log.Log)
> > [2015-04-01 10:46:58,985] INFO Truncating log precheckrequest-3 to offset 0. (kafka.log.Log)
> > [2015-04-01 10:46:58,990] INFO Truncating log filteredclicklog-8 to offset 0. (kafka.log.Log)
> >
> > Followed by:
> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 90276704 for log pageview-0 for deletion. (kafka.log.Log)
> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 90593329 for log pageview-0 for deletion. (kafka.log.Log)
> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 90906048 for log pageview-0 for deletion. (kafka.log.Log)
> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 91226773 for log pageview-0 for deletion. (kafka.log.Log)
> >
> >
> > The strange thing, however, is that I don't believe we ever had the
> > scenario mentioned in the bug (all brokers for a topic down): we have 0
> > unclean leader elections, our applications never complained the cluster
> > was down, and we never actually stopped more than one node (the node in
> > question).
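> >
> > (For reference, the high-watermark checkpoints live in a file called
> > replication-offset-checkpoint inside each log directory, so something
> > like the following - with the path swapped for whatever your log.dirs
> > points at - shows whether the checkpoints are there before a restart:)
> >
> >   # /data/kafka is a placeholder for the broker's log.dirs location
> >   head /data/kafka/replication-offset-checkpoint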
> >
> > Anyway, I'm really unsure whether this could have somehow been related to
> > the attempted upgrade to 0.8.2.1 (this morning's occurrence was NOT - it
> > was a simple stop and restart of the broker).
> >
> > We are really nervous about doing any Kafka service restarts now that
> > this has happened twice on this machine.
> > Any suggestions? Should we go back and make another attempt to upgrade to
> > 0.8.2.1?
> >
> > Thanks,
> > Thunder
> >
> >
> > From: Thunder Stumpges
> > Sent: Wednesday, April 01, 2015 10:30 AM
> > To: kafka-us...@incubator.apache.org
> > Subject: Problem with node after restart no partitions?
> >
> > Hi group,
> >
> > We've been trying to track down a problem this morning for a little
> > while, and thought I'd ask here while we keep looking.
> >
> > We have 3 nodes (rep-3) running 0.8.1.1. We attempted a rolling upgrade
> > to 0.8.2.1 yesterday, and on the first node, after restarting, a single
> > topic (a samza intermediate topic) started throwing replica fetcher
> > errors over and over ("NotLeaderForPartition"). There may or may not have
> > been other things attempted at this time (not by me, so I cannot say for
> > sure). Anyway, we ended up rolling back to 0.8.1.1 and ALL data had been
> > DELETED from that node. It spent most of yesterday re-syncing, came into
> > sync last night, and a rebalance made everything run smoothly (except for
> > these damn replica fetcher errors for that one partition).
> >
> > Today my colleague attempted the "unsupported" topic delete command for
> > the "bad" partition, and bounced that one troublesome node.
> >
> > Upon coming up, I can see in server.log that it is reading in all of the
> > segments, and then it starts spitting out a samza topic fetch error, and
> > through JMX the "ReplicaManager".LeaderCount is 0. It is not attempting
> > to fetch or load any topics.
> >
> > The other two brokers are showing under-replicated (obviously). What is
> > going wrong? How can we get that samza topic really and truly gone? (if
> > that is the cause of the broker not coming up)
> >
> > Thanks,
> > Thunder
> >
> >
>
