> Mostly updating the version variable in our Puppet config file (masterless)
> and applying it manually per instance. It works surprisingly well this way.

Sure, we do the same, but with Chef, and we still follow that process: lock
the inter-broker protocol version and log message format version to the
existing version first, then upgrade one binary and restart one broker.
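
For reference, a rough sketch of what that pinning looks like in
server.properties before swapping the binary (the version strings are just
examples assuming you're coming from 0.11.0 -- use whatever you're
currently running):

  # server.properties on every broker, set BEFORE upgrading the binary
  # (example values; match your current running version)
  inter.broker.protocol.version=0.11.0
  log.message.format.version=0.11.0

Only once all brokers are on the new binary and healthy do we bump
inter.broker.protocol.version (and later log.message.format.version) and
do another rolling bounce.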

Then check that everything is OK before proceeding, one step at a time. The
OK check is, at minimum, that there are no under-replicated or offline
partitions reported by the cluster.
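
As a rough sketch, the quick command-line check looks something like this
(the zookeeper host is a placeholder; both commands should print nothing
when the cluster is healthy):

  bin/kafka-topics.sh --zookeeper zk-host:2181 --describe \
    --under-replicated-partitions
  bin/kafka-topics.sh --zookeeper zk-host:2181 --describe \
    --unavailable-partitions

Watching the UnderReplicatedPartitions and OfflinePartitionsCount JMX
metrics works too, if you already graph those.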

If you had offline partitions, was it across multiple brokers, or just one?
And at which point in that process did it happen?

Also, are all topics configured for HA? (RF >= 2, and RF >
min.insync.replicas).
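
For example (topic name and partition count are just illustrative), a
create along these lines keeps a topic writable while one broker is down
or mid-restart:

  bin/kafka-topics.sh --zookeeper zk-host:2181 --create \
    --topic example-topic --partitions 6 --replication-factor 3 \
    --config min.insync.replicas=2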

Being forced into choosing an unclean leader election sounds to me like
there was at least one broker which hadn't recovered yet, and I'd expect
there to be something in its logs. And if you were following the rolling
upgrade method correctly, it very likely happened part way through it?
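
If you do end up allowing it for specific topics during an incident, as you
did, it can be toggled per topic at runtime roughly like this (topic name
and zookeeper host are placeholders), and switched back off the same way
once the cluster is healthy again:

  bin/kafka-configs.sh --zookeeper zk-host:2181 --entity-type topics \
    --entity-name the-affected-topic --alter \
    --add-config unclean.leader.election.enable=true

Just keep in mind it deliberately trades possible data loss for
availability, which sounds like what you saw.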

On Mon, Apr 23, 2018 at 5:42 PM, Mika Linnanoja <mika.linnan...@rovio.com>
wrote:

> Hi,
>
> On Mon, Apr 23, 2018 at 10:25 AM, Brett Rann <br...@zendesk.com.invalid>
> wrote:
>
> > Firstly, 1.0.1 is out and I'd strongly advise you to use that as the
> > upgrade path over 1.0.0 if you can because it contains a lot of bugfixes.
> > Some critical.
> >
>
> Yeah, it would've just meant starting the whole process from scratch in all
> of our clusters. We had several other clusters on 1.0 in production and
> everything was working A-OK with lighter workloads though, so we didn't
> really consider newer versions.
>
> Luckily we have not hit e.g. that file descriptor bug some of our devs were
> worried about for 1.0 (https://issues.apache.org/jira/browse/KAFKA-6529).
>
>
> > With unclean leader elections it should have resolved itself when the
> > affected broker came back online and all partitions were available. So
> > probably there was an issue there.
> >
>
> We moved to the new (mostly default) config file that comes with 1.0, so
> unclean elections were not enabled by default, sadly.
>
> As mentioned, enabling it for the affected topics fixed this issue straight
> away, but it took a while to understand what was going on, hence some data
> loss. Random googling to the rescue; I'm the first to admit I'm no kind of
> Kafka expert, to be honest.
>
> > Personally I had a lot of struggles upgrading off of 0.10 with bugged
> > large consumer offset partitions (10s and 100s of GBs) that had stopped
> > compacting and should have been in the MBs. The largest ones took 45
> > minutes to compact, which spread out the rolling upgrade time
> > significantly. Also occasionally, even with a clean shutdown, there was
> > corruption detected on broker start and it took time for the repair -- a
> > /lot/ of time. In both cases it was easily seen in the logs, and in
> > significantly increased disk IO metrics on boot (and metrics for FD use
> > gradually returning to previous levels).
> >
>
> Good to know. I didn't see anything odd before/during/after the rolling
> upgrade in the usual instance-level metrics.
>
> > Was it all with the one broker, or across multiple? Did you follow the
> > rolling upgrade procedure? At what point in the rolling process did the
> > first issue appear?
> >
> > https://kafka.apache.org/10/documentation/#upgrade (that's for 1.0.x)
> >
>
> We have the software installed via Puppet, so it is not exactly according
> to the official guide, but I naturally read those first.
>
> Mostly updating the version variable in our Puppet config file (masterless)
> and applying it manually per instance. It works surprisingly well this way.
>
> We just got rid of one ancient 0.7 Kafka cluster, so overall I'm very happy
> with the newer versions. GJ, all contributors.
>
> Mika
>



-- 
Brett Rann
Senior DevOps Engineer

Zendesk International Ltd
395 Collins Street, Melbourne VIC 3000 Australia
Mobile: +61 (0) 418 826 017
