Yes, we are in the process of upgrading to the new producers. But the
problem seems deeper than a compatibility issue: we have one environment
where the old producers work fine with the new 0.9 broker. Further, when
we reverted our messed-up 0.9 environment to 0.8.2.3, the problem with
those topics didn't go away.

Didn't see any ZK issues on the brokers. There were other topics on the
very same brokers that didn't seem to be affected.

On Thu, Dec 17, 2015 at 5:46 PM, Jun Rao <j...@confluent.io> wrote:

> Yes, the new java producer is available in 0.8.2.x and we recommend people
> use that.
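>
> (For anyone who wants a starting point, a minimal sketch of the new
> producer; the broker address, topic, and message are placeholders:)
>
>     import java.util.Properties;
>     import org.apache.kafka.clients.producer.KafkaProducer;
>     import org.apache.kafka.clients.producer.ProducerRecord;
>
>     public class NewProducerSketch {
>         public static void main(String[] args) {
>             Properties props = new Properties();
>             props.put("bootstrap.servers", "broker1:9092"); // placeholder
>             props.put("key.serializer",
>                 "org.apache.kafka.common.serialization.StringSerializer");
>             props.put("value.serializer",
>                 "org.apache.kafka.common.serialization.StringSerializer");
>             KafkaProducer<String, String> producer = new KafkaProducer<>(props);
>             // send() is asynchronous; close() flushes any pending records
>             producer.send(new ProducerRecord<>("my-topic", "key", "value"));
>             producer.close();
>         }
>     }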
>
> Also, when those producers had the issue, was there anything else unusual
> on the broker (e.g., did the broker's ZK session expire)?
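>
> (For anyone checking the same thing, one way to spot session expirations
> in the broker log; the log path is a placeholder and the exact message
> text may vary by version:)
>
>     grep -i 'state changed (Expired)' /var/log/kafka/server.log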
>
> Thanks,
>
> Jun
>
> On Thu, Dec 17, 2015 at 2:37 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
>
> > I can't think of anything special about the topics besides the clients
> > being very old (Java wrappers over Scala).
> >
> > I do think it was using ack=0. But my guess is that the logging was done
> > by the Kafka producer thread. My application itself was not getting
> > exceptions from Kafka.
> >
> > On Thu, Dec 17, 2015 at 2:31 PM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Hmm, anything special with those 3 topics? Also, the broker log shows
> > > that the producer uses ack=0, which means the producer shouldn't get
> > > errors like leader not found. Could you clarify the ack setting used
> > > by the producer?
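> > >
> > > (For reference, a hedged sketch of where the ack mode lives on the old
> > > producer; the broker list and topic are placeholders. With
> > > request.required.acks=0 the producer never waits for a broker
> > > response, so send() can't surface leader errors directly:)
> > >
> > >     import java.util.Properties;
> > >     import kafka.javaapi.producer.Producer;
> > >     import kafka.producer.KeyedMessage;
> > >     import kafka.producer.ProducerConfig;
> > >
> > >     public class OldProducerAckSketch {
> > >         public static void main(String[] args) {
> > >             Properties props = new Properties();
> > >             props.put("metadata.broker.list", "broker1:9092"); // placeholder
> > >             props.put("serializer.class", "kafka.serializer.StringEncoder");
> > >             props.put("request.required.acks", "0"); // fire-and-forget
> > >             Producer<String, String> producer =
> > >                 new Producer<String, String>(new ProducerConfig(props));
> > >             producer.send(new KeyedMessage<String, String>("my-topic", "key", "value"));
> > >             producer.close();
> > >         }
> > >     }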
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Dec 17, 2015 at 12:41 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > >
> > > > The topic which stopped working had clients that were only using the
> > > > old Java producer that is a wrapper over the Scala producer. Again,
> > > > it seemed to work perfectly in another of our realms where we have
> > > > the same topics, same producers/consumers, etc., but with less
> > > > traffic.
> > > >
> > > > On Thu, Dec 17, 2015 at 12:23 PM, Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Are you using the new java producer?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Dec 17, 2015 at 9:58 AM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > > Answers inline:
> > > > > >
> > > > > > On Thu, Dec 17, 2015 at 9:41 AM, Jun Rao <j...@confluent.io> wrote:
> > > > > >
> > > > > > > Rajiv,
> > > > > > >
> > > > > > > Thanks for reporting this.
> > > > > > >
> > > > > > > 1. How did you verify that 3 of the topics are corrupted? Did
> > > > > > > you use the DumpLogSegments tool? Also, is there a simple way
> > > > > > > to reproduce the corruption?
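> > > > > > >
> > > > > > > (For reference, a sketch of running that tool against one
> > > > > > > segment; the log directory and file name are placeholders:)
> > > > > > >
> > > > > > >     bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
> > > > > > >         --files /var/kafka-logs/my-topic-0/00000000000000000000.log \
> > > > > > >         --deep-iteration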
> > > > > > >
> > > > > > No, I did not. The only reason I had to believe that was that no
> > > > > > writers could write to the topic. I actually have no idea what
> > > > > > the problem was. I saw very frequent (much more than usual)
> > > > > > messages of the form:
> > > > > >
> > > > > >     INFO  [kafka-request-handler-2] [kafka.server.KafkaApis]:
> > > > > >     [KafkaApi-6] Close connection due to error handling produce
> > > > > >     request with correlation id 294218 from client id  with ack=0
> > > > > >
> > > > > > and also messages of the form:
> > > > > >
> > > > > >     INFO  [kafka-network-thread-9092-0] [kafka.network.Processor]:
> > > > > >     Closing socket connection to /some ip
> > > > > >
> > > > > > The cluster was actually a critical one, so I had no recourse but
> > > > > > to revert the change (which, like I noted, didn't fix things). I
> > > > > > didn't have enough time to debug further. The only way I could
> > > > > > fix it with my limited Kafka knowledge was (after reverting)
> > > > > > deleting the topic and recreating it.
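> > > > > >
> > > > > > (For the record, roughly the commands that entailed, assuming
> > > > > > delete.topic.enable=true on the brokers; the ZK address, topic
> > > > > > name, and partition/replica counts are placeholders:)
> > > > > >
> > > > > >     bin/kafka-topics.sh --zookeeper zk1:2181 --delete --topic my-topic
> > > > > >     bin/kafka-topics.sh --zookeeper zk1:2181 --create --topic my-topic \
> > > > > >         --partitions 8 --replication-factor 3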
> > > > > >
> > > > > > I had updated a low-priority cluster before this that worked just
> > > > > > fine. That gave me the confidence to upgrade this higher-priority
> > > > > > cluster, which did NOT work out. So the only way for me to try to
> > > > > > reproduce it is to try this on our larger clusters again. But it
> > > > > > is critical that we don't mess up this high-priority cluster, so
> > > > > > I am afraid to try again.
> > > > > >
> > > > > > > 2. As Lance mentioned, if you are using snappy, make sure that
> > > > > > > you include the right snappy jar (1.1.1.7).
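> > > > > > >
> > > > > > > (One quick sanity check, assuming the stock distribution
> > > > > > > layout; the libs path is a placeholder:)
> > > > > > >
> > > > > > >     ls libs/ | grep snappy   # expect snappy-java-1.1.1.7.jar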
> > > > > > >
> > > > > > Wonder why I don't see Lance's email in this thread. Either way,
> > > > > > we are not using compression of any kind on this topic.
> > > > > >
> > > > > > > 3. For the CPU issue, could you do a bit of profiling to see
> > > > > > > which thread is busy and where it's spending time?
> > > > > > >
> > > > > > Since I had to revert I didn't have the time to profile.
> > > > > > Intuitively it would seem like the high number of client
> > > > > > disconnects/errors and the increased network usage probably has
> > > > > > something to do with the high CPU (total guess). Again, our other
> > > > > > (lower-traffic) cluster that was upgraded was totally fine, so it
> > > > > > doesn't seem like it happens all the time.
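> > > > > >
> > > > > > (For next time, a lightweight way to see which broker threads
> > > > > > are hot without a full profiler; the pid and thread id are
> > > > > > placeholders. top -H shows per-thread CPU, and a busy thread id,
> > > > > > converted to hex, can be matched against a jstack dump:)
> > > > > >
> > > > > >     top -H -p 12345                # per-thread CPU for the broker JVM
> > > > > >     printf '%x\n' 12367            # busy thread id -> hex (0x304f)
> > > > > >     jstack 12345 | grep -A 20 'nid=0x304f'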
> > > > > >
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Dec 15, 2015 at 12:52 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > > > >
> > > > > > > > We had to revert to 0.8.2.3 because three of our topics seem
> > > > > > > > to have gotten corrupted during the upgrade. As soon as we
> > > > > > > > did the upgrade, producers to the three topics I mentioned
> > > > > > > > stopped being able to do writes. The clients complained
> > > > > > > > (occasionally) about leader-not-found exceptions. We
> > > > > > > > restarted our clients and brokers but that didn't seem to
> > > > > > > > help. Actually, even after reverting to 0.8.2.3 these three
> > > > > > > > topics were broken. To fix it we had to stop all clients,
> > > > > > > > delete the topics, create them again, and then restart the
> > > > > > > > clients.
> > > > > > > >
> > > > > > > > I realize this is not a lot of info. I couldn't wait to get
> > > > > > > > more debug info because the cluster was actually being used.
> > > > > > > > Has anyone run into something like this? Are there any known
> > > > > > > > issues with old consumers/producers? The topics that got
> > > > > > > > busted had clients writing to them using the old Java wrapper
> > > > > > > > over the Scala producer.
> > > > > > > >
> > > > > > > > Here are the steps I took to upgrade.
> > > > > > > >
> > > > > > > > For each broker:
> > > > > > > >
> > > > > > > > 1. Stop the broker.
> > > > > > > > 2. Restart with the 0.9 broker running with
> > > > > > > >    inter.broker.protocol.version=0.8.2.X.
> > > > > > > > 3. Wait for under-replicated partitions to go down to 0.
> > > > > > > > 4. Go to step 1.
> > > > > > > >
> > > > > > > > Once all the brokers were running the 0.9 code with
> > > > > > > > inter.broker.protocol.version=0.8.2.X, we restarted them one
> > > > > > > > by one with inter.broker.protocol.version=0.9.0.0 (a sketch
> > > > > > > > of the config for each phase is below).
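> > > > > > > >
> > > > > > > > (The relevant server.properties line for each phase; here
> > > > > > > > "0.8.2.X" stands for the concrete patch version in use:)
> > > > > > > >
> > > > > > > >     # Phase 1: 0.9 broker code, still speaking the old protocol
> > > > > > > >     inter.broker.protocol.version=0.8.2.X
> > > > > > > >
> > > > > > > >     # Phase 2: after all brokers run 0.9 code, bump the
> > > > > > > >     # protocol and restart one by one
> > > > > > > >     inter.broker.protocol.version=0.9.0.0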
> > > > > > > >
> > > > > > > > When reverting I did the following.
> > > > > > > >
> > > > > > > > For each broker:
> > > > > > > >
> > > > > > > > 1. Stop the broker.
> > > > > > > > 2. Restart with the 0.9 broker running with
> > > > > > > >    inter.broker.protocol.version=0.8.2.X.
> > > > > > > > 3. Wait for under-replicated partitions to go down to 0.
> > > > > > > > 4. Go to step 1.
> > > > > > > >
> > > > > > > > Once all the brokers were running 0.9 code with
> > > > > > > > inter.broker.protocol.version=0.8.2.X, I restarted them one
> > > > > > > > by one with the 0.8.2.3 broker code. This, however, like I
> > > > > > > > mentioned, did not fix the three broken topics.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 3:13 PM, Rajiv Kurian <ra...@signalfx.com> wrote:
> > > > > > > >
> > > > > > > > > Now that it has been a bit longer, the spikes I was seeing
> > > > > > > > > are gone, but the CPU and network in/out on the three
> > > > > > > > > brokers that were showing the spikes are still much higher
> > > > > > > > > than before the upgrade. Their CPUs have increased from
> > > > > > > > > around 1-2% to 12-20%. The network in on the same brokers
> > > > > > > > > has gone up from under 2 Mb/sec to 19-33 Mb/sec. The network
> > > > > > > > > out has gone up from under 2 Mb/sec to 29-42 Mb/sec. I don't
> > > > > > > > > see a corresponding increase in the Kafka
> > > > > > > > > messages-in-per-second or bytes-in-per-second JMX metrics.
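> > > > > > > > >
> > > > > > > > > (In case it's useful, one way to sample those metrics from
> > > > > > > > > the command line with the JmxTool class that ships with
> > > > > > > > > Kafka; this assumes remote JMX is enabled on the broker,
> > > > > > > > > and the host/port are placeholders:)
> > > > > > > > >
> > > > > > > > >     bin/kafka-run-class.sh kafka.tools.JmxTool \
> > > > > > > > >         --object-name 'kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec' \
> > > > > > > > >         --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi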
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Rajiv
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
