Re: Insanely long recovery time with Kafka 0.11.0.2

Vincent Rischmann Sat, 06 Jan 2018 10:26:21 -0800

Hi,

just to clarify: this is the cause of the crash https://pastebin.com/GuF60kvF 
in the broker logs, which is why I referenced 
https://issues.apache.org/jira/browse/KAFKA-4523


I had this crash some time ago and yesterday was in the process of upgrading my 
brokers to 0.11.0.2 in part to address this bug, unfortunately while stopping 
this particular broker it crashed.

What I don't understand is why the recovery time after upgrading was so high. A 
couple of month ago when a broker crashed due to this bug and recovered it 
didn't take nearly as long. In fact, I never had a recovery that long on any 
broker, even when they suffered a kernel panic or power failure.

We have quite a bit of data, however as I said with 0.10.1.1 it "only" took 
around 45 minutes. The broker did do a lot of I/O while recovering (to the 
point where even just running ls was painfully slow) but afair it was the same 
every time a broker recovered. Volume of data hasn't changed much since the 
last crash with 0.10.1.1, in fact I removed a lot of data recently.

Our brokers are running with 3 SATA disks in raid 0 (using mdadm), while 
recovering yesterday atop reported around 200MB/s of read throughput. 

Here are some graphs from our monitoring:

 - CPU usage https://vrischmann.me/files/fr3/cpu.png
 - Disk IO https://vrischmann.me/files/fr3/disk_io.png and 
https://vrischmann.me/files/fr3/disk_usage.png

On Sat, Jan 6, 2018, at 4:53 PM, Ted Yu wrote:
> Ismael:
> We're on the same page.
> 
> 0.11.0.2 was released on 17 Nov 2017.
> 
> By 'recently' in my previous email I meant the change was newer.
> 
> Vincent:
> Did the machine your broker ran on experience power issue ?
> 
> Cheers
> 
> On Sat, Jan 6, 2018 at 7:36 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> 
> > Hi Ted,
> >
> > The change you mention is not part of 0.11.0.2.
> >
> > Ismael
> >
> > On Sat, Jan 6, 2018 at 3:31 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > bq. WARN Found a corrupted index file due to requirement failed: Corrupt
> > > index found, index file
> > > (/data/kafka/data-processed-15/00000000000054942918.index)
> > >
> > > Can you search backward for 00000000000054942918.index in the log to see
> > if
> > > we can find the cause for corruption ?
> > >
> > > This part of code was recently changed by :
> > >
> > > KAFKA-6324; Change LogSegment.delete to deleteIfExists and harden log
> > > recovery
> > >
> > > Cheers
> > >
> > > On Sat, Jan 6, 2018 at 7:18 AM, Vincent Rischmann <vinc...@rischmann.fr>
> > > wrote:
> > >
> > > > Here's an excerpt just after the broker started:
> > > > https://pastebin.com/tZqze4Ya
> > > >
> > > > After more than 8 hours of recovery the broker finally started. I
> > haven't
> > > > read through all 8 hours of log but the parts I looked at are like the
> > > > pastebin.
> > > >
> > > > I'm not seeing much in the log cleaner logs either, they look normal.
> > We
> > > > have a couple of compacted topics but seems only the consumer offsets
> > is
> > > > ever compacted (the other topics don't have much traffic).
> > > >
> > > > On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote:
> > > > > What do the broker logs say its doing during all that time?
> > > > >
> > > > > There are some consumer offset / log cleaner bugs which caused us
> > > > similarly
> > > > > log delays. that was easily visible by watching the log cleaner
> > > activity
> > > > in
> > > > > the logs, and in our monitoring of partition sizes watching them go
> > > down,
> > > > > along with IO activity on the host for those files.
> > > > >
> > > > > On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann <
> > > vinc...@rischmann.fr>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix this
> > bug
> > > > > > https://issues.apache.org/jira/browse/KAFKA-4523
> > > > > > <https://issues.apache.org/jira/browse/KAFKA-4523>
> > > > > > Unfortunately while stopping one broker, it crashed exactly because
> > > of
> > > > > > this bug. No big deal usually, except after restarting Kafka in
> > > > 0.11.0.2
> > > > > > the recovery is taking a really long time.
> > > > > > I have around 6TB of data on that broker, and before when it
> > crashed
> > > it
> > > > > > usually took around 30 to 45 minutes to recover, but now I'm at
> > > almost
> > > > > > 5h since Kafka started and it's still not recovered.
> > > > > > I'm wondering what could have changed to have such a dramatic
> > effect
> > > on
> > > > > > recovery time ? Is there maybe something I can tweak to try to
> > reduce
> > > > > > the time ?
> > > > > > Thanks.
> > > > > >
> > > >
> > >
> >

Re: Insanely long recovery time with Kafka 0.11.0.2

Reply via email to