Here's an excerpt just after the broker started: https://pastebin.com/tZqze4Ya
After more than 8 hours of recovery the broker finally started. I haven't read through all 8 hours of logs, but the parts I looked at look like the pastebin.

I'm not seeing much in the log cleaner logs either; they look normal. We have a couple of compacted topics, but it seems only the consumer offsets topic is ever compacted (the other topics don't have much traffic).

On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote:
> What do the broker logs say it's doing during all that time?
>
> There are some consumer offset / log cleaner bugs which caused us similarly
> long delays. That was easily visible by watching the log cleaner activity in
> the logs, and in our monitoring of partition sizes watching them go down,
> along with IO activity on the host for those files.
>
> On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann <vinc...@rischmann.fr>
> wrote:
>
> > Hello,
> >
> > so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix this bug
> > https://issues.apache.org/jira/browse/KAFKA-4523
> > Unfortunately while stopping one broker, it crashed exactly because of
> > this bug. No big deal usually, except after restarting Kafka in 0.11.0.2
> > the recovery is taking a really long time.
> > I have around 6TB of data on that broker, and before when it crashed it
> > usually took around 30 to 45 minutes to recover, but now I'm at almost
> > 5h since Kafka started and it's still not recovered.
> > I'm wondering what could have changed to have such a dramatic effect on
> > recovery time? Is there maybe something I can tweak to try to reduce
> > the time?
> > Thanks.