This is “normal” as far as I know. We’ve seen this behavior after unclean shutdowns of 0.10.1.1.
In the event of an unclean shutdown, Kafka seems to have to rebuild some indexes, and for large data directories this takes some time. We got bit by this a few times recently when we had boxes that powered off unexpectedly, which resulted in 2 hours of rebuilding indexes before the brokers returned to a healthy state.

> On Jan 6, 2018, at 10:18 AM, Vincent Rischmann <vinc...@rischmann.fr> wrote:
>
> Here's an excerpt just after the broker started: https://pastebin.com/tZqze4Ya
>
> After more than 8 hours of recovery the broker finally started. I haven't
> read through all 8 hours of log but the parts I looked at are like the
> pastebin.
>
> I'm not seeing much in the log cleaner logs either, they look normal. We have
> a couple of compacted topics but it seems only the consumer offsets topic is
> ever compacted (the other topics don't have much traffic).
>
> On Sat, Jan 6, 2018, at 12:02 AM, Brett Rann wrote:
>> What do the broker logs say it's doing during all that time?
>>
>> There are some consumer offset / log cleaner bugs which caused us similarly
>> long delays. That was easily visible by watching the log cleaner activity in
>> the logs, and in our monitoring of partition sizes watching them go down,
>> along with IO activity on the host for those files.
>>
>> On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann <vinc...@rischmann.fr>
>> wrote:
>>
>>> Hello,
>>>
>>> so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix this bug
>>> https://issues.apache.org/jira/browse/KAFKA-4523
>>> Unfortunately while stopping one broker, it crashed exactly because of
>>> this bug. No big deal usually, except after restarting Kafka in 0.11.0.2
>>> the recovery is taking a really long time.
>>> I have around 6TB of data on that broker, and before when it crashed it
>>> usually took around 30 to 45 minutes to recover, but now I'm at almost
>>> 5h since Kafka started and it's still not recovered.
>>> I'm wondering what could have changed to have such a dramatic effect on
>>> recovery time? Is there maybe something I can tweak to try to reduce
>>> the time?
>>> Thanks.
>>>
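
Since recovery time in this thread seems to scale with the size of the data directory, a rough way to gauge how much a broker might have to scan is to count the segment and index files on disk. The following is only a sketch under stated assumptions: `summarize_log_dir` is a hypothetical helper name, it assumes the usual Kafka on-disk suffixes (`.log`, `.index`, `.timeindex`), and it does not reproduce what the broker actually does during recovery.

```python
import os
from collections import Counter

# Suffixes Kafka uses for log segments and their offset/time indexes.
SEGMENT_SUFFIXES = (".log", ".index", ".timeindex")


def summarize_log_dir(log_dir):
    """Count segment/index files and sum their sizes (bytes) under log_dir.

    Hypothetical helper for illustration; it only reads file metadata.
    """
    counts = Counter()
    sizes = Counter()
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            suffix = os.path.splitext(name)[1]
            if suffix in SEGMENT_SUFFIXES:
                counts[suffix] += 1
                sizes[suffix] += os.path.getsize(os.path.join(root, name))
    return counts, sizes
```

Calling `summarize_log_dir` on each directory listed in the broker's `log.dirs` gives per-suffix file counts and byte totals, which can be compared across brokers or before and after an upgrade. Separately, the broker setting `num.recovery.threads.per.data.dir` (default 1) controls how many threads per data directory are used for log recovery at startup; whether raising it would have helped in the cases described above is not established in this thread.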