What do the broker logs say it's doing during all that time? There are some consumer offset / log cleaner bugs which caused us similarly long delays. That was easily visible by watching the log cleaner activity in the logs and, in our monitoring, by watching partition sizes go down along with IO activity on the host for those files.
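If you don't have partition size monitoring in place, here is a rough sketch of the kind of check I mean. It assumes each partition is a subdirectory of the broker's data directory (the usual on-disk layout); the function names and any path you pass in are illustrative, not part of any Kafka tooling.

```python
import os


def partition_sizes(log_dir):
    """Return {partition_dir_name: total_bytes} for each subdirectory of log_dir.

    Assumption: log_dir is one of the broker's data directories, where every
    partition lives in its own "<topic>-<partition>" subdirectory.
    """
    sizes = {}
    for entry in os.scandir(log_dir):
        if not entry.is_dir():
            continue  # skip top-level files such as checkpoints
        total = 0
        for f in os.scandir(entry.path):
            if f.is_file():
                try:
                    total += f.stat().st_size
                except FileNotFoundError:
                    # a segment can be deleted while we are scanning
                    pass
        sizes[entry.name] = total
    return sizes


def shrinking(before, after):
    """Return {partition_dir_name: bytes_freed} for partitions that got smaller."""
    return {
        name: before[name] - after[name]
        for name in before
        if name in after and after[name] < before[name]
    }
```

Take one snapshot with partition_sizes(), wait a minute or so, take another, and pass both to shrinking(). If the __consumer_offsets partitions keep showing up there, the log cleaner is most likely what the broker is busy with.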

On Sat, Jan 6, 2018 at 7:48 AM, Vincent Rischmann <vinc...@rischmann.fr> wrote:
> Hello,
>
> so I'm upgrading my brokers from 0.10.1.1 to 0.11.0.2 to fix this bug
> https://issues.apache.org/jira/browse/KAFKA-4523
> Unfortunately while stopping one broker, it crashed exactly because of
> this bug. No big deal usually, except after restarting Kafka in 0.11.0.2
> the recovery is taking a really long time.
> I have around 6TB of data on that broker, and before when it crashed it
> usually took around 30 to 45 minutes to recover, but now I'm at almost
> 5h since Kafka started and it's still not recovered.
> I'm wondering what could have changed to have such a dramatic effect on
> recovery time ? Is there maybe something I can tweak to try to reduce
> the time ?
> Thanks.