Hi Jonathan. Yes, I decreased the retention on all topics simultaneously. I realized my mistake later when I saw the cluster overloaded :)
I wasn't 100% sure so I looked it up, but it looks to me like log.cleaner.threads and log.cleaner.io.max.bytes.per.second only apply when a topic is using cleanup.policy=compact and not cleanup.policy=delete, right? All my topics were using cleanup.policy=delete.

I'm still using 0.11.0.2. I looked at the code of LogManager.deleteLogs yesterday, and my understanding was that file.delete.delay.ms didn't really do what it said in the documentation, since deleteLogs would delete everything queued. But it looks like the behavior of deleteLogs was changed in 1.1, and if I'm understanding the code correctly it might help in this situation.

On Thu, May 3, 2018, at 8:20 AM, Jonathan Bethune wrote:
> Howdy Vincent.
>
> Sounds like a painful situation! I have experienced similar drama with
> Kafka, so maybe I can offer some advice.
>
> You said you decreased the retention time on 4 topics. I wonder, was this
> done on all 4 topics at the same time?
>
> Depending on broker and partition config, that can be very painful. With
> Kafka you can configure log deletion settings at the topic level.
>
> In the future you should consider doing these sorts of changes one topic
> at a time unless there is some compelling reason to do them simultaneously.
>
> You also wrote that you saw a spike in CPU load and disk usage. There are
> a number of ways you can configure log cleanup so as to use less disk
> space and CPU.
>
> You can reduce retention.bytes or retention.ms to make Kafka run cleanups
> more frequently based on log size and time respectively. You can also
> directly throttle the log cleaner by setting log.cleaner.threads and
> log.cleaner.io.max.bytes.per.second.
>
> Check the documentation for all the relevant config:
> <https://kafka.apache.org/documentation/>
>
> Again, consider setting all of this at the topic level, especially if your
> topics are very different in terms of system resource usage.
>
> I hope that helps you out a bit. Good luck!
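For anyone following along, the topic-level override Jonathan describes is done with the kafka-configs.sh tool (on 0.11, topic config changes still go through ZooKeeper). The sketch below only assembles the command as a string rather than running it, since it needs a live cluster; the ZooKeeper address and topic name are placeholders:

```python
# Sketch: build the kafka-configs.sh invocation that sets a per-topic
# retention.ms override on Kafka 0.11. Host and topic names are
# placeholders, not from the thread above.

def retention_override_cmd(topic, retention_ms, zookeeper="localhost:2181"):
    """Return the shell command that overrides retention.ms for one topic."""
    return (
        "kafka-configs.sh --zookeeper {zk} --alter "
        "--entity-type topics --entity-name {t} "
        "--add-config retention.ms={ms}"
    ).format(zk=zookeeper, t=topic, ms=retention_ms)

one_month_ms = 30 * 24 * 60 * 60 * 1000  # 30 days in milliseconds
print(retention_override_cmd("events", one_month_ms))
```

Running the printed command against one topic at a time, rather than changing the broker-wide default, keeps the blast radius of each change small.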
> On 3 May 2018 at 03:52, Vincent Rischmann <vinc...@rischmann.fr> wrote:
>
> > Hi,
> >
> > I'm wondering if there is a way to tell Kafka to spread the log file
> > deletion when decreasing the retention time of a topic, and if not, if
> > it would make sense.
> >
> > I'm asking because this afternoon, after decreasing the retention time
> > from 2 months to 1 month on 4 of my topics, the whole cluster became
> > overloaded for approximately 15 minutes (every broker with 25+ load,
> > disk usage almost 100%), with leader reelection, under-replicated
> > partitions, and a bunch of consumers unable to make progress.
> >
> > The change removed 5 TiB of data across the 4 topics, and I didn't
> > check beforehand how it would affect disk I/O, so it's on me that this
> > happened. But seeing how much data was removed, I think it would make
> > sense to delete only a couple of segments at a time in order to not
> > overload the disks.
> >
> > Right now I can only be careful and plan the decrease in small steps,
> > but that's going to be a little tedious.
> >
> > How does everyone deal with this?
>
> --
> Jonathan Bethune - Senior Consultant