[ https://issues.apache.org/jira/browse/KAFKA-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Rao updated KAFKA-1063:
---------------------------
Fix Version/s: 0.9.0 (was: 0.8.1)
> run log cleanup at startup
> --------------------------
>
> Key: KAFKA-1063
> URL: https://issues.apache.org/jira/browse/KAFKA-1063
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.8.0
> Reporter: paul mackles
> Assignee: Neha Narkhede
> Priority: Minor
> Fix For: 0.9.0
>
>
> Jun suggested I file this ticket to have the brokers run log cleanup at
> startup. Here is the scenario that precipitated it:
> We ran into a situation on our dev cluster (3 nodes, v0.8) where we ran out
> of disk on one of the nodes. As expected, the broker shut itself down and
> all of the clients switched over to the other nodes. So far so good.
> To free up disk space, I reduced log.retention.hours to something more
> manageable (from 172 to 12). I did this on all 3 nodes. Since the other 2
> nodes were running OK, I first tried to restart the node that had run out of
> disk. Unfortunately, it kept shutting itself down due to the full disk. From
> the logs, I think this was because it was trying to sync up the replicas it
> was responsible for and of course couldn't due to the lack of disk space. My
> hope was that upon restart, it would see the new retention settings and free
> up a bunch of disk space before trying to do any syncs.
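> For reference, the retention change on each broker was just this edit to
> server.properties (the standard broker config file), followed by a restart:
>
>     # was log.retention.hours=172
>     log.retention.hours=12
>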
> I then went and restarted the other 2 nodes. They both picked up the new
> retention settings and freed up a bunch of storage as a result. I then went
> back and tried to restart the 3rd node but to no avail. It still had problems
> with the full disks.
> I thought about trying to reassign partitions so that the node in question
> had less to manage, but that turned out to be a hassle, so I wound up manually
> deleting some of the old log/segment files. The broker seemed to come back
> fine after that, but that's not something I would want to do on a production
> server.
> We obviously need better monitoring/alerting to avoid this situation
> altogether, but I am wondering if the order of operations at startup
> could/should be changed to better account for scenarios like this. Or maybe a
> utility to remove old logs after changing the TTL? Did I miss a better way to
> handle this?
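> To make the suggestion concrete, here is a minimal, self-contained Scala
> sketch of the idea (illustrative only, not actual Kafka code; the directory
> path and names below are assumptions): expire segments older than the
> configured retention before the broker starts its replica fetchers or serves
> clients.
>
>     import java.io.File
>
>     object StartupRetentionSketch {
>       // mirrors log.retention.hours (12 after the change described above)
>       val retentionHours = 12
>
>       // Delete *.log segment files whose last-modified time is past the
>       // retention cutoff, roughly how time-based retention picks segments.
>       def cleanupLogDir(logDir: File): Unit = {
>         val cutoffMs = System.currentTimeMillis() - retentionHours * 60L * 60L * 1000L
>         val segments = Option(logDir.listFiles()).getOrElse(Array.empty[File])
>         for (seg <- segments if seg.getName.endsWith(".log") && seg.lastModified() < cutoffMs) {
>           println("deleting expired segment " + seg.getPath)
>           seg.delete()
>         }
>       }
>
>       def main(args: Array[String]): Unit = {
>         // 1. Reclaim space from expired segments first ...
>         cleanupLogDir(new File("/tmp/kafka-logs/my-topic-0"))
>         // 2. ... and only then start replica fetchers / accept client traffic.
>       }
>     }
>
> In the scenario above, that ordering would have let the full node reclaim
> space under the lowered retention before it tried to catch up its replicas.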
> Original email thread is here:
> http://mail-archives.apache.org/mod_mbox/kafka-users/201309.mbox/%3cce6365ae.82d66%[email protected]%3e
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)