[ 
https://issues.apache.org/jira/browse/KAFKA-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-1063:
-----------------------------------
    Fix Version/s:     (was: 0.10.1.0)
                   0.10.2.0

> run log cleanup at startup
> --------------------------
>
>                 Key: KAFKA-1063
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1063
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.0
>            Reporter: paul mackles
>            Assignee: Neha Narkhede
>            Priority: Minor
>             Fix For: 0.10.2.0
>
>
> Jun suggested I file this ticket to have the brokers run log cleanup at 
> startup. Here is the scenario that precipitated it:
> We ran into a situation on our dev cluster (3 nodes, v0.8) where we ran out 
> of disk on one of the nodes. As expected, the broker shut itself down and 
> all of the clients switched over to the other nodes. So far so good. 
> To free up disk space, I reduced log.retention.hours to something more 
> manageable (from 172 to 12). I did this on all 3 nodes. Since the other 2 
> nodes were running OK, I first tried to restart the node that had run out of 
> disk. Unfortunately, it kept shutting itself down due to the full disk. From 
> the logs, I think this was because it was trying to sync up the replicas it 
> was responsible for and of course couldn't due to the lack of disk space. My 
> hope was that upon restart, it would see the new retention settings and free 
> up a bunch of disk space before trying to do any syncs.
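> For reference, a minimal sketch of the change described above, assuming the 
> broker settings live in config/server.properties (the file location and the 
> comments are illustrative, not part of the original report):
>
>     # config/server.properties on each broker
>     # previous value: log.retention.hours=172
>     log.retention.hours=12
>
> On the full-disk node, though, the broker shut down again before any cleanup 
> could run, which is what prompted this request to run cleanup during startup.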
> I then went and restarted the other 2 nodes. They both picked up the new 
> retention settings and freed up a bunch of storage as a result. I then went 
> back and tried to restart the 3rd node, but to no avail. It still had 
> problems with the full disk.
> I thought about trying to reassign partitions so that the node in question 
> had less to manage, but that turned out to be a hassle, so I wound up 
> manually deleting some of the old log/segment files. The broker seemed to 
> come back fine after that, but that's not something I would want to do on a 
> production server.
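> To make that manual workaround concrete, a rough sketch (hypothetical data 
> directory; assumes the broker is stopped and that log.dirs points at 
> /var/kafka-logs; review the file list before deleting anything):
>
>     # list segment files older than 12 hours under the data directory
>     find /var/kafka-logs -name '*.log' -mmin +720 -print
>     # remove the old segments along with their index files
>     find /var/kafka-logs \( -name '*.log' -o -name '*.index' \) -mmin +720 -delete
>
> A cleanup pass at broker startup would make this kind of hand-pruning 
> unnecessary.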
> We obviously need better monitoring/alerting to avoid this situation 
> altogether, but I am wondering if the order of operations at startup 
> could/should be changed to better account for scenarios like this. Or maybe a 
> utility to remove old logs after changing the TTL? Did I miss a better way to 
> handle this?
> Original email thread is here:
> http://mail-archives.apache.org/mod_mbox/kafka-users/201309.mbox/%3cce6365ae.82d66%25pmack...@adobe.com%3e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
