As Allen mentions, Docker is the big offender. I've added a cleanup job that runs `docker system prune -a -f` hourly. The problem nodes are the qnodes, which have much less disk space available to Docker than the other nodes. I'm disabling them for the time being, until I can either get bigger disks or otherwise guarantee they don't run out of space weekly.
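For reference, the hourly cleanup can be wired up with a cron entry along these lines (a sketch; the path, schedule, and user are illustrative, not the exact config I deployed):

```
# /etc/cron.d/docker-prune (illustrative path): hourly cleanup of unused
# Docker images, stopped containers, networks, and build cache on each node
0 * * * * jenkins /usr/bin/docker system prune -a -f
```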
On Thu, Oct 12, 2017 at 11:23 AM, Allen Wittenauer <[email protected]> wrote:

>> On Oct 12, 2017, at 8:34 AM, Robert Munteanu <[email protected]> wrote:
>> Jenkins slaves running out of disk space has been an issue for quite
>> some time. Not a major deal-breaker or very frequent, but it's still
>> annoying to chase issues, reconfigure slave labels, retrigger builds,
>> etc
>
> From what I’ve seen, the biggest issues are caused by broken docker
> jobs. I don’t think people realize that when their docker jobs fail, the
> disk space and container aren’t released. (Docker only automatically cleans
> up on *success*!) Apache Yetus has tools to deal with old docker bits on the
> system. As a result, on the ‘hadoop’ labeled machines (which have multiple
> projects using Yetus precommit in sentinel mode), I don’t think I’ve seen an
> out of space on those nodes in a very long time.
>
> Apache Yetus itself is configured to run on quite a few nodes. When
> the (rare) patch comes through that runs on a node that isn’t typically
> running Yetus, it isn’t unusual to see months worth of images eating space
> and containers still running. It will then wipe out a bunch of the excess.
> I should probably add df (and cpu time?) output to see how much it is
> reclaiming. In some cases I’ve seen, it’s easily in the high GB area.
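Allen's idea of adding `df` output to see how much a cleanup reclaims could look something like the sketch below. This is hypothetical, not Yetus's actual implementation; it compares free space on `/` before and after a prune, and skips the prune entirely if `docker` isn't installed on the node.

```shell
#!/bin/sh
# Sketch: measure disk space reclaimed by a docker cleanup.
before=$(df -Pk / | awk 'NR==2 {print $4}')   # free kilobytes before
# Only prune if docker is actually present on this node.
if command -v docker >/dev/null 2>&1; then
    docker system prune -a -f >/dev/null
fi
after=$(df -Pk / | awk 'NR==2 {print $4}')    # free kilobytes after
echo "Reclaimed: $(( (after - before) / 1024 )) MB"
```

On a node where Docker mounts its data on a separate filesystem (e.g. `/var/lib/docker`), that mount point would be the one to pass to `df` instead of `/`.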
