Hi all, a few days ago I set up two buildfarm animals, addax and mite, running the tests with CLOBBER_CACHE_RECURSIVELY. As the tests run very long, reporting the results back to the server fails because of a safeguard limit on the buildfarm server. Anyway, that's being discussed in a different thread - I mention it here merely as a 'don't bother looking for addax or mite on the buildfarm website' warning.
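(For anyone who hasn't looked at it: CLOBBER_CACHE_RECURSIVELY is enabled by adding -DCLOBBER_CACHE_RECURSIVELY to CPPFLAGS at build time, and the relevant bit lives in AcceptInvalidationMessages() in src/backend/utils/cache/inval.c - roughly this, quoting from memory, so check the actual source:

#if defined(CLOBBER_CACHE_ALWAYS)
	{
		/* flush the caches on every invalidation check,
		 * but guard against re-entry */
		static bool in_recursion = false;

		if (!in_recursion)
		{
			in_recursion = true;
			InvalidateSystemCaches();
			in_recursion = false;
		}
	}
#elif defined(CLOBBER_CACHE_RECURSIVELY)
	/* same, but without the recursion guard - each flush
	 * can trigger further flushes */
	InvalidateSystemCaches();
#endif

i.e. the recursive variant deliberately lets every cache flush trigger more flushes, which is the whole point of the test mode and also why it's so much slower than CLOBBER_CACHE_ALWAYS.)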
I've been checking the progress of the recursive tests today, and I found that it actually failed in the 'make check' step. The logs are available here:

  buildfarm logs: http://www.fuzzy.cz/tmp/buildfarm/recursive-oom.tgz
  kernel logs:    http://www.fuzzy.cz/tmp/buildfarm/messages

The tests are running within a LXC container (operated through libvirt), so whenever I say 'VM' I actually mean a LXC container. It might be some VM/LXC misconfiguration, but as this happens only on a single VM (the one running the tests with recursive clobber), I find that unlikely.

================== An example of the failure ==================

parallel group (20 tests):  pg_lsn regproc oid name char money float4 txid text int2 varchar int4 float8 boolean int8 uuid rangetypes bit numeric enum
...
     float4               ... ok
     float8               ... ok
     bit                  ... FAILED (test process exited with exit code 2)
     numeric              ... FAILED (test process exited with exit code 2)
     txid                 ... ok
...

===============================================================

and then of course the usual 'terminating connection because of crash of another server process' warning.

Apparently it's getting killed by the OOM killer, because it exhausts all the memory assigned to that VM (2GB):

May 15 19:44:53 cspug kernel: postgres invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
May 15 19:44:53 cspug kernel: postgres cpuset=recursive-builds mems_allowed=0
May 15 19:44:53 cspug kernel: Pid: 17159, comm: postgres Not tainted 2.6.32-431.17.1.el6.centos.plus.x86_64 #1

AFAIK 2GB is more than enough for a buildfarm machine (after all, chipmunk has just 512MB). Also, this only happens on this one VM (cpuset=recursive-builds); the other two VMs, with exactly the same limits, running other buildfarm animals (regular or with CLOBBER_CACHE_ALWAYS), are perfectly happy - see magpie or markhor, for example. And I don't see any reason why a build with recursive clobber should require more memory than a regular build, so this looks like a memory leak somewhere in the cache invalidation code.

I thought it might have been fixed by commit b23b0f5588 (Code review for recent changes in relcache.c), but mite is currently testing 7894ac5004, and it has already failed on OOM anyway.

The failures apparently happen within a few hours of the test start. For example on addax (gcc) the build started at 02:50 and the first OOM failure happened at 05:19; on mite (clang) it's 03:20 vs. 06:50. So it's roughly 3-4 hours after the tests start.

regards
Tomas
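PS: If anyone wants to see where the memory actually goes before the OOM killer fires, one option is to attach gdb to a bloated backend and dump the memory context stats - a rough sketch (the pid here is just the one from the kernel log above; any backend with growing RSS will do):

  $ gdb -p 17159
  (gdb) call MemoryContextStats(TopMemoryContext)
  (gdb) detach

The stats are written to the backend's stderr, so they end up in the server log; whichever context keeps growing between dumps (my uneducated guess would be CacheMemoryContext) should point at the leak.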