Hi,
On 64-core nodes running many short jobs, the number of calls to the release_memory agent (a symlink to release_common from the slurm 2.4.3 release) can be extremely high. It seems that the script is too slow for the memory cgroup: a few tens of thousands of agent processes are spawned shortly after job completion, and they stay alive for a long time. In extreme cases, the PID numbers can be exhausted, which prevents new processes from being spawned.

As a partial fix, I commented out the "sleep 1" in the sync part of the script. But there can still be up to a few thousand processes after 64 jobs complete at roughly the same time. Each job has about 10 processes, so the number of agent calls can be high.

I have not noticed this on nodes with fewer cores/jobs, and the problem is not present for the other cgroups.

Any advice on how to fix this problem?

Cheers,
Andrej

--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: [email protected]
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-425-7074
-------------------------------------------------------------
