Hi,

On 64-core nodes, when many short jobs are submitted, the number of calls
to the release_memory agent (a symlink to release_common from the slurm
2.4.3 release) can be extremely high. It seems that the script is too
slow for the memory cgroup, which results in a few tens of thousands of
agent processes being spawned shortly after job completion, and these
processes stay alive for a long time. In extreme cases, the pid numbers
can be exhausted, preventing new processes from being spawned. As a
partial fix, I commented out the "sleep 1" in the sync part of the
script, but there can still be up to a few thousand processes after 64
jobs complete at roughly the same time.
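
For anyone who wants to watch the pile-up on a node, here is a rough
sketch of how the agent processes could be counted and compared with the
pid limit. The function name count_agents is made up for illustration,
and the sketch assumes that the agent processes show "release_memory" in
their command line and that /proc is mounted. The count is approximate,
since processes come and go during the scan.

```shell
#!/bin/sh
# Count running processes whose command line contains a given string.
# The default pattern "release_memory" is the agent name from this report;
# count_agents itself is a hypothetical helper, not part of slurm.
count_agents() {
    pattern="${1:-release_memory}"
    count=0
    for f in /proc/[0-9]*/cmdline; do
        # cmdline is NUL-separated, so turn NULs into spaces before grepping.
        # Processes may exit during the scan, so read errors are ignored.
        if cat "$f" 2>/dev/null | tr '\0' ' ' | grep -q -- "$pattern"; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Print the current agent count next to the kernel pid limit.
echo "agents: $(count_agents)  pid_max: $(cat /proc/sys/kernel/pid_max 2>/dev/null)"
```

Running this in a loop right after a batch of jobs completes should show
how close the node gets to pid_max.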

Each job has about 10 processes, so the number of agent calls can be high.

I have not noticed this on nodes with fewer cores/jobs, and the
problem is not present for other cgroups.

Any advice on how to fix this problem?

Cheers,
Andrej

-- 
_____________________________________________________________
   prof. dr. Andrej Filipcic,   E-mail: [email protected]
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674    Fax: +386-1-425-7074
-------------------------------------------------------------
