Hi All,

I wonder if any of you have seen this error in slurmdbd.log:

error: persistent connection experienced an error

When this error appears, jobs also fail with what look like accounting-related 
errors in Slurm, such as:

slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
srun: fatal: slurm_allocation_msg_thr_create: pthread_create error Resource temporarily unavailable

I haven't been able to figure out what puts slurmdbd into this condition. 
The Slurm controller (slurmctld) and slurmdbd are on the same box, which makes 
it all the more odd that slurmdbd can't communicate with slurmctld. While we 
figure this out, we have begun restarting slurmctld and slurmdbd every day to 
try to keep them "in sync".
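
One way to check whether the slurmdbd errors line up in time with the job 
failures would be to tally them by hour. Below is a minimal Python sketch, not 
something we have run in production. It assumes each log line begins with a 
bracketed ISO timestamp such as [2018-03-01T10:15:02.123]; the function names 
count_errors_by_hour and tally_file are made up for illustration.

```python
import re
from collections import Counter

# Error text exactly as it appears in slurmdbd.log in this thread.
ERROR_TEXT = "persistent connection experienced an error"

# Assumption: lines start with "[YYYY-MM-DDTHH:MM:SS.mmm]".
# Only the date and hour are captured, so errors are grouped per hour.
TIMESTAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2})")


def count_errors_by_hour(lines):
    """Return a Counter mapping 'YYYY-MM-DDTHH' to the number of matching errors."""
    counts = Counter()
    for line in lines:
        if ERROR_TEXT not in line:
            continue
        match = TIMESTAMP_RE.match(line)
        # Lines without a recognizable timestamp are grouped under "unknown".
        key = match.group(1) if match else "unknown"
        counts[key] += 1
    return counts


def tally_file(path):
    """Print one 'hour count' line per hour for the log file at the given path."""
    with open(path, errors="replace") as handle:
        counts = count_errors_by_hour(handle)
    for hour, count in sorted(counts.items()):
        print(hour, count)
```

Calling tally_file() with the path to slurmdbd.log (wherever LogFile in 
slurmdbd.conf points on your system) prints one line per hour, which can then 
be compared against the times of the failed jobs.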

Has anyone seen this? Any thoughts? Maybe the single port shown by:

sacctmgr list cluster

becomes overwhelmed at times? We have a range of ports on which the controller 
can be contacted. Maybe slurmdbd should try another port if that’s the issue?

SlurmctldPort=6900-6950

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 
