Hi all,

I wonder if any of you have seen these errors in slurmdbd.log:

  error: persistent connection experienced an error

When we see these errors, jobs fail with what look like accounting-related errors in Slurm, such as:

  slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
  slurmstepd: error: _prec_extra: Could not find task_cpuacct_cg, this should never happen
  srun: fatal: slurm_allocation_msg_thr_create: pthread_create error Resource temporarily unavailable

I haven't been able to figure out what puts slurmdbd into this condition. slurmctld and slurmdbd run on the same box, which makes it all the more odd that slurmdbd can't communicate with slurmctld. While we figure this out, we have begun restarting slurmctld and slurmdbd every day to try to keep them "in sync".

Has anyone seen this? Any thoughts?

Maybe the one port shown by:

  sacctmgr list cluster

becomes overwhelmed at times? We have a range of ports on which the controller can be contacted. Maybe the db should try another port if that's the issue?

  SlurmctldPort=6900-6950

Best,
Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167