After we updated SLURM from 14.11.6 to 14.11.7, we keep getting the following
message every minute:
error: slurm_receive_msg: Zero Bytes were transmitted or received
even though tasks are executed correctly, nodes and SLURM utilities work
just fine, and everything gets written into the DB. We've tried to follow
advice from the SLURM devel list as well as our own fixes, but nothing has
worked. We tried:
- full server re-start with slurm service and munge
- time is identical on nodes and controller
- cleaned statesavepath
- before the update we didn't have this error, but even rolling back to
the previous version didn't help - we have the same error there too
- tried to have multiple ports: SlurmctldPort=6810-6817
- set path to SLURM libraries and executed ldconfig
- searched the group
- right now we have only 1 SLURM version (14.11.7). Old one is completely
removed.
The only thing we haven't tried yet is to re-create the SLURM cluster with
sacctmgr add cluster ...
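
In case it helps anyone reproduce this: a minimal sketch of how the one-minute cadence could be confirmed from the slurmctld log. A fixed interval might help narrow down what triggers the message. It assumes each log line starts with a bracketed timestamp such as [2015-05-20T10:15:01.123]; the function name and the sample lines are made up for illustration and are not taken from our real log.

```python
import re
from datetime import datetime

ERROR_TEXT = "slurm_receive_msg: Zero Bytes were transmitted or received"

# Assumed log prefix: [YYYY-MM-DDTHH:MM:SS.mmm]; adjust if your log differs.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?\]")


def error_intervals(lines):
    """Return the seconds between consecutive occurrences of ERROR_TEXT."""
    times = []
    for line in lines:
        if ERROR_TEXT not in line:
            continue
        match = STAMP.match(line)
        if match:
            # Fractional seconds are dropped; whole seconds are enough here.
            times.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Hypothetical sample lines, for illustration only.
sample = [
    "[2015-05-20T10:15:01.123] error: slurm_receive_msg: Zero Bytes were transmitted or received",
    "[2015-05-20T10:15:30.000] debug: some unrelated line",
    "[2015-05-20T10:16:01.456] error: slurm_receive_msg: Zero Bytes were transmitted or received",
    "[2015-05-20T10:17:01.002] error: slurm_receive_msg: Zero Bytes were transmitted or received",
]
print(error_intervals(sample))
```

On a real system the lines would come from the file named by SlurmctldLogFile in slurm.conf, e.g. by passing an open file object to error_intervals().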

What might be causing the error above?
One more thing: during SLURM 14.11.7 compilation, it could not always find
the munge libraries, so I had to manually compile them several times.
