Jose,

Do all the nodes have access to a shared /usr/lib64/slurm, or does each have its own copy? And is there a file in that directory (on each machine) called select_cons_res.so?
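To check that quickly on each machine, something like the following would do (a minimal sketch; `has_cons_res` is just an illustrative helper name, and the path is the plugin directory mentioned in this thread):

```python
import os

def has_cons_res(plugin_dir="/usr/lib64/slurm"):
    """Return True if the cons_res select plugin file is present in plugin_dir."""
    return os.path.isfile(os.path.join(plugin_dir, "select_cons_res.so"))

if __name__ == "__main__":
    # On a node with the plugin installed this prints True.
    print(has_cons_res())
```

You could run it over all nodes with pdsh or similar; if any node prints False, that node cannot load select/cons_res.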
Also, when changing slurm.conf, here's a quick and easy workflow:

1. Change slurm.conf.
2. Deploy it to all machines in the cluster (I use ansible, but puppet, satellite, clusterssh, pssh, pdsh, etc. are also good here).
3. On the head node: restart slurmctld.
4. On the head node: run "scontrol reconfigure".

That's it. No need to reboot any nodes or even log in to them.

cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this way."
- Grace Hopper

On 5 October 2016 at 07:25, Jose Antonio <joseantonio.berna...@um.es> wrote:
> Hi Manuel,
>
> Thanks for replying. Yes, I have checked slurm.conf; it is the same on the
> server and the compute nodes. I restarted the slurmd daemon on the compute
> nodes and then restarted the slurmctld service on the server. I rebooted
> the machines too, but it keeps showing the same error message on the
> console ("Zero Bytes were...") and in the log files.
>
> I have also set PluginDir=/usr/lib64/slurm in case it could not find the
> plugins, but that does not work either. All the partitions are active
> (idle); they did not turn to the down or drained state.
>
> Regards,
>
> Jose
>
>
> On 04/10/2016 at 20:28, Manuel Rodríguez Pascual wrote:
>
> Hi Jose,
>
> I don't know if it's the case here, but this error tends to arise after
> changing the configuration in slurmctld without rebooting the compute
> nodes, or when they have a different configuration there. Have you
> double-checked this?
>
> Best regards,
>
> Manuel
>
> On Tuesday, 4 October 2016, Jose Antonio <joseantonio.berna...@um.es>
> wrote:
>
>> Hi,
>>
>> Currently I have set the SelectType parameter to "select/linear", which
>> works fine. However, when a job is sent to a node, the job takes all the
>> CPUs of the machine, even if it only uses one core.
>>
>> That is why I changed SelectType to "select/cons_res" and
>> SelectTypeParameters to "CR_CPU", but this doesn't seem to work.
>> If I try to send a task to a partition, which works with select/linear,
>> the following message pops up:
>>
>> sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received
>> sbatch: error: Batch job submission failed: Zero Bytes were transmitted
>> or received
>>
>> The log on the server node (/var/log/slurmctld.log):
>>
>> error: we don't have select plugin type 102
>> error: select_g_select_jobinfo_unpack: unpack error
>> error: Malformed RPC of type REQUEST_SUBMIT_BATCH_JOB(4003) received
>> error: slurm_receive_msg: Header lengths are longer than data received
>> error: slurm_receive_msg [155.54.204.200:38850]: Header lengths are
>> longer than data received
>>
>> There is no update in the compute node logs after this error comes up.
>>
>> Any ideas?
>>
>> Thanks,
>>
>> Jose
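For reference, the change Jose describes amounts to something like the following lines in slurm.conf (a sketch showing only the settings discussed in this thread; PluginDir is usually optional, as it defaults to the directory chosen at build time):

```
# Consumable-resource scheduling: allocate individual CPUs, not whole nodes
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
# Only needed if Slurm cannot find its plugins in the default location
PluginDir=/usr/lib64/slurm
```

This file must be identical on the controller and every compute node; the "Zero Bytes" and "select plugin type" unpack errors above are the typical symptom when one side is still running with the old select plugin.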