Jose,

Do all the nodes have access to a shared /usr/lib64/slurm, or do they
each have their own copy? And is there a file in that dir (on each machine)
called select_cons_res.so?
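A quick way to check, sketched below. The node list "node[01-04]" and the use of pdsh are placeholders -- substitute your own hostnames and fan-out tool:

```shell
# Check whether the cons_res plugin exists in the plugin dir.
# PLUGIN_DIR defaults to the stock Slurm location.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/lib64/slurm}"

check_plugin() {
  # Prints "present" or "missing" for the local PLUGIN_DIR.
  if [ -f "${PLUGIN_DIR}/select_cons_res.so" ]; then
    echo present
  else
    echo missing
  fi
}

check_plugin

# Across the whole cluster (hypothetical node list):
#   pdsh -w node[01-04] 'ls -l /usr/lib64/slurm/select_cons_res.so'
```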

Also, when changing slurm.conf here's a quick and easy workflow:

1. change slurm.conf
2. deploy to all machines in cluster (I use ansible, but puppet, satellite,
clusterssh, pssh, pdsh, etc. are also good here)
3. on head node: restart slurmctld
4. on head node: run "scontrol reconfigure"
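Steps 2-4 above can be sketched roughly as follows. NODES, pdcp, and pdsh are assumptions here -- substitute ansible, puppet, or whatever you use. DRY_RUN defaults to 1 so the script only prints the commands; set DRY_RUN=0 to actually run them on a real cluster:

```shell
# Hypothetical node list -- replace with your own.
NODES="${NODES:-node[01-04]}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

run pdcp -w "$NODES" /etc/slurm/slurm.conf /etc/slurm/slurm.conf  # step 2: push the conf out
run systemctl restart slurmctld                                   # step 3: restart the controller
run scontrol reconfigure                                          # step 4: tell the slurmd's to re-read
```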


That's it. No need to reboot any nodes or even log in to them.
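For reference, the cons_res setup you're after would look something like this in slurm.conf (illustrative fragment only). The key point is that this exact file has to be identical on the controller and on every compute node -- if slurmctld and slurmd disagree about the select plugin, you get precisely the "select plugin type" / unpack errors from your log:

```
# Per-CPU allocation instead of whole-node (select/linear) allocation
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
PluginDir=/usr/lib64/slurm
```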

cheers
L.



------
The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper

On 5 October 2016 at 07:25, Jose Antonio <joseantonio.berna...@um.es> wrote:

> Hi Manuel,
>
> Thanks for replying. Yes, I have checked slurm.conf; it is the
> same on the server and on the compute nodes. I restarted the slurmd daemon on the
> compute nodes and then restarted the slurmctld service on the server. I
> rebooted the machines too, but it keeps showing the same error message on
> the console (Zero Bytes were...) and in the log files.
>
> I have also set PluginDir=/usr/lib64/slurm in case it could not
> find the plugins, but that does not work either.
> All the partitions are active (idle); they did not go into the down or
> drained state.
>
> Regards,
>
> Jose
>
>
> On 04/10/2016 at 20:28, Manuel Rodríguez Pascual wrote:
>
> Hi Jose,
>
> I don't know if this is the cause, but this error tends to arise after
> changing the configuration on the slurmctld side without restarting the
> compute nodes, or when the compute nodes have a different configuration.
> Have you double-checked this?
>
> Best regards,
>
> Manuel
>
> On Tuesday, 4 October 2016, Jose Antonio <joseantonio.berna...@um.es>
> wrote:
>
>>
>> Hi,
>>
>> Currently I have set the SelectType parameter to "select/linear", which
>> works fine. However, when a job is sent to a node, it is allocated all the
>> CPUs of the machine, even if it only uses one core.
>>
>> That is why I changed SelectType to "select/cons_res" and
>> SelectTypeParameters to "CR_CPU", but this does not seem to work. If I
>> try to submit a job to a partition that works under select/linear, the
>> following message pops up:
>>
>> sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received
>> sbatch: error: Batch job submission failed: Zero Bytes were transmitted
>> or received
>>
>> The log in the server node (/var/log/slurmctld.log):
>>
>> error: we don't have select plugin type 102
>> error: select_g_select_jobinfo_unpack: unpack error
>> error: Malformed RPC of type REQUEST_SUBMIT_BATCH_JOB(4003) received
>> error: slurm_receive_msg: Header lengths are longer than data received
>> error: slurm_receive_msg [155.54.204.200:38850]: Header lengths are
>> longer than data received
>>
>> There is no update in the compute node logs after this error comes up.
>>
>> Any ideas?
>>
>> Thanks,
>>
>> Jose
>>
>
>
