Hello Lachlan,
Thanks for your reply. All the nodes have access to /usr/lib64/slurm,
but the directory is not shared; each node has its own copy. And yes,
they do have the file "select_cons_res.so":
$ ls /usr/lib64/slurm/ | grep select
select_alps.so
select_bluegene.so
select_cons_res.so
select_cray.so
select_linear.so
select_serial.so
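Since the directory is not shared, it may also be worth checking that every node has the same build of that plugin, not just a file with the same name. A quick sketch (the node names are placeholders for your cluster):

```shell
# Compare the checksum of the cons_res plugin across nodes.
# If more than one distinct checksum appears, the builds differ,
# which can also break plugin unpacking between daemons.
for host in phb1 phb2; do
    ssh "$host" md5sum /usr/lib64/slurm/select_cons_res.so
done | awk '{print $1}' | sort -u | wc -l    # 1 means all nodes match
```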
Thanks for the tip; scontrol reconfigure is much easier than restarting
the daemons.
As the problem persists, I will post my slurm.conf just in case I am
messing something up.
ControlMachine=phb1
ControlAddr=X
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/cons_res # If I change it to select/linear it works fine
SelectTypeParameters=CR_CPU
#
# LOGGING AND ACCOUNTING
AccountingStorageHost=phb1
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=phb1-cluster
#JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=phb2 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15877 NodeAddr=X State=UNKNOWN
PartitionName=phb2-queue Nodes=phb2 Default=Yes MaxTime=1:00:00 State=UP
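In case it helps to rule out a stale copy somewhere, here is a small sketch for eyeballing the select-related settings in a given slurm.conf (the default path below is an assumption; pass your own):

```shell
#!/bin/sh
# Print the select-related settings from a slurm.conf copy so the
# values can be compared node by node. Path defaults to the usual
# location but may differ on your install.
conf=${1:-/etc/slurm/slurm.conf}
grep -E '^(SelectType|SelectTypeParameters)' "$conf"
```

Running "scontrol show config" on the head node shows the values the live slurmctld actually loaded, which should agree with the file.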
Regards,
Jose
On 05/10/2016 at 0:12, Lachlan Musicman wrote:
Re: [slurm-dev] Re: cons_res / CR_CPU - we don't have select plugin type 102
Jose,
Do all the nodes have access to either a shared /usr/lib64/slurm or do
they each have their own? And is there a file in that dir (on each
machine) called select_cons_res.so?
Also, when changing slurm.conf here's a quick and easy workflow:
1. change slurm.conf
2. deploy to all machines in cluster (I use ansible, but puppet,
satellite, clusterssh, pssh, pdsh etc are also good here)
3. on head node: restart slurmctld
4. on head node: run "scontrol reconfigure"
That's it. No need to reboot any nodes or even log in to them.
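The steps above can be sketched without ansible, too (host list and conf path are placeholders for this cluster):

```shell
#!/bin/sh
# Steps 1-4 above, using plain scp instead of a deployment tool.
for host in phb2 phb3; do
    scp /etc/slurm/slurm.conf "$host":/etc/slurm/slurm.conf   # step 2: deploy
done
systemctl restart slurmctld   # step 3: on the head node
scontrol reconfigure          # step 4: tell all slurmd's to re-read the config
```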
cheers
L.
------
The most dangerous phrase in the language is, "We've always done it
this way."
- Grace Hopper
On 5 October 2016 at 07:25, Jose Antonio <joseantonio.berna...@um.es> wrote:
Hi Manuel,
Thanks for replying. Yes, I have checked the slurm.conf, they are
all the same on the server and compute nodes. I restarted the
slurmd daemon on the compute nodes and finally restarted the
slurmctld service on the server. I rebooted the machines too, but
it keeps showing the same error message on the console (Zero Bytes
were...) and log files.
I have also set the PluginDir=/usr/lib64/slurm just in case it
could not find the plugins, but it does not work either.
All the partitions are active (idle), they did not turn to down or
drained state.
Regards,
Jose
On 04/10/2016 at 20:28, Manuel Rodríguez Pascual wrote:
Hi Jose,
I don't know if it's the case, but this error tends to arise
after changing the configuration in slurmctld but not restarting the
compute nodes, or having a different configuration there. Have you
double-checked this?
Best regards,
Manuel
On Tuesday, 4 October 2016, Jose Antonio
<joseantonio.berna...@um.es> wrote:
Hi,
Currently I have set the SelectType parameter to "select/linear",
which works fine. However, when a job is sent to a node, the job
takes all the CPUs of the machine, even if it only uses 1 core.
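(For context: with select/cons_res plus CR_CPU, a single-core batch script like this minimal sketch should be allocated one CPU and share the node, whereas select/linear always hands out whole nodes. Job name and layout are illustrative only.)

```shell
#!/bin/sh
#SBATCH --job-name=one-core
#SBATCH --ntasks=1          # one task...
#SBATCH --cpus-per-task=1   # ...pinned to a single CPU, not the whole node
srun hostname
```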
That is why I changed SelectType to "select/cons_res" and its
SelectTypeParameters to "CR_CPU", but this doesn't seem to work.
If I try to send a task to a partition which works with
select/linear, the following message pops up:
sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received
sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received
The log in the server node (/var/log/slurmctld.log):
error: we don't have select plugin type 102
error: select_g_select_jobinfo_unpack: unpack error
error: Malformed RPC of type REQUEST_SUBMIT_BATCH_JOB(4003) received
error: slurm_receive_msg: Header lengths are longer than data received
error: slurm_receive_msg [155.54.204.200:38850]: Header lengths are longer than data received
There is no update in the compute node logs after this error comes up.
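(One known cause of this kind of unpack error, "we don't have select plugin type 102", is mixed Slurm versions between the submitting client and slurmctld; the numeric id appears to identify the cons_res plugin. A quick sketch for comparing versions, with placeholder host names:)

```shell
# Compare the Slurm version on the head node and a compute node.
# Daemons and commands from different builds can fail to unpack
# each other's RPCs exactly like this.
for host in phb1 phb2; do
    printf '%s: ' "$host"
    ssh "$host" scontrol --version
done
```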
Any ideas?
Thanks,
Jose