Hello Lachlan,

Thanks for your reply. All the nodes have access to /usr/lib64/slurm; the directory is not shared, each node has its own copy, and yes, they all have the file "select_cons_res.so":

$ ls /usr/lib64/slurm/ | grep select
select_alps.so
select_bluegene.so
select_cons_res.so
select_cray.so
select_linear.so
select_serial.so
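
In case it is useful, here is a rough way I could run the same check on every node at once (this assumes pdsh is available; the host list "phb[1-2]" is only an illustration, not my real node list):

# each node should print the path back if the plugin is present there
$ pdsh -w phb[1-2] ls /usr/lib64/slurm/select_cons_res.so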

Thanks for the tip; "scontrol reconfigure" is much easier than restarting the daemons.

Since the problem persists, I am posting my slurm.conf in case I am messing something up:

ControlMachine=phb1
ControlAddr=X

MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/cons_res # If I change it to select/linear it works fine
SelectTypeParameters=CR_CPU
#
# LOGGING AND ACCOUNTING
AccountingStorageHost=phb1
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=phb1-cluster
#JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=phb2 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15877 NodeAddr=X State=UNKNOWN
PartitionName=phb2-queue Nodes=phb2 Default=Yes MaxTime=1:00:00 State=UP
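
Once cons_res is accepted, my plan for double-checking the per-CPU allocation is something along these lines (the --wrap command and the job id are only placeholders):

# submit a one-task job and confirm it is allocated a single CPU rather than the whole node
$ sbatch -n 1 --wrap="sleep 60"
$ scontrol show job <jobid> | grep NumCPUs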

Regards,

Jose

On 05/10/2016 at 0:12, Lachlan Musicman wrote:
Jose,

Do all the nodes have access to a shared /usr/lib64/slurm, or do they each have their own? And is there a file in that dir (on each machine) called select_cons_res.so?

Also, when changing slurm.conf, here's a quick and easy workflow:

1. change slurm.conf
2. deploy to all machines in cluster (I use ansible, but puppet, satellite, clusterssh, pssh, pdsh etc are also good here)
3. on head node: restart slurmctld
4. on head node: run "scontrol reconfigure"


That's it. No need to reboot any nodes or even log in to them.
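
As a rough sketch of steps 2-4 with pdsh/pdcp (the host list and the config path are just examples, adjust them to your setup):

# 2. push the edited slurm.conf to every node in the cluster
pdcp -w phb[1-2] /etc/slurm/slurm.conf /etc/slurm/slurm.conf
# 3. restart the controller on the head node (systemd assumed here)
systemctl restart slurmctld
# 4. ask all running daemons to re-read the configuration
scontrol reconfigure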

cheers
L.



------
The most dangerous phrase in the language is, "We've always done it this way."

- Grace Hopper

On 5 October 2016 at 07:25, Jose Antonio <joseantonio.berna...@um.es> wrote:

    Hi Manuel,

    Thanks for replying. Yes, I have checked the slurm.conf; it is the
    same on the server and the compute nodes. I restarted the slurmd
    daemon on the compute nodes and then restarted the slurmctld
    service on the server. I also rebooted the machines, but it keeps
    showing the same error message on the console (Zero Bytes
    were...) and in the log files.

    I have also set PluginDir=/usr/lib64/slurm in case it could not
    find the plugins, but that does not help either.
    All the partitions are active (idle); none of them went into the
    down or drained state.

    Regards,

    Jose


    On 04/10/2016 at 20:28, Manuel Rodríguez Pascual wrote:
    Hi Jose,

    I don't know if this is the case, but this error tends to arise
    after changing the configuration in slurmctld without restarting
    the compute nodes, or when they have a different configuration
    there. Have you double-checked this?

    Best regards,

    Manuel

    On Tuesday, 4 October 2016, Jose Antonio
    <joseantonio.berna...@um.es> wrote:


        Hi,

        Currently I have the SelectType parameter set to
        "select/linear", which works fine. However, when a job is
        sent to a node, the job takes all the CPUs of the machine,
        even if it only uses 1 core.

        That is why I changed SelectType to "select/cons_res" and
        SelectTypeParameters to "CR_CPU", but this does not seem to
        work. If I try to submit a job to a partition that works
        with select/linear, the following message pops up:

        sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received
        sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received

        The log on the server node (/var/log/slurmctld.log) shows:

        error: we don't have select plugin type 102
        error: select_g_select_jobinfo_unpack: unpack error
        error: Malformed RPC of type REQUEST_SUBMIT_BATCH_JOB(4003) received
        error: slurm_receive_msg: Header lengths are longer than data received
        error: slurm_receive_msg [155.54.204.200:38850]: Header lengths are longer than data received

        There is no update in the compute node logs after this error
        comes up.

        Any ideas?

        Thanks,

        Jose



