I could not find anything related to Frontend in the slurm.conf you posted.


Anyway, from https://slurm.schedmd.com/slurm.conf.html:


*FrontendName*
    Name that Slurm uses to refer to a frontend node. Typically this
    would be the string that "/bin/hostname -s" returns. It may also
    be the fully qualified domain name as returned by "/bin/hostname
    -f" (e.g. "foo1.bar.com"), or any valid domain name associated
    with the host through the host database (/etc/hosts) or DNS,
    depending on the resolver settings. Note that if the short form of
    the hostname is not used, it may prevent use of hostlist
    expressions (the numeric portion in brackets must be at the end of
    the string). If the *FrontendName* is "DEFAULT", the values
    specified with that record will apply to subsequent node
    specifications unless explicitly set to other values in that
    frontend node record or replaced with a different set of default
    values. Each line where *FrontendName* is "DEFAULT" will replace
    or add to previous default values and not reinitialize the
    default values. Note that since the naming of front end nodes
    would typically not follow that of the compute nodes (e.g. lacking
    X, Y and Z coordinates found in the compute node naming scheme),
    each front end node name should be listed separately and without a
    hostlist expression (i.e. "frontend00,frontend01" rather than
    "frontend[00-01]").

Can you try updating your conf (based on your original one) *without* using hostlist expressions?
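For example, a minimal sketch with the hostlist expressions of your original conf expanded by hand (the sgo/GO names are copied from the slurm.conf quoted below):

# COMPUTE NODES, listed individually
NodeName=sgo1 NodeHostName=GO1
NodeName=sgo2 NodeHostName=GO2
NodeName=sgo3 NodeHostName=GO3
NodeName=sgo4 NodeHostName=GO4
NodeName=sgo5 NodeHostName=GO5

# PARTITIONS
PartitionName=party Default=yes Nodes=ALL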


On 7/28/2017 1:30 PM, 허웅 wrote:

I modified my slurm.conf like this:

NodeName=GO[1-5]

PartitionName=party Default=yes Nodes=GO[1-5]

and I restarted the slurmctld and slurmd services.

[root@GO1]~# systemctl start slurmctld
[root@GO1]~# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2017-07-28 13:26:27 KST; 1s ago
  Process: 19583 ExecStart=/usr/sbin/slurmctld (code=exited, status=0/SUCCESS)
 Main PID: 19586 (code=exited, status=1/FAILURE)

Jul 28 13:26:27 GO1 systemd[1]: Starting Slurm controller daemon...
Jul 28 13:26:27 GO1 systemd[1]: PID file /var/run/slurmd/slurmctld.pid not readable (yet?) after start.
Jul 28 13:26:27 GO1 systemd[1]: Started Slurm controller daemon.
Jul 28 13:26:27 GO1 slurmctld[19586]: fatal: Frontend not configured correctly in slurm.conf. See man slurm.conf look for frontendname.
Jul 28 13:26:27 GO1 systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
Jul 28 13:26:27 GO1 systemd[1]: Unit slurmctld.service entered failed state.
Jul 28 13:26:27 GO1 systemd[1]: slurmctld.service failed.

[root@GO1]~# systemctl restart slurmd
Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe" for details.

[root@GO1]~# systemctl status slurmd
● slurmd.service - Slurm Node daemon
   Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2017-07-28 13:27:47 KST; 7s ago
  Process: 19922 ExecStart=/usr/sbin/slurmd (code=exited, status=1/FAILURE)
 Main PID: 24228 (code=exited, status=0/SUCCESS)

Jul 28 13:27:47 GO1 systemd[1]: Starting Slurm Node daemon...
Jul 28 13:27:47 GO1 systemd[1]: slurmd.service: control process exited, code=exited status=1
Jul 28 13:27:47 GO1 systemd[1]: Failed to start Slurm Node daemon.
Jul 28 13:27:47 GO1 systemd[1]: Unit slurmd.service entered failed state.
Jul 28 13:27:47 GO1 systemd[1]: slurmd.service failed.

[root@GO1]~# /usr/sbin/slurmd
slurmd: fatal: Frontend not configured correctly in slurm.conf. See man slurm.conf look for frontendname.
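For what it is worth, this fatal error appears to be raised by daemons built with front-end support (the ./configure --enable-front-end option); such a build will not start without explicit FrontendName entries in slurm.conf. A minimal sketch, assuming the GO1-GO5 hosts from this thread and following the man page advice above to list each front end node separately:

FrontendName=GO1
FrontendName=GO2
FrontendName=GO3
FrontendName=GO4
FrontendName=GO5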


-----Original Message-----
*From:* "Gilles Gouaillardet"<gil...@rist.or.jp>
*To:* "slurm-dev"<slurm-dev@schedmd.com>;
*Cc:*
*Sent:* 2017-07-28 (Fri) 11:32:26
*Subject:* [slurm-dev] Re: Why my slurm is running on only one node?


What if you use this in your slurm.conf instead?


# COMPUTE NODES
NodeName=GO[1-5]


# PARTITIONS
PartitionName=party Default=yes Nodes=GO[1-5]
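
Then, once slurmctld and slurmd have been restarted on all nodes, a quick sanity check (standard Slurm commands; the expected five-node listing is an assumption based on this thread):

$ sinfo -N              # should print one line per node, GO1 through GO5
$ scontrol show nodes   # should print a record for each of the five nodes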


On 7/28/2017 9:28 AM, 허웅 wrote:
> I have 5 nodes, including the control node.
>
> My nodes look like this:
>
> Control Node : GO1
> Compute Nodes : GO[1-5]
>
> When I try to allocate a job to multiple nodes, only one node works.
>
> Example:
>
> $ srun -N5 hostname
> GO1
> GO1
> GO1
> GO1
> GO1
>
> even though I expected this:
>
> $ srun -N5 hostname
> GO1
> GO2
> GO3
> GO4
> GO5
>
> What should I do?
>
> Here are my configurations:
>
> $ scontrol show frontend
> FrontendName=GO1 State=IDLE Version=17.02 Reason=(null)
> BootTime=2017-06-02T20:14:39 SlurmdStartTime=2017-07-27T16:29:46
>
> FrontendName=GO2 State=IDLE Version=17.02 Reason=(null)
> BootTime=2017-07-05T17:54:13 SlurmdStartTime=2017-07-27T16:30:07
>
> FrontendName=GO3 State=IDLE Version=17.02 Reason=(null)
> BootTime=2017-07-05T17:22:58 SlurmdStartTime=2017-07-27T16:30:08
>
> FrontendName=GO4 State=IDLE Version=17.02 Reason=(null)
> BootTime=2017-07-05T17:21:40 SlurmdStartTime=2017-07-27T16:30:08
>
> FrontendName=GO5 State=IDLE Version=17.02 Reason=(null)
> BootTime=2017-07-05T17:21:39 SlurmdStartTime=2017-07-27T16:30:09
>
> $ scontrol ping
> Slurmctld(primary/backup) at GO1/(NULL) are UP/DOWN
>
> [slurm.conf]
> # slurm.conf
> #
> # See the slurm.conf man page for more information.
> #
> ClusterName=linux
> ControlMachine=GO1
> ControlAddr=192.168.30.74
> #
> SlurmUser=slurm
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/var/lib/slurmd
> SlurmdSpoolDir=/var/spool/slurmd
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd/slurmd.pid
> ProctrackType=proctrack/pgid
> ReturnToService=0
> TreeWidth=50
> #
> # TIMERS
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> FastSchedule=1
> #
> # LOGGING
> SlurmctldDebug=7
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=7
> SlurmdLogFile=/var/log/slurmd.log
> JobCompType=jobcomp/none
> #
> # COMPUTE NODES
> NodeName=sgo[1-5] NodeHostName=GO[1-5]
> #NodeAddr=192.168.30.[74,141,68,70,72]
> #
> # PARTITIONS
> PartitionName=party Default=yes Nodes=ALL
>
