I have also tried to run a job on a specific node:
srun --nodelist=sgo2 hostname
and here is my slurmctld log:
[2017-07-28T10:44:04.975] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=0
[2017-07-28T10:44:04.975] debug3: JobDesc: user_id=0 job_id=N/A partition=(null) name=hostname
[2017-07-28T10:44:04.975] debug3: cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2017-07-28T10:44:04.975] debug3: Nodes=1-[1] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2017-07-28T10:44:04.975] debug3: pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2017-07-28T10:44:04.975] debug3: immediate=0 features=(null) reservation=(null)
[2017-07-28T10:44:04.975] debug3: req_nodes=sgo2 exc_nodes=(null) gres=(null)
[2017-07-28T10:44:04.975] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2017-07-28T10:44:04.975] debug3: kill_on_node_fail=-1 script=(null)
[2017-07-28T10:44:04.975] debug3: argv="hostname"
[2017-07-28T10:44:04.975] debug3: stdin=(null) stdout=(null) stderr=(null)
[2017-07-28T10:44:04.975] debug3: work_dir=/lustre alloc_node:sid=GO1:31523
[2017-07-28T10:44:04.975] debug3: power_flags=
[2017-07-28T10:44:04.975] debug3: resp_host=192.168.30.74 alloc_resp_port=57496 other_port=46302
[2017-07-28T10:44:04.975] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2017-07-28T10:44:04.975] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2017-07-28T10:44:04.975] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2017-07-28T10:44:04.975] debug3: end_time= signal=0@0 wait_all_nodes=1 cpu_freq=
[2017-07-28T10:44:04.975] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2017-07-28T10:44:04.975] debug3: mem_bind=65534:(null) plane_size:65534
[2017-07-28T10:44:04.975] debug3: array_inx=(null)
[2017-07-28T10:44:04.975] debug3: burst_buffer=(null)
[2017-07-28T10:44:04.975] debug3: mcs_label=(null)
[2017-07-28T10:44:04.975] debug3: deadline=Unknown
[2017-07-28T10:44:04.975] debug3: bitflags=0 delay_boot=4294967294
[2017-07-28T10:44:04.975] debug3: before alteration asking for nodes 1-1 cpus 1-4294967294
[2017-07-28T10:44:04.975] debug3: after alteration asking for nodes 1-1 cpus 1-4294967294
[2017-07-28T10:44:04.975] debug2: found 5 usable nodes from config containing sgo[1-5]
[2017-07-28T10:44:04.975] debug3: _pick_best_nodes: job 160 idle_nodes 5 share_nodes 5
[2017-07-28T10:44:04.975] debug2: sched: JobId=160 allocated resources: NodeList=sgo2
[2017-07-28T10:44:04.975] sched: _slurm_rpc_allocate_resources JobId=160 NodeList=sgo2 usec=517
[2017-07-28T10:44:04.975] debug3: Writing job id 160 to header record of job_state file
[2017-07-28T10:44:04.976] debug2: _slurm_rpc_job_ready(160)=3 usec=2
[2017-07-28T10:44:04.977] debug3: StepDesc: user_id=0 job_id=160 node_count=1-1 cpu_count=8 num_tasks=1
[2017-07-28T10:44:04.977] debug3: cpu_freq_gov=4294967294 cpu_freq_max=4294967294 cpu_freq_min=4294967294 relative=65534 task_dist=0x2000 plane=65534
[2017-07-28T10:44:04.977] debug3: node_list=sgo2 constraints=(null)
[2017-07-28T10:44:04.977] debug3: host=GO1 port=55720 srun_pid=30819 name=hostname network=(null) exclusive=0
[2017-07-28T10:44:04.977] debug3: checkpoint-dir=/var/slurm/checkpoint checkpoint_int=0
[2017-07-28T10:44:04.977] debug3: mem_per_node=0 resv_port_cnt=65534 immediate=0 no_kill=0
[2017-07-28T10:44:04.977] debug3: overcommit=1 time_limit=0 gres=(null)
[2017-07-28T10:44:04.977] debug3: step_layout cpus = 8 pos = 0
[2017-07-28T10:44:04.977] debug: laying out the 1 tasks on 1 hosts sgo2 dist 2
[2017-07-28T10:44:04.994] debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION from uid=0, JobId=160 rc=0
[2017-07-28T10:44:04.994] job_complete: JobID=160 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2017-07-28T10:44:04.994] job_complete: JobID=160 State=0x8003 NodeCnt=1 done
[2017-07-28T10:44:04.994] debug2: _slurm_rpc_complete_job_allocation: JobID=160 State=0x8003 NodeCnt=1
[2017-07-28T10:44:04.994] debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
[2017-07-28T10:44:04.994] debug2: got 1 threads to send out
[2017-07-28T10:44:04.994] debug3: slurm_send_only_node_msg: sent 181
[2017-07-28T10:44:04.999] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2017-07-28T10:44:04.999] debug2: got 1 threads to send out
[2017-07-28T10:44:04.999] debug2: Tree head got back 0 looking for 1
[2017-07-28T10:44:04.999] debug3: Tree sending to GO1
[2017-07-28T10:44:05.000] debug2: Tree head got back 1
[2017-07-28T10:44:05.004] debug2: node_did_resp GO1
[2017-07-28T10:44:05.005] debug: sched: Running job scheduler
[2017-07-28T10:44:09.001] debug3: Writing job id 160 to header record of job_state file
In this log,
[2017-07-28T10:44:04.975] debug3: req_nodes=sgo2 exc_nodes=(null) gres=(null)
[2017-07-28T10:44:04.975] debug2: sched: JobId=160 allocated resources: NodeList=sgo2
I checked that req_nodes is exactly the node I requested, and the allocation went to sgo2.
But the result is GO1, not GO2.
What happened?
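One way to narrow this down (a sketch only; scontrol is standard Slurm, and sgo2 is the node name from this cluster):
scontrol show node sgo2 | grep -E 'NodeAddr|NodeHostName'   # how the controller maps sgo2 to a real host
scontrol show slurmd    # run on the host you expect to be sgo2, to see which slurmd answers there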
-----Original Message-----
*From:* "Said Mohamed Said"<said.moha...@oist.jp>
*To:* "slurm-dev"<slurm-dev@schedmd.com>;
*Cc:*
*Sent:* 2017-07-28 (Fri) 10:16:44
*Subject:* [slurm-dev] Re: Why my slurm is running on only one node?
If you still have a problem, run a job on a specific node, e.g. [ srun
--nodelist=go2 hostname ], and if the command is not successful, check
the corresponding log file for errors.
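For example (a sketch; the log paths are the ones set in the slurm.conf quoted at the bottom of this thread):
grep -i error /var/log/slurmctld.log | tail    # on the controller, GO1
grep -i error /var/log/slurmd.log | tail       # on the compute node that should have run the job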
cheers,
Said.
------------------------------------------------------------------------
*From:* Lachlan Musicman <data...@gmail.com>
*Sent:* Friday, July 28, 2017 10:02:57 AM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: Why my slurm is running on only one node?
Ok! Good, so the servers are there.
You should expect to see output from
srun -w go2 hostname
Alternatively, you should get a different hostname if you run
srun --time=0-06:00 --mem=8gb "$@" --pty -u bash -i
for instance.
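Once inside the interactive shell, you can confirm where you landed (a sketch; these environment variables are set by Slurm for every job):
hostname                   # the host actually executing the shell
echo $SLURMD_NODENAME      # the Slurm node name of that host
echo $SLURM_JOB_NODELIST   # the node list the controller allocated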
Try running a stress test that requests more than one node and more
CPUs than a single node has; that should show multiple nodes. Hopefully.
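Something like this (a sketch; the figure of 8 CPUs per node is an assumption based on the cpu_count=8 in the log above):
srun --nodes=2 --ntasks=9 hostname    # nine tasks cannot fit on one 8-CPU node, forcing a second node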
cheers
L.
------
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic
civics is the insistence that we cannot ignore the truth, nor should
we panic about it. It is a shared consciousness that our institutions
have failed and our ecosystem is collapsing, yet we are still here —
and we are creative agents who can shape our destinies. Apocalyptic
civics is the conviction that the only way out is through, and the
only way through is together. "
/Greg Bloom/ @greggish
https://twitter.com/greggish/status/873177525903609857
On 28 July 2017 at 10:57, 허웅 <hoewoongg...@naver.com> wrote:
Here is my output of sinfo:
[root@GO1]~# sinfo -N
NODELIST  NODES  PARTITION  STATE
sgo1      1      party*     idle
sgo2      1      party*     idle
sgo3      1      party*     idle
sgo4      1      party*     idle
sgo5      1      party*     idle
[root@GO1]~# sn
Fri Jul 28 09:55:53 2017
HOSTNAMES
GO1
GO2
GO3
GO4
GO5
-----Original Message-----
*From:* "Lachlan Musicman"<data...@gmail.com
<mailto:data...@gmail.com>>
*To:* "slurm-dev"<slurm-dev@schedmd.com
<mailto:slurm-dev@schedmd.com>>;
*Cc:*
*Sent:* 2017-07-28 (금) 09:51:40
*Subject:* [slurm-dev] Re: Why my slurm is running on only one node?
Also - are the nodes up and running wrt SLURM? What is the output of:
sinfo -N
?
(fwiw, I really like the alias sn='sinfo -Nle -o "%.20n %.15C %.8O
%.7t" | uniq')
cheers
L.
On 28 July 2017 at 10:47, Lachlan Musicman <data...@gmail.com> wrote:
I think it's because hostname is so undemanding.
How many CPUs does each host have?
You may need to request ((number of CPUs per host) + 1) tasks to see
action on another node.
You can try using stress-ng to test higher loads:
https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
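For instance (a sketch; it assumes stress-ng is installed on the compute nodes):
srun -N2 --ntasks-per-node=1 stress-ng --cpu 0 --timeout 60s    # --cpu 0 tells stress-ng to use all online CPUs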
cheers
L.
On 28 July 2017 at 10:28, 허웅 <hoewoongg...@naver.com> wrote:
I have 5 nodes, including the control node.
My nodes look like this:
Control Node: GO1
Compute Nodes: GO[1-5]
When I try to allocate a job to multiple nodes, only one node does the
work.
Example:
$ srun -N5 hostname
GO1
GO1
GO1
GO1
GO1
even though I expected this:
$ srun -N5 hostname
GO1
GO2
GO3
GO4
GO5
What should I do?
Here are my configuration files:
$ scontrol show frontend
FrontendName=GO1 State=IDLE Version=17.02 Reason=(null)
BootTime=2017-06-02T20:14:39
SlurmdStartTime=2017-07-27T16:29:46
FrontendName=GO2 State=IDLE Version=17.02 Reason=(null)
BootTime=2017-07-05T17:54:13
SlurmdStartTime=2017-07-27T16:30:07
FrontendName=GO3 State=IDLE Version=17.02 Reason=(null)
BootTime=2017-07-05T17:22:58
SlurmdStartTime=2017-07-27T16:30:08
FrontendName=GO4 State=IDLE Version=17.02 Reason=(null)
BootTime=2017-07-05T17:21:40
SlurmdStartTime=2017-07-27T16:30:08
FrontendName=GO5 State=IDLE Version=17.02 Reason=(null)
BootTime=2017-07-05T17:21:39
SlurmdStartTime=2017-07-27T16:30:09
$ scontrol ping
Slurmctld(primary/backup) at GO1/(NULL) are UP/DOWN
[slurm.conf]
# slurm.conf
#
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=GO1
ControlAddr=192.168.30.74
#
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/lib/slurmd
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
SlurmdPidFile=/var/run/slurmd/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=0
TreeWidth=50
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
FastSchedule=1
#
# LOGGING
SlurmctldDebug=7
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=7
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
#
# COMPUTE NODES
NodeName=sgo[1-5] NodeHostName=GO[1-5]
#NodeAddr=192.168.30.[74,141,68,70,72]
#
# PARTITIONS
PartitionName=party Default=yes Nodes=ALL
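With FastSchedule=1 the scheduler trusts the hardware described in slurm.conf rather than what the nodes report, so node definitions usually spell out resources explicitly. A sketch only; the CPU and memory figures below are illustrative assumptions, not values from this cluster:
NodeName=sgo[1-5] NodeHostName=GO[1-5] CPUs=8 RealMemory=16000 State=UNKNOWN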