It works well.
 
[root@GO1]~# hostname
GO1

[root@GO2]~# hostname
GO2

[root@GO3]~# hostname
GO3

[root@GO4]~# hostname
GO4

[root@GO5]~# hostname
GO5
Additionally, I tried the sbatch command, and it worked:

$ sbatch test.sh
$ sbatch test.sh
$ sbatch test.sh
$ sbatch test.sh
$ sbatch test.sh

[root@GO1]/lustre# cat slurm-1*
10:59:54 up 55 days, 14:45, 2 users, load average: 8.42, 7.16, 5.46
stress: info: [1263] dispatching hogs: 8 cpu, 1 io, 1 vm, 0 hdd
stress: info: [1263] successful run completed in 10s
11:00:04 up 55 days, 14:45, 2 users, load average: 9.74, 7.49, 5.58
GO1
10:59:57 up 22 days, 17:05, 1 user, load average: 0.03, 0.16, 0.12
stress: info: [9741] dispatching hogs: 8 cpu, 1 io, 1 vm, 0 hdd
stress: info: [9741] successful run completed in 11s
11:00:08 up 22 days, 17:05, 1 user, load average: 1.49, 0.47, 0.22
GO2
10:59:57 up 22 days, 17:36, 1 user, load average: 0.07, 0.16, 0.12
stress: info: [16090] dispatching hogs: 8 cpu, 1 io, 1 vm, 0 hdd
stress: info: [16090] successful run completed in 11s
11:00:08 up 22 days, 17:37, 1 user, load average: 1.45, 0.45, 0.21
GO3
10:59:57 up 22 days, 17:38, 1 user, load average: 0.04, 0.17, 0.12
stress: info: [28462] dispatching hogs: 8 cpu, 1 io, 1 vm, 0 hdd
stress: info: [28462] successful run completed in 10s
11:00:07 up 22 days, 17:38, 1 user, load average: 1.49, 0.47, 0.22
GO4
10:59:57 up 22 days, 17:38, 1 user, load average: 0.00, 0.01, 0.05
stress: info: [28974] dispatching hogs: 8 cpu, 1 io, 1 vm, 0 hdd
stress: info: [28974] successful run completed in 10s
11:00:07 up 22 days, 17:38, 1 user, load average: 1.54, 0.34, 0.15
GO5
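
(For reference, test.sh is essentially the following; this is a sketch
reconstructed from the output above, so the exact options may differ.)

#!/bin/bash
# Show the load before the run, stress all 8 cores plus one io and one
# vm worker for about 10 seconds, show the load after, then print
# which node ran the job.
uptime
stress --cpu 8 --io 1 --vm 1 --timeout 10
uptime
hostname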

But I do not know why I can't assign jobs to a specific node.
Are my configs right?
I'm wondering whether NodeName and NodeHostName are set correctly.
I tried giving the same name to both NodeName and NodeHostName,
but then slurmctld and slurmd would not start because of the config.
So I changed NodeName to sgo[1-5].
Is that correct?
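
For reference, here is the node definition I am asking about (a sketch;
the NodeAddr values come from the commented-out line in my slurm.conf
below):

# NodeName is the name Slurm uses internally; NodeHostName must match
# what `hostname` returns on each machine; NodeAddr is optional when
# the host names resolve.
NodeName=sgo[1-5] NodeHostName=GO[1-5] NodeAddr=192.168.30.[74,141,68,70,72]
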
-----Original Message-----
From: "Gilles Gouaillardet"<[email protected]> 
To: "slurm-dev"<[email protected]>; 
Cc: 
Sent: 2017-07-28 (Fri) 10:56:45
Subject: [slurm-dev] Re: Why my slurm is running on only one node?
 

What if you manually run the hostname command on all your hosts (i.e.
without using Slurm)?

Do you get the expected result?
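
For example, something like this (a sketch, assuming passwordless ssh
to each node):

for host in GO1 GO2 GO3 GO4 GO5; do ssh "$host" hostname; done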



On 7/28/2017 10:51 AM, 허웅 wrote:
>
> I also have tried to run a job at specific node like:
>
> srun --nodelist=sgo2 hostname
>
> and here is my slurmctld log file
>
> [2017-07-28T10:44:04.975] debug2: sched: Processing RPC: 
> REQUEST_RESOURCE_ALLOCATION from uid=0
>
> [2017-07-28T10:44:04.975] debug3: JobDesc: user_id=0 job_id=N/A 
> partition=(null) name=hostname
>
> [2017-07-28T10:44:04.975] debug3:    cpus=1-4294967294 pn_min_cpus=-1 
> core_spec=-1
>
> [2017-07-28T10:44:04.975] debug3:    Nodes=1-[1] Sock/Node=65534 
> Core/Sock=65534 Thread/Core=65534
>
> [2017-07-28T10:44:04.975] debug3: 
>  pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
>
> [2017-07-28T10:44:04.975] debug3:    immediate=0 features=(null) 
> reservation=(null)
>
> [2017-07-28T10:44:04.975] debug3:    req_nodes=sgo2 exc_nodes=(null) 
> gres=(null)
>
> [2017-07-28T10:44:04.975] debug3:    time_limit=-1--1 priority=-1 
> contiguous=0 shared=-1
>
> [2017-07-28T10:44:04.975] debug3:    kill_on_node_fail=-1 script=(null)
>
> [2017-07-28T10:44:04.975] debug3:    argv="hostname"
>
> [2017-07-28T10:44:04.975] debug3:    stdin=(null) stdout=(null) 
> stderr=(null)
>
> [2017-07-28T10:44:04.975] debug3:    work_dir=/lustre 
> alloc_node:sid=GO1:31523
>
> [2017-07-28T10:44:04.975] debug3:    power_flags=
>
> [2017-07-28T10:44:04.975] debug3:    resp_host=192.168.30.74 
> alloc_resp_port=57496 other_port=46302
>
> [2017-07-28T10:44:04.975] debug3:    dependency=(null) account=(null) 
> qos=(null) comment=(null)
>
> [2017-07-28T10:44:04.975] debug3:    mail_type=0 mail_user=(null) 
> nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
>
> [2017-07-28T10:44:04.975] debug3:    network=(null) begin=Unknown 
> cpus_per_task=-1 requeue=-1 licenses=(null)
>
> [2017-07-28T10:44:04.975] debug3:    end_time= signal=0@0 
> wait_all_nodes=1 cpu_freq=
>
> [2017-07-28T10:44:04.975] debug3:    ntasks_per_node=-1 
> ntasks_per_socket=-1 ntasks_per_core=-1
>
> [2017-07-28T10:44:04.975] debug3:    mem_bind=65534:(null) 
> plane_size:65534
>
> [2017-07-28T10:44:04.975] debug3:    array_inx=(null)
>
> [2017-07-28T10:44:04.975] debug3:    burst_buffer=(null)
>
> [2017-07-28T10:44:04.975] debug3:    mcs_label=(null)
>
> [2017-07-28T10:44:04.975] debug3:    deadline=Unknown
>
> [2017-07-28T10:44:04.975] debug3:    bitflags=0 delay_boot=4294967294
>
> [2017-07-28T10:44:04.975] debug3: before alteration asking for nodes 
> 1-1 cpus 1-4294967294
>
> [2017-07-28T10:44:04.975] debug3: after alteration asking for nodes 
> 1-1 cpus 1-4294967294
>
> [2017-07-28T10:44:04.975] debug2: found 5 usable nodes from config 
> containing sgo[1-5]
>
> [2017-07-28T10:44:04.975] debug3: _pick_best_nodes: job 160 idle_nodes 
> 5 share_nodes 5
>
> [2017-07-28T10:44:04.975] debug2: sched: JobId=160 allocated 
> resources: NodeList=sgo2
>
> [2017-07-28T10:44:04.975] sched: _slurm_rpc_allocate_resources 
> JobId=160 NodeList=sgo2 usec=517
>
> [2017-07-28T10:44:04.975] debug3: Writing job id 160 to header record 
> of job_state file
>
> [2017-07-28T10:44:04.976] debug2: _slurm_rpc_job_ready(160)=3 usec=2
>
> [2017-07-28T10:44:04.977] debug3: StepDesc: user_id=0 job_id=160 
> node_count=1-1 cpu_count=8 num_tasks=1
>
> [2017-07-28T10:44:04.977] debug3:    cpu_freq_gov=4294967294 
> cpu_freq_max=4294967294 cpu_freq_min=4294967294 relative=65534 
> task_dist=0x2000 plane=65534
>
> [2017-07-28T10:44:04.977] debug3:    node_list=sgo2  constraints=(null)
>
> [2017-07-28T10:44:04.977] debug3:    host=GO1 port=55720 
> srun_pid=30819 name=hostname network=(null) exclusive=0
>
> [2017-07-28T10:44:04.977] debug3: 
>  checkpoint-dir=/var/slurm/checkpoint checkpoint_int=0
>
> [2017-07-28T10:44:04.977] debug3:    mem_per_node=0 
> resv_port_cnt=65534 immediate=0 no_kill=0
>
> [2017-07-28T10:44:04.977] debug3:    overcommit=1 time_limit=0 gres=(null)
>
> [2017-07-28T10:44:04.977] debug3: step_layout cpus = 8 pos = 0
>
> [2017-07-28T10:44:04.977] debug:  laying out the 1 tasks on 1 hosts 
> sgo2 dist 2
>
> [2017-07-28T10:44:04.994] debug2: Processing RPC: 
> REQUEST_COMPLETE_JOB_ALLOCATION from uid=0, JobId=160 rc=0
>
> [2017-07-28T10:44:04.994] job_complete: JobID=160 State=0x1 NodeCnt=1 
> WEXITSTATUS 0
>
> [2017-07-28T10:44:04.994] job_complete: JobID=160 State=0x8003 
> NodeCnt=1 done
>
> [2017-07-28T10:44:04.994] debug2: _slurm_rpc_complete_job_allocation: 
> JobID=160 State=0x8003 NodeCnt=1
>
> [2017-07-28T10:44:04.994] debug2: Spawning RPC agent for msg_type 
> SRUN_JOB_COMPLETE
>
> [2017-07-28T10:44:04.994] debug2: got 1 threads to send out
>
> [2017-07-28T10:44:04.994] debug3: slurm_send_only_node_msg: sent 181
>
> [2017-07-28T10:44:04.999] debug2: Spawning RPC agent for msg_type 
> REQUEST_TERMINATE_JOB
>
> [2017-07-28T10:44:04.999] debug2: got 1 threads to send out
>
> [2017-07-28T10:44:04.999] debug2: Tree head got back 0 looking for 1
>
> [2017-07-28T10:44:04.999] debug3: Tree sending to GO1
>
> [2017-07-28T10:44:05.000] debug2: Tree head got back 1
>
> [2017-07-28T10:44:05.004] debug2: node_did_resp GO1
>
> [2017-07-28T10:44:05.005] debug:  sched: Running job scheduler
>
> [2017-07-28T10:44:09.001] debug3: Writing job id 160 to header record 
> of job_state file
>
>
> In this log:
>
> [2017-07-28T10:44:04.975] debug3:    req_nodes=sgo2 exc_nodes=(null) 
> gres=(null)
>
> [2017-07-28T10:44:04.975] debug2: sched: JobId=160 allocated 
> resources: NodeList=sgo2
>
> I checked that req_nodes is exactly what I requested.
>
> But the result is GO1, not GO2.
>
> What happened?
>
> -----Original Message-----
> *From:* "Said Mohamed Said"<[email protected]>
> *To:* "slurm-dev"<[email protected]>;
> *Cc:*
> *Sent:* 2017-07-28 (Fri) 10:16:44
> *Subject:* [slurm-dev] Re: Why my slurm is running on only one node?
>
> If you still have a problem, run a job at a specific node, like
> [ srun --nodelist=go2 hostname ], and if the command is not
> successful, check the corresponding log file for any errors.
>
>
> cheers,
>
>
> Said.
>
> ------------------------------------------------------------------------
> *From:* Lachlan Musicman <[email protected]>
> *Sent:* Friday, July 28, 2017 10:02:57 AM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: Why my slurm is running on only one node?
> Ok! Good, so the servers are there.
>
> You should expect to see output from
>
> srun -w go2 hostname
>
> Alternatively, you should get a different hostname if you run
>
> srun --time=0-06:00 --mem=8gb "$@" --pty -u bash -i
>
> for instance.
>
> Try running a stress test that requests more than one node, or more
> CPUs than a single node has; that should show multiple nodes. Hopefully.
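>
> For example (a sketch, assuming 8 cores per node; adjust to your
> hardware):
>
> srun -N2 --ntasks-per-node=1 stress --cpu 8 --timeout 30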
>
> cheers
> L.
>
>
>
>
> ------
> "The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic 
> civics is the insistence that we cannot ignore the truth, nor should 
> we panic about it. It is a shared consciousness that our institutions 
> have failed and our ecosystem is collapsing, yet we are still here — 
> and we are creative agents who can shape our destinies. Apocalyptic 
> civics is the conviction that the only way out is through, and the 
> only way through is together. "
>
> /Greg Bloom/ @greggish 
> https://twitter.com/greggish/status/873177525903609857
>
> On 28 July 2017 at 10:57, 허웅 <[email protected]> wrote:
>
>     Here is my output of sinfo
>
>     [root@GO1]~# sinfo -N
>
>     NODELIST   NODES PARTITION STATE
>
>     sgo1           1    party* idle
>
>     sgo2           1    party* idle
>
>     sgo3           1    party* idle
>
>     sgo4           1    party* idle
>
>     sgo5           1    party* idle
>
>     [root@GO1]~# sn
>     Fri Jul 28 09:55:53 2017
>                HOSTNAMES
>                      GO1
>                      GO2
>                      GO3
>                      GO4
>                      GO5
>
>     -----Original Message-----
>     *From:* "Lachlan Musicman"<[email protected]>
>     *To:* "slurm-dev"<[email protected]>;
>     *Cc:*
>     *Sent:* 2017-07-28 (Fri) 09:51:40
>     *Subject:* [slurm-dev] Re: Why my slurm is running on only one node?
>
>     Also - are the nodes up and running as far as Slurm is concerned?
>     What is the output of:
>     sinfo -N
>
>     ?
>     (fwiw, I really like the alias
>     sn='sinfo -Nle -o "%.20n %.15C %.8O %.7t" | uniq')
>     cheers
>     L.
>
>
>     On 28 July 2017 at 10:47, Lachlan Musicman <[email protected]> wrote:
>
>         I think it's because hostname is so undemanding.
>         How many CPUs does each host have?
>         You may need to request ((number of CPUs per host) + 1) tasks
>         to see action on another node.
>         You could try using stress-ng to test higher loads:
>
>         
https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
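>
>         For instance (a sketch; stress-ng assumed installed on the
>         compute nodes):
>
>         srun -N5 --ntasks-per-node=1 stress-ng --cpu 8 --timeout 30s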
>         cheers
>         L.
>
>
>         On 28 July 2017 at 10:28, 허웅 <[email protected]> wrote:
>
>             I have 5 nodes, including the control node,
>
>             and they look like this:
>
>             Control Node : GO1
>             Compute Nodes : GO[1-5]
>
>             When I try to allocate a job to multiple nodes, only
>             one node does the work.
>
>             Example:
>
>             $ srun -N5 hostname
>             GO1
>             GO1
>             GO1
>             GO1
>             GO1
>
>             even though I expected this:
>
>             $ srun -N5 hostname
>             GO1
>             GO2
>             GO3
>             GO4
>             GO5
>
>             What should I do?
>
>             Here are my configuration details:
>
>             $ scontrol show frontend
>             FrontendName=GO1 State=IDLE Version=17.02 Reason=(null)
>             BootTime=2017-06-02T20:14:39
>             SlurmdStartTime=2017-07-27T16:29:46
>
>             FrontendName=GO2 State=IDLE Version=17.02 Reason=(null)
>             BootTime=2017-07-05T17:54:13
>             SlurmdStartTime=2017-07-27T16:30:07
>
>             FrontendName=GO3 State=IDLE Version=17.02 Reason=(null)
>             BootTime=2017-07-05T17:22:58
>             SlurmdStartTime=2017-07-27T16:30:08
>
>             FrontendName=GO4 State=IDLE Version=17.02 Reason=(null)
>             BootTime=2017-07-05T17:21:40
>             SlurmdStartTime=2017-07-27T16:30:08
>
>             FrontendName=GO5 State=IDLE Version=17.02 Reason=(null)
>             BootTime=2017-07-05T17:21:39
>             SlurmdStartTime=2017-07-27T16:30:09
>
>             $ scontrol ping
>             Slurmctld(primary/backup) at GO1/(NULL) are UP/DOWN
>
>             [slurm.conf]
>             # slurm.conf
>             #
>             # See the slurm.conf man page for more information.
>             #
>             ClusterName=linux
>             ControlMachine=GO1
>             ControlAddr=192.168.30.74
>             #
>             SlurmUser=slurm
>             SlurmctldPort=6817
>             SlurmdPort=6818
>             AuthType=auth/munge
>             StateSaveLocation=/var/lib/slurmd
>             SlurmdSpoolDir=/var/spool/slurmd
>             SwitchType=switch/none
>             MpiDefault=none
>             SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
>             SlurmdPidFile=/var/run/slurmd/slurmd.pid
>             ProctrackType=proctrack/pgid
>             ReturnToService=0
>             TreeWidth=50
>             #
>             # TIMERS
>             SlurmctldTimeout=300
>             SlurmdTimeout=300
>             InactiveLimit=0
>             MinJobAge=300
>             KillWait=30
>             Waittime=0
>             #
>             # SCHEDULING
>             SchedulerType=sched/backfill
>             FastSchedule=1
>             #
>             # LOGGING
>             SlurmctldDebug=7
>             SlurmctldLogFile=/var/log/slurmctld.log
>             SlurmdDebug=7
>             SlurmdLogFile=/var/log/slurmd.log
>             JobCompType=jobcomp/none
>             #
>             # COMPUTE NODES
>             NodeName=sgo[1-5] NodeHostName=GO[1-5]
>             #NodeAddr=192.168.30.[74,141,68,70,72]
>             #
>             # PARTITIONS
>             PartitionName=party Default=yes Nodes=ALL
>
>
