What if you manually run the hostname command on all your hosts (i.e. without using Slurm)?

Do you get the expected result?
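
For example, something along these lines from the control node (a rough sketch, assuming password-less ssh and the GO1-GO5 host names from your earlier mails; adjust to your environment):

# run hostname over plain ssh, bypassing Slurm entirely
for h in GO1 GO2 GO3 GO4 GO5; do ssh "$h" hostname; done

Each line of output should match the host you contacted; if it does not, the problem is in the hosts' hostname/network setup rather than in Slurm.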



On 7/28/2017 10:51 AM, 허웅 wrote:

I have also tried to run a job on a specific node, like this:

srun --nodelist=sgo2 hostname

and here is my slurmctld log:

[2017-07-28T10:44:04.975] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=0
[2017-07-28T10:44:04.975] debug3: JobDesc: user_id=0 job_id=N/A partition=(null) name=hostname
[2017-07-28T10:44:04.975] debug3: cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2017-07-28T10:44:04.975] debug3: Nodes=1-[1] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2017-07-28T10:44:04.975] debug3: pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2017-07-28T10:44:04.975] debug3: immediate=0 features=(null) reservation=(null)
[2017-07-28T10:44:04.975] debug3: req_nodes=sgo2 exc_nodes=(null) gres=(null)
[2017-07-28T10:44:04.975] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2017-07-28T10:44:04.975] debug3:    kill_on_node_fail=-1 script=(null)
[2017-07-28T10:44:04.975] debug3:    argv="hostname"
[2017-07-28T10:44:04.975] debug3: stdin=(null) stdout=(null) stderr=(null)
[2017-07-28T10:44:04.975] debug3: work_dir=/lustre alloc_node:sid=GO1:31523
[2017-07-28T10:44:04.975] debug3:    power_flags=
[2017-07-28T10:44:04.975] debug3: resp_host=192.168.30.74 alloc_resp_port=57496 other_port=46302
[2017-07-28T10:44:04.975] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2017-07-28T10:44:04.975] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2017-07-28T10:44:04.975] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2017-07-28T10:44:04.975] debug3: end_time= signal=0@0 wait_all_nodes=1 cpu_freq=
[2017-07-28T10:44:04.975] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2017-07-28T10:44:04.975] debug3: mem_bind=65534:(null) plane_size:65534
[2017-07-28T10:44:04.975] debug3:    array_inx=(null)
[2017-07-28T10:44:04.975] debug3:    burst_buffer=(null)
[2017-07-28T10:44:04.975] debug3:    mcs_label=(null)
[2017-07-28T10:44:04.975] debug3:    deadline=Unknown
[2017-07-28T10:44:04.975] debug3:    bitflags=0 delay_boot=4294967294
[2017-07-28T10:44:04.975] debug3: before alteration asking for nodes 1-1 cpus 1-4294967294
[2017-07-28T10:44:04.975] debug3: after alteration asking for nodes 1-1 cpus 1-4294967294
[2017-07-28T10:44:04.975] debug2: found 5 usable nodes from config containing sgo[1-5]
[2017-07-28T10:44:04.975] debug3: _pick_best_nodes: job 160 idle_nodes 5 share_nodes 5
[2017-07-28T10:44:04.975] debug2: sched: JobId=160 allocated resources: NodeList=sgo2
[2017-07-28T10:44:04.975] sched: _slurm_rpc_allocate_resources JobId=160 NodeList=sgo2 usec=517
[2017-07-28T10:44:04.975] debug3: Writing job id 160 to header record of job_state file
[2017-07-28T10:44:04.976] debug2: _slurm_rpc_job_ready(160)=3 usec=2
[2017-07-28T10:44:04.977] debug3: StepDesc: user_id=0 job_id=160 node_count=1-1 cpu_count=8 num_tasks=1
[2017-07-28T10:44:04.977] debug3: cpu_freq_gov=4294967294 cpu_freq_max=4294967294 cpu_freq_min=4294967294 relative=65534 task_dist=0x2000 plane=65534
[2017-07-28T10:44:04.977] debug3:    node_list=sgo2  constraints=(null)
[2017-07-28T10:44:04.977] debug3: host=GO1 port=55720 srun_pid=30819 name=hostname network=(null) exclusive=0
[2017-07-28T10:44:04.977] debug3: checkpoint-dir=/var/slurm/checkpoint checkpoint_int=0
[2017-07-28T10:44:04.977] debug3: mem_per_node=0 resv_port_cnt=65534 immediate=0 no_kill=0
[2017-07-28T10:44:04.977] debug3:    overcommit=1 time_limit=0 gres=(null)
[2017-07-28T10:44:04.977] debug3: step_layout cpus = 8 pos = 0
[2017-07-28T10:44:04.977] debug: laying out the 1 tasks on 1 hosts sgo2 dist 2
[2017-07-28T10:44:04.994] debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION from uid=0, JobId=160 rc=0
[2017-07-28T10:44:04.994] job_complete: JobID=160 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2017-07-28T10:44:04.994] job_complete: JobID=160 State=0x8003 NodeCnt=1 done
[2017-07-28T10:44:04.994] debug2: _slurm_rpc_complete_job_allocation: JobID=160 State=0x8003 NodeCnt=1
[2017-07-28T10:44:04.994] debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
[2017-07-28T10:44:04.994] debug2: got 1 threads to send out
[2017-07-28T10:44:04.994] debug3: slurm_send_only_node_msg: sent 181
[2017-07-28T10:44:04.999] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2017-07-28T10:44:04.999] debug2: got 1 threads to send out
[2017-07-28T10:44:04.999] debug2: Tree head got back 0 looking for 1
[2017-07-28T10:44:04.999] debug3: Tree sending to GO1
[2017-07-28T10:44:05.000] debug2: Tree head got back 1
[2017-07-28T10:44:05.004] debug2: node_did_resp GO1
[2017-07-28T10:44:05.005] debug:  sched: Running job scheduler
[2017-07-28T10:44:09.001] debug3: Writing job id 160 to header record of job_state file


In this log,

[2017-07-28T10:44:04.975] debug3: req_nodes=sgo2 exc_nodes=(null) gres=(null)
[2017-07-28T10:44:04.975] debug2: sched: JobId=160 allocated resources: NodeList=sgo2

I checked that req_nodes is exactly what I want, and the job was allocated NodeList=sgo2.

But the result printed is GO1, not GO2.

What happened?

-----Original Message-----
*From:* "Said Mohamed Said"<said.moha...@oist.jp>
*To:* "slurm-dev"<slurm-dev@schedmd.com>;
*Cc:*
*Sent:* 2017-07-28 (Fri) 10:16:44
*Subject:* [slurm-dev] Re: Why my slurm is running on only one node?

If you still have a problem, run a job on a specific node, e.g. [ srun --nodelist=go2 hostname ], and if the command is not successful, check the corresponding log files for errors.
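
For example, something along these lines (a sketch based on the SlurmdLogFile and SlurmctldLogFile paths in your slurm.conf; it assumes you can log in to the compute node):

# on the compute node you targeted, right after the srun attempt
tail -n 50 /var/log/slurmd.log
# on the control node
tail -n 50 /var/log/slurmctld.log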


cheers,


Said.

------------------------------------------------------------------------
*From:* Lachlan Musicman <data...@gmail.com>
*Sent:* Friday, July 28, 2017 10:02:57 AM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: Why my slurm is running on only one node?
Ok! Good, so the servers are there.

You should expect to see output from

srun -w go2 hostname

Alternatively, you should get a different hostname if you run

srun --time=0-06:00 --mem=8gb "$@" --pty -u bash -i

for instance.

Try running a stress test that requests more than one node and more CPUs than a single node has; that should show multiple nodes. Hopefully.
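
For example (a minimal sketch, assuming for illustration that Slurm sees 8 CPUs per node; scale -n to one more than the CPUs on a single node in your setup):

# 2 nodes, 9 tasks: 9 tasks cannot fit on one 8-CPU node, so a second node has to be used
srun -N2 -n9 hostname

You should then see more than one hostname in the output.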

cheers
L.




------
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics is the insistence that we cannot ignore the truth, nor should we panic about it. It is a shared consciousness that our institutions have failed and our ecosystem is collapsing, yet we are still here — and we are creative agents who can shape our destinies. Apocalyptic civics is the conviction that the only way out is through, and the only way through is together. "

/Greg Bloom/ @greggish https://twitter.com/greggish/status/873177525903609857

On 28 July 2017 at 10:57, 허웅 <hoewoongg...@naver.com> wrote:

    Here is my output of sinfo

    [root@GO1]~# sinfo -N
    NODELIST   NODES PARTITION STATE
    sgo1           1    party* idle
    sgo2           1    party* idle
    sgo3           1    party* idle
    sgo4           1    party* idle
    sgo5           1    party* idle

    [root@GO1]~# sn
    Fri Jul 28 09:55:53 2017
               HOSTNAMES
                     GO1
                     GO2
                     GO3
                     GO4
                     GO5

    -----Original Message-----
    *From:* "Lachlan Musicman"<data...@gmail.com
    <mailto:data...@gmail.com>>
    *To:* "slurm-dev"<slurm-dev@schedmd.com
    <mailto:slurm-dev@schedmd.com>>;
    *Cc:*
    *Sent:* 2017-07-28 (Fri) 09:51:40
    *Subject:* [slurm-dev] Re: Why my slurm is running on only one node?

    Also - are the nodes up and running wrt SLURM? What is the output of:
    sinfo -N

    ?
    (FWIW, I really like the alias sn='sinfo -Nle -o "%.20n %.15C %.8O %.7t" | uniq')
    cheers
    L.


    On 28 July 2017 at 10:47, Lachlan Musicman <data...@gmail.com> wrote:

        I think it's because hostname is so undemanding.
        How many CPUs does each host have?
        You may need to request ((number of CPUs per host) + 1) to see
        action on another node.
        You could try using stress-ng to generate higher loads:

        
        https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
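
        As a rough sketch (assuming stress-ng is installed on the compute nodes; tune the worker count and duration to your hardware):

        # 2 nodes, one task per node, each task running 8 CPU stress workers for a minute
        srun -N2 --ntasks-per-node=1 stress-ng --cpu 8 --timeout 60s
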
        cheers
        L.

        ------
        "The antidote to apocalypticism is *apocalyptic civics*.
        Apocalyptic civics is the insistence that we cannot ignore the
        truth, nor should we panic about it. It is a shared
        consciousness that our institutions have failed and our
        ecosystem is collapsing, yet we are still here — and we are
        creative agents who can shape our destinies. Apocalyptic
        civics is the conviction that the only way out is through, and
        the only way through is together. "

        /Greg Bloom/ @greggish
        https://twitter.com/greggish/status/873177525903609857
        <https://twitter.com/greggish/status/873177525903609857>

        On 28 July 2017 at 10:28, 허웅 <hoewoongg...@naver.com> wrote:

            I have 5 nodes, including the control node.

            My nodes look like this:

            Control Node : GO1
            Compute Nodes : GO[1-5]

            When I try to allocate a job to multiple nodes, only one
            node does the work.

            Example:

            $ srun -N5 hostname
            GO1
            GO1
            GO1
            GO1
            GO1

            even though I expected something like this:

            $ srun -N5 hostname
            GO1
            GO2
            GO3
            GO4
            GO5

            What should I do?

            Here is some of my configuration.

            $ scontrol show frontend
            FrontendName=GO1 State=IDLE Version=17.02 Reason=(null)
            BootTime=2017-06-02T20:14:39
            SlurmdStartTime=2017-07-27T16:29:46

            FrontendName=GO2 State=IDLE Version=17.02 Reason=(null)
            BootTime=2017-07-05T17:54:13
            SlurmdStartTime=2017-07-27T16:30:07

            FrontendName=GO3 State=IDLE Version=17.02 Reason=(null)
            BootTime=2017-07-05T17:22:58
            SlurmdStartTime=2017-07-27T16:30:08

            FrontendName=GO4 State=IDLE Version=17.02 Reason=(null)
            BootTime=2017-07-05T17:21:40
            SlurmdStartTime=2017-07-27T16:30:08

            FrontendName=GO5 State=IDLE Version=17.02 Reason=(null)
            BootTime=2017-07-05T17:21:39
            SlurmdStartTime=2017-07-27T16:30:09

            $ scontrol ping
            Slurmctld(primary/backup) at GO1/(NULL) are UP/DOWN

            [slurm.conf]
            # slurm.conf
            #
            # See the slurm.conf man page for more information.
            #
            ClusterName=linux
            ControlMachine=GO1
            ControlAddr=192.168.30.74
            #
            SlurmUser=slurm
            SlurmctldPort=6817
            SlurmdPort=6818
            AuthType=auth/munge
            StateSaveLocation=/var/lib/slurmd
            SlurmdSpoolDir=/var/spool/slurmd
            SwitchType=switch/none
            MpiDefault=none
            SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
            SlurmdPidFile=/var/run/slurmd/slurmd.pid
            ProctrackType=proctrack/pgid
            ReturnToService=0
            TreeWidth=50
            #
            # TIMERS
            SlurmctldTimeout=300
            SlurmdTimeout=300
            InactiveLimit=0
            MinJobAge=300
            KillWait=30
            Waittime=0
            #
            # SCHEDULING
            SchedulerType=sched/backfill
            FastSchedule=1
            #
            # LOGGING
            SlurmctldDebug=7
            SlurmctldLogFile=/var/log/slurmctld.log
            SlurmdDebug=7
            SlurmdLogFile=/var/log/slurmd.log
            JobCompType=jobcomp/none
            #
            # COMPUTE NODES
            NodeName=sgo[1-5] NodeHostName=GO[1-5]
            #NodeAddr=192.168.30.[74,141,68,70,72]
            #
            # PARTITIONS
            PartitionName=party Default=yes Nodes=ALL

