What if you manually run the hostname command on all your hosts (i.e. without using Slurm)?

Do you get the expected result?
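
For example, something along these lines from the control node (a rough sketch, assuming password-less ssh and the GO1-GO5 host names from your earlier mails; adjust to your environment):

# run hostname over plain ssh, bypassing Slurm entirely
for h in GO1 GO2 GO3 GO4 GO5; do ssh "$h" hostname; done

Each line of output should match the host you contacted; if it does not, the problem is in the hosts' hostname/network setup rather than in Slurm.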



On 7/28/2017 10:51 AM, 허웅 wrote:

I have also tried to run a job on a specific node, like this:

srun --nodelist=sgo2 hostname

and here is my slurmctld log:

[2017-07-28T10:44:04.975] debug2: sched: Processing RPC: REQUEST_RESOURCE_ALLOCATION from uid=0
[2017-07-28T10:44:04.975] debug3: JobDesc: user_id=0 job_id=N/A partition=(null) name=hostname
[2017-07-28T10:44:04.975] debug3: cpus=1-4294967294 pn_min_cpus=-1 core_spec=-1
[2017-07-28T10:44:04.975] debug3: Nodes=1-[1] Sock/Node=65534 Core/Sock=65534 Thread/Core=65534
[2017-07-28T10:44:04.975] debug3: pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
[2017-07-28T10:44:04.975] debug3: immediate=0 features=(null) reservation=(null)
[2017-07-28T10:44:04.975] debug3: req_nodes=sgo2 exc_nodes=(null) gres=(null)
[2017-07-28T10:44:04.975] debug3: time_limit=-1--1 priority=-1 contiguous=0 shared=-1
[2017-07-28T10:44:04.975] debug3:    kill_on_node_fail=-1 script=(null)
[2017-07-28T10:44:04.975] debug3:    argv="hostname"
[2017-07-28T10:44:04.975] debug3: stdin=(null) stdout=(null) stderr=(null)
[2017-07-28T10:44:04.975] debug3: work_dir=/lustre alloc_node:sid=GO1:31523
[2017-07-28T10:44:04.975] debug3:    power_flags=
[2017-07-28T10:44:04.975] debug3: resp_host=192.168.30.74 alloc_resp_port=57496 other_port=46302
[2017-07-28T10:44:04.975] debug3: dependency=(null) account=(null) qos=(null) comment=(null)
[2017-07-28T10:44:04.975] debug3: mail_type=0 mail_user=(null) nice=0 num_tasks=-1 open_mode=0 overcommit=-1 acctg_freq=(null)
[2017-07-28T10:44:04.975] debug3: network=(null) begin=Unknown cpus_per_task=-1 requeue=-1 licenses=(null)
[2017-07-28T10:44:04.975] debug3: end_time= signal=0@0 wait_all_nodes=1 cpu_freq=
[2017-07-28T10:44:04.975] debug3: ntasks_per_node=-1 ntasks_per_socket=-1 ntasks_per_core=-1
[2017-07-28T10:44:04.975] debug3: mem_bind=65534:(null) plane_size:65534
[2017-07-28T10:44:04.975] debug3:    array_inx=(null)
[2017-07-28T10:44:04.975] debug3:    burst_buffer=(null)
[2017-07-28T10:44:04.975] debug3:    mcs_label=(null)
[2017-07-28T10:44:04.975] debug3:    deadline=Unknown
[2017-07-28T10:44:04.975] debug3:    bitflags=0 delay_boot=4294967294
[2017-07-28T10:44:04.975] debug3: before alteration asking for nodes 1-1 cpus 1-4294967294
[2017-07-28T10:44:04.975] debug3: after alteration asking for nodes 1-1 cpus 1-4294967294
[2017-07-28T10:44:04.975] debug2: found 5 usable nodes from config containing sgo[1-5]
[2017-07-28T10:44:04.975] debug3: _pick_best_nodes: job 160 idle_nodes 5 share_nodes 5
[2017-07-28T10:44:04.975] debug2: sched: JobId=160 allocated resources: NodeList=sgo2
[2017-07-28T10:44:04.975] sched: _slurm_rpc_allocate_resources JobId=160 NodeList=sgo2 usec=517
[2017-07-28T10:44:04.975] debug3: Writing job id 160 to header record of job_state file
[2017-07-28T10:44:04.976] debug2: _slurm_rpc_job_ready(160)=3 usec=2
[2017-07-28T10:44:04.977] debug3: StepDesc: user_id=0 job_id=160 node_count=1-1 cpu_count=8 num_tasks=1
[2017-07-28T10:44:04.977] debug3: cpu_freq_gov=4294967294 cpu_freq_max=4294967294 cpu_freq_min=4294967294 relative=65534 task_dist=0x2000 plane=65534
[2017-07-28T10:44:04.977] debug3:    node_list=sgo2  constraints=(null)
[2017-07-28T10:44:04.977] debug3: host=GO1 port=55720 srun_pid=30819 name=hostname network=(null) exclusive=0
[2017-07-28T10:44:04.977] debug3: checkpoint-dir=/var/slurm/checkpoint checkpoint_int=0
[2017-07-28T10:44:04.977] debug3: mem_per_node=0 resv_port_cnt=65534 immediate=0 no_kill=0
[2017-07-28T10:44:04.977] debug3:    overcommit=1 time_limit=0 gres=(null)
[2017-07-28T10:44:04.977] debug3: step_layout cpus = 8 pos = 0
[2017-07-28T10:44:04.977] debug: laying out the 1 tasks on 1 hosts sgo2 dist 2
[2017-07-28T10:44:04.994] debug2: Processing RPC: REQUEST_COMPLETE_JOB_ALLOCATION from uid=0, JobId=160 rc=0
[2017-07-28T10:44:04.994] job_complete: JobID=160 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2017-07-28T10:44:04.994] job_complete: JobID=160 State=0x8003 NodeCnt=1 done
[2017-07-28T10:44:04.994] debug2: _slurm_rpc_complete_job_allocation: JobID=160 State=0x8003 NodeCnt=1
[2017-07-28T10:44:04.994] debug2: Spawning RPC agent for msg_type SRUN_JOB_COMPLETE
[2017-07-28T10:44:04.994] debug2: got 1 threads to send out
[2017-07-28T10:44:04.994] debug3: slurm_send_only_node_msg: sent 181
[2017-07-28T10:44:04.999] debug2: Spawning RPC agent for msg_type REQUEST_TERMINATE_JOB
[2017-07-28T10:44:04.999] debug2: got 1 threads to send out
[2017-07-28T10:44:04.999] debug2: Tree head got back 0 looking for 1
[2017-07-28T10:44:04.999] debug3: Tree sending to GO1
[2017-07-28T10:44:05.000] debug2: Tree head got back 1
[2017-07-28T10:44:05.004] debug2: node_did_resp GO1
[2017-07-28T10:44:05.005] debug:  sched: Running job scheduler
[2017-07-28T10:44:09.001] debug3: Writing job id 160 to header record of job_state file


In this log,

[2017-07-28T10:44:04.975] debug3: req_nodes=sgo2 exc_nodes=(null) gres=(null)
[2017-07-28T10:44:04.975] debug2: sched: JobId=160 allocated resources: NodeList=sgo2

I checked that req_nodes is exactly what I want, and the job was allocated NodeList=sgo2.

But the result printed is GO1, not GO2.

What happened?

-----Original Message-----
*From:* "Said Mohamed Said"<said.moha...@oist.jp>
*To:* "slurm-dev"<slurm-dev@schedmd.com>;
*Cc:*
*Sent:* 2017-07-28 (Fri) 10:16:44
*Subject:* [slurm-dev] Re: Why my slurm is running on only one node?

If you still have a problem, run a job on a specific node, e.g. [ srun --nodelist=go2 hostname ], and if the command is not successful, check the corresponding log files for errors.
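
For example, something along these lines (a sketch based on the SlurmdLogFile and SlurmctldLogFile paths in your slurm.conf; it assumes you can log in to the compute node):

# on the compute node you targeted, right after the srun attempt
tail -n 50 /var/log/slurmd.log
# on the control node
tail -n 50 /var/log/slurmctld.log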


cheers,


Said.

------------------------------------------------------------------------
*From:* Lachlan Musicman <data...@gmail.com>
*Sent:* Friday, July 28, 2017 10:02:57 AM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: Why my slurm is running on only one node?
Ok! Good, so the servers are there.

You should expect to see output from

srun -w go2 hostname

Alternatively, you should get a different hostname if you run

srun --time=0-06:00 --mem=8gb "$@" --pty -u bash -i

for instance.

Try running a stress test that requests more than one node and more CPUs than a single node has; that should show multiple nodes. Hopefully.
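
For example (a minimal sketch, assuming for illustration that Slurm sees 8 CPUs per node; scale -n to one more than the CPUs on a single node in your setup):

# 2 nodes, 9 tasks: 9 tasks cannot fit on one 8-CPU node, so a second node has to be used
srun -N2 -n9 hostname

You should then see more than one hostname in the output.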

cheers
L.




------
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics is the insistence that we cannot ignore the truth, nor should we panic about it. It is a shared consciousness that our institutions have failed and our ecosystem is collapsing, yet we are still here — and we are creative agents who can shape our destinies. Apocalyptic civics is the conviction that the only way out is through, and the only way through is together. "

/Greg Bloom/ @greggish https://twitter.com/greggish/status/873177525903609857

On 28 July 2017 at 10:57, 허웅 <hoewoongg...@naver.com> wrote:

    Here is my output of sinfo

    [root@GO1]~# sinfo -N
    NODELIST   NODES PARTITION STATE
    sgo1           1    party* idle
    sgo2           1    party* idle
    sgo3           1    party* idle
    sgo4           1    party* idle
    sgo5           1    party* idle

    [root@GO1]~# sn
    Fri Jul 28 09:55:53 2017
               HOSTNAMES
                     GO1
                     GO2
                     GO3
                     GO4
                     GO5

    -----Original Message-----
    *From:* "Lachlan Musicman"<data...@gmail.com
    <mailto:data...@gmail.com>>
    *To:* "slurm-dev"<slurm-dev@schedmd.com
    <mailto:slurm-dev@schedmd.com>>;
    *Cc:*
    *Sent:* 2017-07-28 (Fri) 09:51:40
    *Subject:* [slurm-dev] Re: Why my slurm is running on only one node?

    Also - are the nodes up and running wrt SLURM? What is the output of:
    sinfo -N

    ?
    (FWIW, I really like the alias sn='sinfo -Nle -o "%.20n %.15C %.8O %.7t" | uniq')
    cheers
    L.


    On 28 July 2017 at 10:47, Lachlan Musicman <data...@gmail.com> wrote:

        I think it's because hostname is so undemanding.
        How many CPUs does each host have?
        You may need to request ((number of CPUs per host) + 1) to see
        action on another node.
        You could try using stress-ng to generate higher loads:

        
        https://www.cyberciti.biz/faq/stress-test-linux-unix-server-with-stress-ng/
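
        As a rough sketch (assuming stress-ng is installed on the compute nodes; tune the worker count and duration to your hardware):

        # 2 nodes, one task per node, each task running 8 CPU stress workers for a minute
        srun -N2 --ntasks-per-node=1 stress-ng --cpu 8 --timeout 60s
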
        cheers
        L.

        ------
        "The antidote to apocalypticism is *apocalyptic civics*.
        Apocalyptic civics is the insistence that we cannot ignore the
        truth, nor should we panic about it. It is a shared
        consciousness that our institutions have failed and our
        ecosystem is collapsing, yet we are still here — and we are
        creative agents who can shape our destinies. Apocalyptic
        civics is the conviction that the only way out is through, and
        the only way through is together. "

        /Greg Bloom/ @greggish
        https://twitter.com/greggish/status/873177525903609857
        <https://twitter.com/greggish/status/873177525903609857>

        On 28 July 2017 at 10:28, 허웅 <hoewoongg...@naver.com> wrote:

            I have 5 nodes, including the control node.

            My nodes look like this:

            Control Node : GO1
            Compute Nodes : GO[1-5]

            When I try to allocate a job to multiple nodes, only one
            node does the work.

            Example:

            $ srun -N5 hostname
            GO1
            GO1
            GO1
            GO1
            GO1

            even though I expected something like this:

            $ srun -N5 hostname
            GO1
            GO2
            GO3
            GO4
            GO5

            What should I do?

            Here is some of my configuration.

            $ scontrol show frontend
            FrontendName=GO1 State=IDLE Version=17.02 Reason=(null)
            BootTime=2017-06-02T20:14:39
            SlurmdStartTime=2017-07-27T16:29:46

            FrontendName=GO2 State=IDLE Version=17.02 Reason=(null)
            BootTime=2017-07-05T17:54:13
            SlurmdStartTime=2017-07-27T16:30:07

            FrontendName=GO3 State=IDLE Version=17.02 Reason=(null)
            BootTime=2017-07-05T17:22:58
            SlurmdStartTime=2017-07-27T16:30:08

            FrontendName=GO4 State=IDLE Version=17.02 Reason=(null)
            BootTime=2017-07-05T17:21:40
            SlurmdStartTime=2017-07-27T16:30:08

            FrontendName=GO5 State=IDLE Version=17.02 Reason=(null)
            BootTime=2017-07-05T17:21:39
            SlurmdStartTime=2017-07-27T16:30:09

            $ scontrol ping
            Slurmctld(primary/backup) at GO1/(NULL) are UP/DOWN

            [slurm.conf]
            # slurm.conf
            #
            # See the slurm.conf man page for more information.
            #
            ClusterName=linux
            ControlMachine=GO1
            ControlAddr=192.168.30.74
            #
            SlurmUser=slurm
            SlurmctldPort=6817
            SlurmdPort=6818
            AuthType=auth/munge
            StateSaveLocation=/var/lib/slurmd
            SlurmdSpoolDir=/var/spool/slurmd
            SwitchType=switch/none
            MpiDefault=none
            SlurmctldPidFile=/var/run/slurmd/slurmctld.pid
            SlurmdPidFile=/var/run/slurmd/slurmd.pid
            ProctrackType=proctrack/pgid
            ReturnToService=0
            TreeWidth=50
            #
            # TIMERS
            SlurmctldTimeout=300
            SlurmdTimeout=300
            InactiveLimit=0
            MinJobAge=300
            KillWait=30
            Waittime=0
            #
            # SCHEDULING
            SchedulerType=sched/backfill
            FastSchedule=1
            #
            # LOGGING
            SlurmctldDebug=7
            SlurmctldLogFile=/var/log/slurmctld.log
            SlurmdDebug=7
            SlurmdLogFile=/var/log/slurmd.log
            JobCompType=jobcomp/none
            #
            # COMPUTE NODES
            NodeName=sgo[1-5] NodeHostName=GO[1-5]
            #NodeAddr=192.168.30.[74,141,68,70,72]
            #
            # PARTITIONS
            PartitionName=party Default=yes Nodes=ALL

