Our setup is similar: slurmctld is a gigE-only VM, while our nodes have
both gigE and IB.  For our IB nodes, when we run things like OpenMPI, we
set this before the srun command to force IB usage:

export OMPI_MCA_btl="self,openib"

We also exclude specific interfaces from being used by OpenMPI by setting
"btl_tcp_if_exclude = lo,eth0" in
$OPENMPI_PREFIX/etc/openmpi-mca-params.conf.  Our systems all use eth1 for
their non-IB traffic.  You may be able to set something like "btl =
self,openib" in openmpi-mca-params.conf if all your nodes have IB.

Checking the ulimits from within a job environment may also shed some light
on the problem.
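
For instance, something like this (run locally it shows the submitting
shell's limits; inside a job you would wrap it with srun as in the
comment):

```shell
# Print the three limits that matter most for IB and MPI jobs.
# From a job environment you would run:  srun -N1 bash -c 'ulimit -a'
echo "max locked memory (kB): $(ulimit -l)"
echo "open files:             $(ulimit -n)"
echo "stack size (kB):        $(ulimit -s)"
```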

We set the following limits in /etc/sysconfig/slurm, which is sourced by
the slurm init.d script used on our nodes.

ulimit -l unlimited
ulimit -n 8192
ulimit -s unlimited

The "-l" value is to support IB and "-s" has been to support some MPI
applications that would segfault without it.  The largest MPI jobs we
typically see are 240 tasks (30 nodes , 8 CPUs/node) to 256 tasks (8 nodes,
32 CPUs/node).

- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Wed, Sep 23, 2015 at 8:05 PM, Timothy Brown <[email protected]
> wrote:

> I tried a couple more things this afternoon.
>
> I ran a 250-node job (12 tasks per node), but before running srun I set
> PMI_TIME=4000.  This is the error I received:
>
> size = 3000, rank = 2853
> size = 3000, rank = 1853
> size = 3000, rank = 2353
> size = 3000, rank = 853
> srun: error: timeout waiting for task launch, started 2976 of 3000 tasks
> srun: Job step 977359.1 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
>
> So, digging more into this, I figured out the system topology.
>
> slurm[1,2] are on a VM, connected with gigE.
> nodes have gigE and IB (their hostnames are nodeXXXX, nodeXXXXib).
> Slurm allocates jobs ($SLURM_NODELIST) to nodeXXXX.
>
> A tcpdump of the job startup shows it's all going over eth0, not ib0.
>
> Could this possibly be it?
>
> In looking at Stampede (the only other large Slurm deployment I have
> access to) it seems everything is over IB.
> Also, comparing our /proc/sys/net (sysctl) settings, we are almost identical.
>
> Thoughts? Comments?
>
> Thanks
> Timothy
>
> > On Sep 23, 2015, at 12:14 PM, Timothy Brown <[email protected]>
> wrote:
> >
> > Hi Chansup,
> >
> > Yes, that's way up there too:
> >
> > node0202 ~$ ulimit -a
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 185698
> > max locked memory       (kbytes, -l) unlimited
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 1048576
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 2048000
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) 399360
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> >
> > Interestingly enough, I wrote an ugly-hack bit of code to open sockets
> in C; it just
> > - socket()
> > - bind()
> > - listen()
> > Nothing is ever read/written and I am able to open 28226 sockets on a
> node.
> >
> > I dare say the software stack looks good. Could it possibly be hardware
> related? Bad IB or???? (I'm starting to grasp at straws here!).
> >
> > Thanks
> > Timothy
> >
> >> On Sep 23, 2015, at 11:41 AM, CB <[email protected]> wrote:
> >>
> >> Hi Tim,
> >>
> >> I'm not sure if you've checked the "ulimit -n" value for the user who
> runs the job.
> >> In my experience, I had to bump up the limit much higher than the
> default 1024.
> >>
> >> Just my 2 cents,
> >> - Chansup
> >>
> >> On Wed, Sep 23, 2015 at 9:54 AM, Timothy Brown <
> [email protected]> wrote:
> >> Hi Moe and Antony,
> >>
> >> Thanks for the link. On further thinking, I think you're right in saying
> it's on the Linux network side. In looking at our system we have:
> >> /proc/sys/fs/file-max: 2346778
> >> /proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
> >> /proc/sys/net/core/somaxconn: 128
> >>
> >> So we bumped somaxconn up to 2048 across the cluster.
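
For reference, the current backlog limits can be read back without root;
the persistence step in the comments is only a sketch, and 2048 is the
value chosen in this thread, not a general recommendation:

```shell
# Read the current listen-backlog limits (world-readable on any Linux box).
cat /proc/sys/net/core/somaxconn
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
# To persist a larger value you would, as root, add
#   net.core.somaxconn = 2048
# to /etc/sysctl.conf and run `sysctl -p`.
```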
> >>
> >> Of course we are in production so I had to wait my (slightly
> prioritized) time in the queue.
> >>
> >> However the jobs still fail with the same error:
> >>
> >> srun: error: Task launch for 976058.0 failed on node node1505: Socket
> timed out on send/recv operation
> >>
> >> Out of the 500 nodes, we get this error (in this case) for 158 nodes
> (in case numbers are helpful).
> >>
> >> The only other idea I have is related to total TCP memory; we currently
> have it set to:
> >>
> >> /proc/sys/net/ipv4/tcp_mem
> >> 2228352 2971136 4456704
> >>
> >> Which I interpret approximately as 8G, 11G and 17G, while each node has
> a total of 24G of ram. So I'm thinking these values are ok. However looking
> at other clusters (Stampede) it's set to:
> >>
> >> c560-204.stampede(2)$ cat /proc/sys/net/ipv4/tcp_mem
> >> 16777216        16777216        16777216
> >>
> >> Which I interpret as 64G (wow!) while they have 32G per node. So am I
> interpreting tcp_mem incorrectly?
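
For reference, tcp_mem is counted in pages (4 KiB on x86_64), not bytes,
so the rough interpretation above checks out; a quick conversion of the
numbers in this thread:

```shell
# tcp_mem values are in pages (4 KiB on x86_64), not bytes.
# Convert the values from this thread to (integer) GiB.
for pages in 2228352 2971136 4456704 16777216; do
    echo "$pages pages = $(( pages * 4 / 1024 / 1024 )) GiB"
done
# -> 8, 11, 17 and 64 GiB respectively
```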
> >>
> >> I'm currently waiting in the queue again, but will try this 16777216
> for all values in tcp_mem when I get a job running.
> >>
> >> We have a txqueuelen of 1024 for the IB interfaces and I don't want to
> touch that.
> >>
> >> Just about everything else I check in proc regarding the network seems
> ok.
> >>
> >> Does anybody have any further thoughts or pointers? Thanks!
> >>
> >> Timothy
> >>
> >>
> >>> On Sep 22, 2015, at 8:57 AM, Moe Jette <[email protected]> wrote:
> >>>
> >>>
> >>> I suspect that you are hitting some Linux system limit, such as open
> files, or socket backlog. For information on how to address, see:
> >>> http://slurm.schedmd.com/big_sys.html
> >>>
> >>>
> >>> Quoting Timothy Brown <[email protected]>:
> >>>
> >>>> Hi Moe,
> >>>>
> >>>>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
> >>>>>
> >>>>>
> >>>>> What version of Slurm?
> >>>>
> >>>> We're currently running 14.11.7
> >>>>
> >>>>> How many tasks/ranks in your job?
> >>>>
> >>>> I've been trying 500 nodes with 12 tasks per node, giving a total of
> 6000. After this failed, I started fiddling with fewer (100 nodes, then
> ramping up: 200, 300, 400 ...). It seems anything over 300 is touch and go.
> >>>>
> >>>>
> >>>>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
> >>>>
> >>>> Not reliably.
> >>>>
> >>>> $ cat hostname.sh
> >>>> #!/bin/bash
> >>>> #
> >>>> #SBATCH --job-name=OSU_Int
> >>>> #SBATCH --qos=admin
> >>>> #SBATCH --time=00:15:00
> >>>> #SBATCH --nodes=500
> >>>> #SBATCH --ntasks-per-node=12
> >>>> #SBATCH --account=crcbenchmark
> >>>> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
> >>>>
> >>>>
> >>>> srun hostname
> >>>>
> >>>> $ sbatch hostname.sh
> >>>> Submitted batch job 976034
> >>>> $ wc -l hostname_976034.txt
> >>>> 5992 hostname_976034.txt
> >>>> $ grep -v ^node hostname_976034.txt
> >>>> srun: error: Task launch for 976034.0 failed on node node0453: Socket
> timed out on send/recv operation
> >>>> srun: error: Application launch failed: Socket timed out on send/recv
> operation
> >>>> srun: Job step aborted: Waiting up to 32 seconds for job step to
> finish.
> >>>> srun: error: Timed out waiting for job step to complete
> >>>>
> >>>> Any thoughts?
> >>>>
> >>>> Thanks
> >>>> Timothy
> >>>>
> >>>>>
> >>>>> Quoting Ralph Castain <[email protected]>:
> >>>>>> This sounds like something in Slurm - I don’t know how srun would
> know to emit a message if the app was failing to open a socket between its
> own procs.
> >>>>>>
> >>>>>> Try starting the OMPI job with “mpirun” instead of srun and see if
> it has the same issue. If not, then that’s pretty convincing that it’s
> slurm.
> >>>>>>
> >>>>>>
> >>>>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown
> <[email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi Chris,
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <
> [email protected]> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 22/09/15 07:17, Timothy Brown wrote:
> >>>>>>>>
> >>>>>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
> >>>>>>>>
> >>>>>>>> Have you tried Intel MPI's native PMI start up mode?
> >>>>>>>>
> >>>>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY
> to the
> >>>>>>>> path to the Slurm libpmi.so file and then you should be able to
> use srun
> >>>>>>>> to launch your job instead.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Yeap, to the same effect. Here's what it gives:
> >>>>>>>
> >>>>>>> srun --mpi=pmi2
> /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
> >>>>>>> srun: error: Task launch for 973564.0 failed on node node0453:
> Socket timed out on send/recv operation
> >>>>>>> srun: error: Application launch failed: Socket timed out on
> send/recv operation
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> More here:
> >>>>>>>>
> >>>>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
> >>>>>>>>
> >>>>>>>>> If I switch to OpenMPI the error is:
> >>>>>>>>
> >>>>>>>> Which version, and was it build with --with-slurm and (if you're
> >>>>>>>> version is not too ancient) --with-pmi=/path/to/slurm/install ?
> >>>>>>>
> >>>>>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to
> EasyBuild). Yes we included PMI and the Slurm option. Our configure
> statement was:
> >>>>>>>
> >>>>>>> module purge
> >>>>>>> module load slurm/slurm
> >>>>>>> module load gcc/5.1.0
> >>>>>>> ./configure  \
> >>>>>>> --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
> >>>>>>> --with-threads=posix \
> >>>>>>> --enable-mpi-thread-multiple \
> >>>>>>> --with-slurm \
> >>>>>>> --with-pmi=/curc/slurm/slurm/current/ \
> >>>>>>> --enable-static \
> >>>>>>> --enable-wrapper-rpath \
> >>>>>>> --enable-sensors \
> >>>>>>> --enable-mpi-ext=all \
> >>>>>>> --with-verbs
> >>>>>>>
> >>>>>>> It's got me scratching my head, as I started off thinking it was
> an MPI issue, and spent a while getting Intel's hydra and OpenMPI's oob to go
> over IB instead of gig-e. This increased the success rate, but we were
> still failing.
> >>>>>>>
> >>>>>>> Tried out a pure PMI (version 1) code (init, rank, size, fini),
> which worked most of the time. That made me think it was MPI again!
> However that fails enough to say it's not MPI. The PMI v2 code I wrote,
> gives the wrong results for rank and world size, so I'm sweeping that under
> the rug until I understand it!
> >>>>>>>
> >>>>>>> Just wondering if anybody has seen anything like this. Am happy to
> share our conf file if that helps.
> >>>>>>>
> >>>>>>> The only other thing I could possibly point a finger at (but don't
> believe it is), is that the slurm masters (slurmctld) are only on gig-E.
> >>>>>>>
> >>>>>>> I'm half thinking of opening a TT, but was hoping to get more
> information (and possibly not increase the logging of slurm, which is my
> only next idea).
> >>>>>>>
> >>>>>>> Thanks for your thoughts Chris.
> >>>>>>>
> >>>>>>> Timothy
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Morris "Moe" Jette
> >>>>> CTO, SchedMD LLC
> >>>>> Commercial Slurm Development and Support
> >>>
> >>>
> >>> --
> >>> Morris "Moe" Jette
> >>> CTO, SchedMD LLC
> >>> Commercial Slurm Development and Support
> >>
> >>
> >
>
>
