Our setup is similar: slurmctld is a gigE-only VM, while our nodes have gigE + IB. For our IB nodes, when we run things like OpenMPI, we use this before the srun command to force IB usage:
export OMPI_MCA_btl="self,openib"

We also exclude specific interfaces from being used by OpenMPI by setting
"btl_tcp_if_exclude = lo,eth0" in $OPENMPI_PREFIX/etc/openmpi-mca-params.conf.
Our systems all use eth1 for their non-IB traffic. You may be able to set
something like "btl = self,openib" in openmpi-mca-params.conf if all your
nodes have IB.

Checking the ulimits from within a job environment may also shed some light
on the problem. We set the following limits in /etc/sysconfig/slurm, which is
sourced by the slurm init.d script used on our nodes:

ulimit -l unlimited
ulimit -n 8192
ulimit -s unlimited

The "-l" value is to support IB, and "-s" has been to support some MPI
applications that would segfault without it. The largest MPI jobs we
typically see are 240 tasks (30 nodes, 8 CPUs/node) to 256 tasks
(8 nodes, 32 CPUs/node).

- Trey

=============================
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Wed, Sep 23, 2015 at 8:05 PM, Timothy Brown <[email protected]> wrote:
> I tried a couple more things this afternoon.
>
> I ran a 250 node job (12 tasks per node); before running srun, however, I
> set PMI_TIME=4000. This is the error I received:
>
> size = 3000, rank = 2853
> size = 3000, rank = 1853
> size = 3000, rank = 2353
> size = 3000, rank = 853
> srun: error: timeout waiting for task launch, started 2976 of 3000 tasks
> srun: Job step 977359.1 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> So, digging more into this, I figured out the system topology:
>
> slurm[1,2] are on a VM, connected with gigE.
> nodes have gigE and IB (their hostnames are nodeXXXX, nodeXXXXib).
> Slurm allocates jobs ($SLURM_NODELIST) to nodeXXXX.
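The settings Trey describes above, gathered into one sketch for reference (the grouping into a single snippet is the editor's; paths assume a standard OpenMPI install prefix and a Red Hat-style init script, as in the thread):

```shell
# Sketch only: Trey's settings collected in one place.

# --- $OPENMPI_PREFIX/etc/openmpi-mca-params.conf ---
# btl = self,openib              # only if every node has IB
# btl_tcp_if_exclude = lo,eth0   # keep OMPI TCP traffic off lo/eth0

# --- /etc/sysconfig/slurm (sourced by the slurm init.d script) ---
ulimit -l unlimited   # memlock: needed for IB registered memory
ulimit -n 8192        # open files
ulimit -s unlimited   # stack: some MPI apps segfault without it
```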
>
> A tcpdump of the job startup shows it's all going over eth0, not ib0.
>
> Could this possibly be it?
>
> In looking at Stampede (the only other large Slurm deployment I have
> access to), it seems everything is over IB.
> Also, in comparing our /proc/sys/net (sysctl) settings, we are almost
> identical.
>
> Thoughts? Comments?
>
> Thanks
> Timothy
>
> > On Sep 23, 2015, at 12:14 PM, Timothy Brown <[email protected]> wrote:
> >
> > Hi Chansup,
> >
> > Yes, that's way up there too:
> >
> > node0202 ~$ ulimit -a
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 185698
> > max locked memory       (kbytes, -l) unlimited
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 1048576
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 2048000
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) 399360
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> > Interestingly enough, I wrote an (ugly hack) bit of C code to open
> > sockets; it just calls
> > - socket()
> > - bind()
> > - listen()
> > Nothing is ever read/written, and I am able to open 28226 sockets on a
> > node.
> >
> > I dare say the software stack looks good. Could it possibly be hardware
> > related? Bad IB or???? (I'm starting to grasp at straws here!)
> >
> > Thanks
> > Timothy
> >
> >> On Sep 23, 2015, at 11:41 AM, CB <[email protected]> wrote:
> >>
> >> Hi Tim,
> >>
> >> I'm not sure if you've checked the "ulimit -n" value for the user who
> >> runs the job.
> >> In my experience, I had to bump up the limit much higher than the
> >> default 1024.
> >>
> >> Just my 2 cents,
> >> - Chansup
> >>
> >> On Wed, Sep 23, 2015 at 9:54 AM, Timothy Brown <[email protected]> wrote:
> >> Hi Moe and Antony,
> >>
> >> Thanks for the link.
> >> On further thinking, I think you're right in saying it's on the Linux
> >> network side. In looking at our system, we have:
> >> /proc/sys/fs/file-max: 2346778
> >> /proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
> >> /proc/sys/net/core/somaxconn: 128
> >>
> >> So we bumped somaxconn up to 2048 across the cluster.
> >>
> >> Of course, we are in production, so I had to wait my (slightly
> >> prioritized) time in the queue.
> >>
> >> However, the jobs still fail with the same error:
> >>
> >> srun: error: Task launch for 976058.0 failed on node node1505: Socket timed out on send/recv operation
> >>
> >> Out of the 500 nodes, we get this error (in this case) for 158 nodes
> >> (in case the numbers are helpful).
> >>
> >> The only other idea I have is related to total TCP memory, which we
> >> currently have set to:
> >>
> >> /proc/sys/net/ipv4/tcp_mem
> >> 2228352 2971136 4456704
> >>
> >> which I interpret approximately as 8G, 11G and 17G, while each node has
> >> a total of 24G of RAM. So I'm thinking these values are ok. However,
> >> looking at other clusters (Stampede), it's set to:
> >>
> >> c560-204.stampede(2)$ cat /proc/sys/net/ipv4/tcp_mem
> >> 16777216 16777216 16777216
> >>
> >> which I interpret as 64G (wow!), while they have 32G per node. So am I
> >> interpreting tcp_mem incorrectly?
> >>
> >> I'm currently waiting in the queue again, but will try 16777216 for all
> >> values in tcp_mem when I get a job running.
> >>
> >> We have a txqueuelen of 1024 for the IB interfaces, and I don't want to
> >> touch that.
> >>
> >> Just about everything else I check in /proc regarding the network seems
> >> ok.
> >>
> >> Does anybody have any further thoughts or pointers? Thanks!
> >>
> >> Timothy
> >>
> >>> On Sep 22, 2015, at 8:57 AM, Moe Jette <[email protected]> wrote:
> >>>
> >>> I suspect that you are hitting some Linux system limit, such as open
> >>> files or socket backlog.
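The tcp_mem reading in the message above can be checked with a little arithmetic: the three thresholds are counted in pages, not bytes, so the conversion is pages times the page size (4 KiB assumed here, the usual x86_64 size). A quick sketch:

```shell
# tcp_mem thresholds are in pages; convert to GiB assuming 4 KiB pages.
# The first three values are the cluster's current settings; the last is
# Stampede's. Integer division truncates, matching the rough "8G, 11G, 17G"
# and "64G" readings in the thread.
for pages in 2228352 2971136 4456704 16777216; do
    echo "$pages pages = $((pages * 4096 / 1073741824)) GiB"
done
```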
> >>> For information on how to address this, see:
> >>> http://slurm.schedmd.com/big_sys.html
> >>>
> >>> Quoting Timothy Brown <[email protected]>:
> >>>
> >>>> Hi Moe,
> >>>>
> >>>>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
> >>>>>
> >>>>> What version of Slurm?
> >>>>
> >>>> We're currently running 14.11.7.
> >>>>
> >>>>> How many tasks/ranks in your job?
> >>>>
> >>>> I've been trying 500 nodes with 12 tasks per node, giving a total of
> >>>> 6000. After this failed, I started fiddling with less (100 nodes and
> >>>> ramping up: 200, 300, 400...). It seems anything over 300 is touch
> >>>> and go.
> >>>>
> >>>>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
> >>>>
> >>>> Not reliably.
> >>>>
> >>>> $ cat hostname.sh
> >>>> #!/bin/bash
> >>>> #
> >>>> #SBATCH --job-name=OSU_Int
> >>>> #SBATCH --qos=admin
> >>>> #SBATCH --time=00:15:00
> >>>> #SBATCH --nodes=500
> >>>> #SBATCH --ntasks-per-node=12
> >>>> #SBATCH --account=crcbenchmark
> >>>> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
> >>>>
> >>>> srun hostname
> >>>>
> >>>> $ sbatch hostname.sh
> >>>> Submitted batch job 976034
> >>>> $ wc -l hostname_976034.txt
> >>>> 5992 hostname_976034.txt
> >>>> $ grep -v ^node hostname_976034.txt
> >>>> srun: error: Task launch for 976034.0 failed on node node0453: Socket timed out on send/recv operation
> >>>> srun: error: Application launch failed: Socket timed out on send/recv operation
> >>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> >>>> srun: error: Timed out waiting for job step to complete
> >>>>
> >>>> Any thoughts?
> >>>>
> >>>> Thanks
> >>>> Timothy
> >>>>
> >>>>> Quoting Ralph Castain <[email protected]>:
> >>>>>> This sounds like something in Slurm - I don't know how srun would
> >>>>>> know to emit a message if the app was failing to open a socket
> >>>>>> between its own procs.
> >>>>>>
> >>>>>> Try starting the OMPI job with "mpirun" instead of srun and see if
> >>>>>> it has the same issue. If not, then that's pretty convincing that
> >>>>>> it's Slurm.
> >>>>>>
> >>>>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hi Chris,
> >>>>>>>
> >>>>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 22/09/15 07:17, Timothy Brown wrote:
> >>>>>>>>
> >>>>>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
> >>>>>>>>
> >>>>>>>> Have you tried Intel MPI's native PMI start-up mode?
> >>>>>>>>
> >>>>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY
> >>>>>>>> to the path to the Slurm libpmi.so file, and then you should be
> >>>>>>>> able to use srun to launch your job instead.
> >>>>>>>
> >>>>>>> Yeap, to the same effect. Here's what it gives:
> >>>>>>>
> >>>>>>> srun --mpi=pmi2 /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
> >>>>>>> srun: error: Task launch for 973564.0 failed on node node0453: Socket timed out on send/recv operation
> >>>>>>> srun: error: Application launch failed: Socket timed out on send/recv operation
> >>>>>>>
> >>>>>>>> More here:
> >>>>>>>>
> >>>>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
> >>>>>>>>
> >>>>>>>>> If I switch to OpenMPI the error is:
> >>>>>>>>
> >>>>>>>> Which version, and was it built with --with-slurm and (if your
> >>>>>>>> version is not too ancient) --with-pmi=/path/to/slurm/install?
> >>>>>>>
> >>>>>>> Yeap, 1.8.5 (for 1.10 we're going to try and move everything to
> >>>>>>> EasyBuild). Yes, we included PMI and the Slurm option.
> >>>>>>> Our configure statement was:
> >>>>>>>
> >>>>>>> module purge
> >>>>>>> module load slurm/slurm
> >>>>>>> module load gcc/5.1.0
> >>>>>>> ./configure \
> >>>>>>> --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
> >>>>>>> --with-threads=posix \
> >>>>>>> --enable-mpi-thread-multiple \
> >>>>>>> --with-slurm \
> >>>>>>> --with-pmi=/curc/slurm/slurm/current/ \
> >>>>>>> --enable-static \
> >>>>>>> --enable-wrapper-rpath \
> >>>>>>> --enable-sensors \
> >>>>>>> --enable-mpi-ext=all \
> >>>>>>> --with-verbs
> >>>>>>>
> >>>>>>> It's got me scratching my head, as I started off thinking it was
> >>>>>>> an MPI issue, and spent a while getting Intel's hydra and OpenMPI's
> >>>>>>> oob to go over IB instead of gigE. This increased the success rate,
> >>>>>>> but we were still failing.
> >>>>>>>
> >>>>>>> I tried out a pure PMI (version 1) code (init, rank, size, fini),
> >>>>>>> which worked a lot of the time. That made me think it was MPI
> >>>>>>> again! However, it fails often enough to say it's not MPI. The PMI
> >>>>>>> v2 code I wrote gives the wrong results for rank and world size,
> >>>>>>> so I'm sweeping that under the rug until I understand it!
> >>>>>>>
> >>>>>>> Just wondering if anybody has seen anything like this. I'm happy
> >>>>>>> to share our conf file if that helps.
> >>>>>>>
> >>>>>>> The only other thing I could possibly point a finger at (but don't
> >>>>>>> believe is the cause) is that the slurm masters (slurmctld) are
> >>>>>>> only on gigE.
> >>>>>>>
> >>>>>>> I'm half thinking of opening a TT, but was hoping to get more
> >>>>>>> information first (and possibly not increase the logging of slurm,
> >>>>>>> which is my only next idea).
> >>>>>>>
> >>>>>>> Thanks for your thoughts Chris.
> >>>>>>>
> >>>>>>> Timothy
> >>>>>
> >>>>> --
> >>>>> Morris "Moe" Jette
> >>>>> CTO, SchedMD LLC
> >>>>> Commercial Slurm Development and Support
> >>>
> >>> --
> >>> Morris "Moe" Jette
> >>> CTO, SchedMD LLC
> >>> Commercial Slurm Development and Support
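Trey's suggestion at the top of the thread, checking ulimits from within the job environment rather than on a login node, can be sketched as a small probe (the srun invocation in the comment is illustrative; node counts are assumptions, and the probe itself just reports the current shell's limits):

```shell
# Illustrative: run under an allocation with, e.g.,
#   srun -N2 --ntasks-per-node=1 bash probe_limits.sh
# to see the limits slurmd-spawned tasks actually inherit. The probe:
printf '%s: memlock=%s nofile=%s stack=%s\n' \
    "$(hostname)" "$(ulimit -l)" "$(ulimit -n)" "$(ulimit -s)"
```

If the values printed inside a job differ from those on the login node, the limits set in /etc/sysconfig/slurm are not reaching the slurmd daemons.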
