I tried a couple more things this afternoon: a 250-node job (12 tasks per node), but before running srun I set PMI_TIME=4000. This is the error I received:
size = 3000, rank = 2853
size = 3000, rank = 1853
size = 3000, rank = 2353
size = 3000, rank = 853
srun: error: timeout waiting for task launch, started 2976 of 3000 tasks
srun: Job step 977359.1 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

So digging more into this, I figured out the system topology. slurm[1,2] are on a VM, connected with gigE. The nodes have gigE and IB (their hostnames are nodeXXXX, nodeXXXXib). Slurm allocates jobs ($SLURM_NODELIST) to nodeXXXX. A tcpdump of the job startup shows it's all going over eth0, not ib0. Could this possibly be it?

Looking at Stampede (the only other large Slurm deployment I have access to), it seems everything there is over IB. Also, comparing our /proc/sys/net (sysctl) settings, we are almost identical.

Thoughts? Comments?

Thanks
Timothy

> On Sep 23, 2015, at 12:14 PM, Timothy Brown <[email protected]> wrote:
>
> Hi Chansup,
>
> Yes, that's way up there too:
>
> node0202 ~$ ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 185698
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1048576
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 2048000
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 399360
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Interestingly enough, I wrote an (ugly hack) bit of code to open sockets in C. It just does:
> - socket()
> - bind()
> - listen()
> Nothing is ever read/written, and I am able to open 28226 sockets on a node.
>
> I dare say the software stack looks good. Could it possibly be hardware related? Bad IB or???? (I'm starting to grasp at straws here!)
>
> Thanks
> Timothy
>
>> On Sep 23, 2015, at 11:41 AM, CB <[email protected]> wrote:
>>
>> Hi Tim,
>>
>> I'm not sure if you've checked the "ulimit -n" value for the user who runs the job.
>> In my experience, I had to bump up the limit much higher than the default 1024.
>>
>> Just my 2 cents,
>> - Chansup
>>
>> On Wed, Sep 23, 2015 at 9:54 AM, Timothy Brown <[email protected]> wrote:
>> Hi Moe and Antony,
>>
>> Thanks for the link. On further thought, I think you're right in saying it's on the Linux network side. Looking at our system, we have:
>>
>> /proc/sys/fs/file-max: 2346778
>> /proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
>> /proc/sys/net/core/somaxconn: 128
>>
>> So we bumped somaxconn up to 2048 across the cluster.
>>
>> Of course we are in production, so I had to wait my (slightly prioritized) time in the queue.
>>
>> However, the jobs still fail with the same error:
>>
>> srun: error: Task launch for 976058.0 failed on node node1505: Socket timed out on send/recv operation
>>
>> Out of the 500 nodes, we get this error (in this case) for 158 nodes (in case numbers are helpful).
>>
>> The only other idea I have is related to total TCP memory; we currently have it set to:
>>
>> /proc/sys/net/ipv4/tcp_mem
>> 2228352 2971136 4456704
>>
>> Which I interpret as approximately 8G, 11G and 17G, while each node has a total of 24G of RAM. So I'm thinking these values are OK. However, looking at other clusters (Stampede), it's set to:
>>
>> c560-204.stampede(2)$ cat /proc/sys/net/ipv4/tcp_mem
>> 16777216 16777216 16777216
>>
>> Which I interpret as 64G (wow!), while they have 32G per node. So am I interpreting tcp_mem incorrectly?
>>
>> I'm currently waiting in the queue again, but will try 16777216 for all values in tcp_mem when I get a job running.
>>
>> We have a txqueuelen of 1024 for the IB interfaces and I don't want to touch that.
>>
>> Just about everything else I check in /proc regarding the network seems OK.
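A side note on the tcp_mem interpretation above: the three thresholds are counted in memory pages, not bytes, so converting them means multiplying by the page size (assumed here to be the usual 4096 bytes on x86_64; check with `getconf PAGESIZE`). A quick sanity check of the numbers quoted in this thread:

```shell
# tcp_mem thresholds are in pages; page size is typically 4096 bytes on x86_64
page=4096
for pages in 2228352 2971136 4456704 16777216; do
    printf '%s pages = %s GiB\n' "$pages" $(( pages * page / 1024 / 1024 / 1024 ))
done
# -> roughly 8, 11 and 17 GiB for the local values, and 64 GiB for Stampede's,
#    matching the interpretation in the message above
```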
>>
>> Does anybody have any further thoughts or pointers? Thanks!
>>
>> Timothy
>>
>>> On Sep 22, 2015, at 8:57 AM, Moe Jette <[email protected]> wrote:
>>>
>>> I suspect that you are hitting some Linux system limit, such as open files or socket backlog. For information on how to address this, see:
>>> http://slurm.schedmd.com/big_sys.html
>>>
>>> Quoting Timothy Brown <[email protected]>:
>>>
>>>> Hi Moe,
>>>>
>>>>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
>>>>>
>>>>> What version of Slurm?
>>>>
>>>> We're currently running 14.11.7
>>>>
>>>>> How many tasks/ranks in your job?
>>>>
>>>> I've been trying 500 nodes with 12 tasks per node, giving a total of 6000. After this failed, I started fiddling with fewer (100 nodes and ramping up: 200, 300, 400, ...). It seems anything over 300 is touch and go.
>>>>
>>>>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
>>>>
>>>> Not reliably.
>>>>
>>>> $ cat hostname.sh
>>>> #!/bin/bash
>>>> #
>>>> #SBATCH --job-name=OSU_Int
>>>> #SBATCH --qos=admin
>>>> #SBATCH --time=00:15:00
>>>> #SBATCH --nodes=500
>>>> #SBATCH --ntasks-per-node=12
>>>> #SBATCH --account=crcbenchmark
>>>> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
>>>>
>>>> srun hostname
>>>>
>>>> $ sbatch hostname.sh
>>>> Submitted batch job 976034
>>>> $ wc -l hostname_976034.txt
>>>> 5992 hostname_976034.txt
>>>> $ grep -v ^node hostname_976034.txt
>>>> srun: error: Task launch for 976034.0 failed on node node0453: Socket timed out on send/recv operation
>>>> srun: error: Application launch failed: Socket timed out on send/recv operation
>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>> srun: error: Timed out waiting for job step to complete
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks
>>>> Timothy
>>>>
>>>>> Quoting Ralph Castain <[email protected]>:
>>>>>
>>>>>> This sounds like something in Slurm - I don't know how srun would know to emit a message if the app was failing to open a socket between its own procs.
>>>>>>
>>>>>> Try starting the OMPI job with "mpirun" instead of srun and see if it has the same issue. If not, then that's pretty convincing that it's Slurm.
>>>>>>
>>>>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Chris,
>>>>>>>
>>>>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 22/09/15 07:17, Timothy Brown wrote:
>>>>>>>>
>>>>>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
>>>>>>>>
>>>>>>>> Have you tried Intel MPI's native PMI start-up mode?
>>>>>>>>
>>>>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY to the path to the Slurm libpmi.so file, and then you should be able to use srun to launch your job instead.
>>>>>>>
>>>>>>> Yeap, to the same effect. Here's what it gives:
>>>>>>>
>>>>>>> srun --mpi=pmi2 /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
>>>>>>> srun: error: Task launch for 973564.0 failed on node node0453: Socket timed out on send/recv operation
>>>>>>> srun: error: Application launch failed: Socket timed out on send/recv operation
>>>>>>>
>>>>>>>> More here:
>>>>>>>>
>>>>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
>>>>>>>>
>>>>>>>>> If I switch to OpenMPI the error is:
>>>>>>>>
>>>>>>>> Which version, and was it built with --with-slurm and (if your version is not too ancient) --with-pmi=/path/to/slurm/install?
>>>>>>>
>>>>>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to EasyBuild).
>>>>>>> Yes, we included PMI and the Slurm option. Our configure statement was:
>>>>>>>
>>>>>>> module purge
>>>>>>> module load slurm/slurm
>>>>>>> module load gcc/5.1.0
>>>>>>> ./configure \
>>>>>>>     --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
>>>>>>>     --with-threads=posix \
>>>>>>>     --enable-mpi-thread-multiple \
>>>>>>>     --with-slurm \
>>>>>>>     --with-pmi=/curc/slurm/slurm/current/ \
>>>>>>>     --enable-static \
>>>>>>>     --enable-wrapper-rpath \
>>>>>>>     --enable-sensors \
>>>>>>>     --enable-mpi-ext=all \
>>>>>>>     --with-verbs
>>>>>>>
>>>>>>> It's got me scratching my head, as I started off thinking it was an MPI issue, and I spent a while getting Intel's hydra and OpenMPI's oob to go over IB instead of gig-E. This increased the success rate, but we were still failing.
>>>>>>>
>>>>>>> I tried out a pure PMI (version 1) code (init, rank, size, fini), which worked a lot of the time. Which made me think it was MPI again! However, it fails often enough to say it's not MPI. The PMI v2 code I wrote gives the wrong results for rank and world size, so I'm sweeping that under the rug until I understand it!
>>>>>>>
>>>>>>> Just wondering if anybody has seen anything like this. I'm happy to share our conf file if that helps.
>>>>>>>
>>>>>>> The only other thing I could possibly point a finger at (but don't believe it is), is that the slurm masters (slurmctld) are only on gig-E.
>>>>>>>
>>>>>>> I'm half thinking of opening a TT, but was hoping to get more information first (and possibly not increase the logging of slurm, which is my only next idea).
>>>>>>>
>>>>>>> Thanks for your thoughts Chris.
>>>>>>>
>>>>>>> Timothy
>>>>>
>>>>> --
>>>>> Morris "Moe" Jette
>>>>> CTO, SchedMD LLC
>>>>> Commercial Slurm Development and Support
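For anyone following along, the sysctl tuning actually applied in this thread (per the big_sys page Moe linked) comes down to a couple of settings. A sketch of applying and persisting them, with values taken from the messages above; the pdsh invocation and node range are placeholders for whatever mechanism your cluster uses to push settings:

```shell
# Raise the listen and SYN backlogs discussed above (run as root on each node)
sysctl -w net.core.somaxconn=2048
sysctl -w net.ipv4.tcp_max_syn_backlog=2048

# Persist across reboots
cat >> /etc/sysctl.conf <<'EOF'
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 2048
EOF

# Example of pushing to all compute nodes with pdsh (hypothetical node range)
pdsh -w 'node[0001-0500]' 'sysctl -w net.core.somaxconn=2048 net.ipv4.tcp_max_syn_backlog=2048'
```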
