Hi Moe and Antony,

Thanks for the link. On further thought, I think you're right in saying it's on the Linux network side. Looking at our system we have:

/proc/sys/fs/file-max: 2346778
/proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
/proc/sys/net/core/somaxconn: 128
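For context on why somaxconn is the suspicious one of the three: the backlog an application passes to listen(2) is silently capped at net.core.somaxconn, so with the value above any daemon asking for a deeper accept queue still gets 128. A minimal sketch of that cap (the requested backlog of 1024 is a hypothetical value, not something measured on our nodes):

```shell
# The kernel clamps the listen(2) backlog to net.core.somaxconn.
somaxconn=128        # current /proc/sys/net/core/somaxconn, as listed above
requested=1024       # hypothetical backlog an application passes to listen()
effective=$(( requested < somaxconn ? requested : somaxconn ))
echo "effective accept backlog: $effective"
```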
So we bumped somaxconn up to 2048 across the cluster. Of course we are in production, so I had to wait my (slightly prioritized) time in the queue. However, the jobs still fail with the same error:

srun: error: Task launch for 976058.0 failed on node node1505: Socket timed out on send/recv operation

Out of the 500 nodes, we get this error (in this case) for 158 nodes (in case the numbers are helpful).

The only other idea I have is related to total TCP memory; we currently have it set to:

/proc/sys/net/ipv4/tcp_mem: 2228352 2971136 4456704

which I interpret approximately as 8G, 11G and 17G, while each node has a total of 24G of RAM. So I'm thinking these values are OK. However, looking at other clusters (Stampede) it's set to:

c560-204.stampede(2)$ cat /proc/sys/net/ipv4/tcp_mem
16777216	16777216	16777216

which I interpret as 64G (wow!) while they have 32G per node. So am I interpreting tcp_mem incorrectly? I'm currently waiting in the queue again, but will try 16777216 for all three tcp_mem values when I get a job running.

We have a txqueuelen of 1024 for the IB interfaces and I don't want to touch that. Just about everything else I check in /proc regarding the network seems OK.

Does anybody have any further thoughts or pointers?

Thanks!
Timothy

> On Sep 22, 2015, at 8:57 AM, Moe Jette <[email protected]> wrote:
>
> I suspect that you are hitting some Linux system limit, such as open files
> or socket backlog. For information on how to address this, see:
> http://slurm.schedmd.com/big_sys.html
>
> Quoting Timothy Brown <[email protected]>:
>
>> Hi Moe,
>>
>>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
>>>
>>> What version of Slurm?
>>
>> We're currently running 14.11.7.
>>
>>> How many tasks/ranks in your job?
>>
>> I've been trying 500 nodes with 12 tasks per node, giving a total of 6000.
>> Although after this failed I started fiddling with less (100 nodes and
>> ramping up: 200, 300, 400, ...). It seems anything over 300 is touch and go.
>>
>>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
>>
>> Not reliably.
>>
>> $ cat hostname.sh
>> #!/bin/bash
>> #
>> #SBATCH --job-name=OSU_Int
>> #SBATCH --qos=admin
>> #SBATCH --time=00:15:00
>> #SBATCH --nodes=500
>> #SBATCH --ntasks-per-node=12
>> #SBATCH --account=crcbenchmark
>> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
>>
>> srun hostname
>>
>> $ sbatch hostname.sh
>> Submitted batch job 976034
>> $ wc -l hostname_976034.txt
>> 5992 hostname_976034.txt
>> $ grep -v ^node hostname_976034.txt
>> srun: error: Task launch for 976034.0 failed on node node0453: Socket timed out on send/recv operation
>> srun: error: Application launch failed: Socket timed out on send/recv operation
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> Any thoughts?
>>
>> Thanks
>> Timothy
>>
>>> Quoting Ralph Castain <[email protected]>:
>>>
>>>> This sounds like something in Slurm - I don't know how srun would know to
>>>> emit a message if the app was failing to open a socket between its own
>>>> procs.
>>>>
>>>> Try starting the OMPI job with "mpirun" instead of srun and see if it has
>>>> the same issue. If not, then that's pretty convincing that it's Slurm.
>>>>
>>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown <[email protected]> wrote:
>>>>>
>>>>> Hi Chris,
>>>>>
>>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> wrote:
>>>>>>
>>>>>> On 22/09/15 07:17, Timothy Brown wrote:
>>>>>>
>>>>>>> This is using mpiexec.hydra with Slurm as the bootstrap.
>>>>>>
>>>>>> Have you tried Intel MPI's native PMI start-up mode?
>>>>>>
>>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY to the
>>>>>> path to the Slurm libpmi.so file and then you should be able to use srun
>>>>>> to launch your job instead.
>>>>>
>>>>> Yeap, to the same effect.
>>>>> Here's what it gives:
>>>>>
>>>>> srun --mpi=pmi2 /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
>>>>> srun: error: Task launch for 973564.0 failed on node node0453: Socket timed out on send/recv operation
>>>>> srun: error: Application launch failed: Socket timed out on send/recv operation
>>>>>
>>>>>> More here:
>>>>>>
>>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
>>>>>>
>>>>>>> If I switch to OpenMPI the error is:
>>>>>>
>>>>>> Which version, and was it built with --with-slurm and (if your version
>>>>>> is not too ancient) --with-pmi=/path/to/slurm/install?
>>>>>
>>>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to
>>>>> EasyBuild). Yes, we included PMI and the Slurm option. Our configure
>>>>> statement was:
>>>>>
>>>>> module purge
>>>>> module load slurm/slurm
>>>>> module load gcc/5.1.0
>>>>> ./configure \
>>>>>   --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
>>>>>   --with-threads=posix \
>>>>>   --enable-mpi-thread-multiple \
>>>>>   --with-slurm \
>>>>>   --with-pmi=/curc/slurm/slurm/current/ \
>>>>>   --enable-static \
>>>>>   --enable-wrapper-rpath \
>>>>>   --enable-sensors \
>>>>>   --enable-mpi-ext=all \
>>>>>   --with-verbs
>>>>>
>>>>> It's got me scratching my head, as I started off thinking it was an MPI
>>>>> issue, and spent a while getting Intel's hydra and OpenMPI's oob to go
>>>>> over IB instead of gig-E. This increased the success rate, but we were
>>>>> still failing.
>>>>>
>>>>> I tried out a pure PMI (version 1) code (init, rank, size, fini), which
>>>>> worked a lot of the time. Which made me think it was MPI again! However,
>>>>> it fails often enough to say it's not MPI. The PMI v2 code I wrote gives
>>>>> the wrong results for rank and world size, so I'm sweeping that under the
>>>>> rug until I understand it!
>>>>>
>>>>> Just wondering if anybody has seen anything like this. Am happy to share
>>>>> our conf file if that helps.
>>>>>
>>>>> The only other thing I could possibly point a finger at (but don't
>>>>> believe is the cause) is that the Slurm masters (slurmctld) are only on
>>>>> gig-E.
>>>>>
>>>>> I'm half thinking of opening a TT, but was hoping to get more information
>>>>> first (and possibly avoid increasing Slurm's logging, which is my only
>>>>> next idea).
>>>>>
>>>>> Thanks for your thoughts Chris.
>>>>>
>>>>> Timothy
>>>
>>> --
>>> Morris "Moe" Jette
>>> CTO, SchedMD LLC
>>> Commercial Slurm Development and Support
>
> --
> Morris "Moe" Jette
> CTO, SchedMD LLC
> Commercial Slurm Development and Support
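A quick sanity check on the tcp_mem arithmetic discussed above: the three tcp_mem values are counted in pages, not bytes, so assuming the usual 4 KiB page size the conversion to GiB looks like this (a sketch of the arithmetic only, nothing cluster-specific):

```shell
# tcp_mem is measured in pages; convert to GiB assuming 4 KiB pages (x86_64).
page_size=4096
for pages in 2228352 2971136 4456704 16777216; do
  echo "$pages pages ~ $(( pages * page_size / 1024 / 1024 / 1024 )) GiB"
done
```

This prints roughly 8, 11, 17 and 64 GiB, which matches both readings in the thread, so the interpretation of tcp_mem as 4 KiB pages appears consistent.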
