Hi Chansup,

Yes, that's way up there too:
node0202 ~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 185698
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 2048000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 399360
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Interestingly enough, I wrote an (ugly hack) bit of code to open sockets in C. It just calls:

- socket()
- bind()
- listen()

Nothing is ever read or written, and I am able to open 28226 sockets on a node. I dare say the software stack looks good. Could it possibly be hardware related? Bad IB, or...? (I'm starting to grasp at straws here!)

Thanks,
Timothy

> On Sep 23, 2015, at 11:41 AM, CB <[email protected]> wrote:
>
> Hi Tim,
>
> I'm not sure if you've checked the "ulimit -n" value for the user who runs the job.
> In my experience, I had to bump the limit up much higher than the default 1024.
>
> Just my 2 cents,
> - Chansup
>
> On Wed, Sep 23, 2015 at 9:54 AM, Timothy Brown <[email protected]> wrote:
> Hi Moe and Antony,
>
> Thanks for the link. On further thinking, I think you're right in saying it's on the Linux network side. Looking at our system, we have:
>
> /proc/sys/fs/file-max: 2346778
> /proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
> /proc/sys/net/core/somaxconn: 128
>
> So we bumped somaxconn up to 2048 across the cluster.
>
> Of course, we are in production, so I had to wait my (slightly prioritized) time in the queue.
>
> However, the jobs still fail with the same error:
>
> srun: error: Task launch for 976058.0 failed on node node1505: Socket timed out on send/recv operation
>
> Out of the 500 nodes, we get this error (in this case) for 158 nodes (in case numbers are helpful).
>
> The only other idea I have is related to total TCP memory. We currently have it set to:
>
> /proc/sys/net/ipv4/tcp_mem
> 2228352 2971136 4456704
>
> Which I interpret as approximately 8G, 11G and 17G, while each node has a total of 24G of RAM. So I'm thinking these values are OK. However, looking at other clusters (Stampede), it's set to:
>
> c560-204.stampede(2)$ cat /proc/sys/net/ipv4/tcp_mem
> 16777216 16777216 16777216
>
> Which I interpret as 64G (wow!), while they have 32G per node. So am I interpreting tcp_mem incorrectly?
>
> I'm currently waiting in the queue again, but will try this 16777216 for all values in tcp_mem when I get a job running.
>
> We have a txqueuelen of 1024 for the IB interfaces, and I don't want to touch that.
>
> Just about everything else I check in /proc regarding the network seems OK.
>
> Does anybody have any further thoughts or pointers? Thanks!
>
> Timothy
>
> > On Sep 22, 2015, at 8:57 AM, Moe Jette <[email protected]> wrote:
> >
> > I suspect that you are hitting some Linux system limit, such as open files or socket backlog. For information on how to address it, see:
> > http://slurm.schedmd.com/big_sys.html
> >
> > Quoting Timothy Brown <[email protected]>:
> >
> >> Hi Moe,
> >>
> >>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
> >>>
> >>> What version of Slurm?
> >>
> >> We're currently running 14.11.7.
> >>
> >>> How many tasks/ranks in your job?
> >>
> >> I've been trying 500 nodes with 12 tasks per node, giving a total of 6000. Although after this failed, I started fiddling with less (100 nodes and ramping up: 200, 300, 400, ...). It seems anything over 300 is touch and go.
> >>
> >>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
> >>
> >> Not reliably.
> >>
> >> $ cat hostname.sh
> >> #!/bin/bash
> >> #
> >> #SBATCH --job-name=OSU_Int
> >> #SBATCH --qos=admin
> >> #SBATCH --time=00:15:00
> >> #SBATCH --nodes=500
> >> #SBATCH --ntasks-per-node=12
> >> #SBATCH --account=crcbenchmark
> >> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
> >>
> >> srun hostname
> >>
> >> $ sbatch hostname.sh
> >> Submitted batch job 976034
> >> $ wc -l hostname_976034.txt
> >> 5992 hostname_976034.txt
> >> $ grep -v ^node hostname_976034.txt
> >> srun: error: Task launch for 976034.0 failed on node node0453: Socket timed out on send/recv operation
> >> srun: error: Application launch failed: Socket timed out on send/recv operation
> >> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> >> srun: error: Timed out waiting for job step to complete
> >>
> >> Any thoughts?
> >>
> >> Thanks,
> >> Timothy
> >>
> >>> Quoting Ralph Castain <[email protected]>:
> >>>
> >>>> This sounds like something in Slurm - I don't know how srun would know to emit a message if the app was failing to open a socket between its own procs.
> >>>>
> >>>> Try starting the OMPI job with "mpirun" instead of srun and see if it has the same issue. If not, then that's pretty convincing that it's Slurm.
> >>>>
> >>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown <[email protected]> wrote:
> >>>>>
> >>>>> Hi Chris,
> >>>>>
> >>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> wrote:
> >>>>>>
> >>>>>> On 22/09/15 07:17, Timothy Brown wrote:
> >>>>>>
> >>>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
> >>>>>>
> >>>>>> Have you tried Intel MPI's native PMI start-up mode?
> >>>>>>
> >>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY to the path to the Slurm libpmi.so file, and then you should be able to use srun to launch your job instead.
> >>>>>>
> >>>>> Yeap, to the same effect. Here's what it gives:
> >>>>>
> >>>>> srun --mpi=pmi2 /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
> >>>>> srun: error: Task launch for 973564.0 failed on node node0453: Socket timed out on send/recv operation
> >>>>> srun: error: Application launch failed: Socket timed out on send/recv operation
> >>>>>
> >>>>>> More here:
> >>>>>>
> >>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
> >>>>>>
> >>>>>>> If I switch to OpenMPI the error is:
> >>>>>>
> >>>>>> Which version, and was it built with --with-slurm and (if your version is not too ancient) --with-pmi=/path/to/slurm/install?
> >>>>>
> >>>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to EasyBuild). Yes, we included PMI and the Slurm option. Our configure statement was:
> >>>>>
> >>>>> module purge
> >>>>> module load slurm/slurm
> >>>>> module load gcc/5.1.0
> >>>>> ./configure \
> >>>>>   --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
> >>>>>   --with-threads=posix \
> >>>>>   --enable-mpi-thread-multiple \
> >>>>>   --with-slurm \
> >>>>>   --with-pmi=/curc/slurm/slurm/current/ \
> >>>>>   --enable-static \
> >>>>>   --enable-wrapper-rpath \
> >>>>>   --enable-sensors \
> >>>>>   --enable-mpi-ext=all \
> >>>>>   --with-verbs
> >>>>>
> >>>>> It's got me scratching my head, as I started off thinking it was an MPI issue, and spent a while getting Intel's hydra and OpenMPI's oob to go over IB instead of gig-E. This increased the success rate, but we were still failing.
> >>>>>
> >>>>> I tried out a pure PMI (version 1) code (init, rank, size, fini), which worked a lot of the time. Which made me think it was MPI again! However, it fails often enough to say it's not MPI.
> >>>>> The PMI v2 code I wrote gives the wrong results for rank and world size, so I'm sweeping that under the rug until I understand it!
> >>>>>
> >>>>> Just wondering if anybody has seen anything like this. I'm happy to share our conf file if that helps.
> >>>>>
> >>>>> The only other thing I could possibly point a finger at (but don't believe is the cause) is that the Slurm masters (slurmctld) are only on gig-E.
> >>>>>
> >>>>> I'm half thinking of opening a TT, but was hoping to get more information first (my only other idea is to increase Slurm's logging, which I'd rather avoid).
> >>>>>
> >>>>> Thanks for your thoughts, Chris.
> >>>>>
> >>>>> Timothy
> >>>
> >>> --
> >>> Morris "Moe" Jette
> >>> CTO, SchedMD LLC
> >>> Commercial Slurm Development and Support
> >
> > --
> > Morris "Moe" Jette
> > CTO, SchedMD LLC
> > Commercial Slurm Development and Support