Hi Chansup,

Yes, that's way up there too:
node0202 ~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 185698
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 2048000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 399360
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Interestingly enough, I wrote an (ugly hack) bit of code to open sockets in C. It just calls:

- socket()
- bind()
- listen()

Nothing is ever read or written, and I am able to open 28226 sockets on a node. I dare say the software stack looks good. Could it possibly be hardware related? Bad IB, or...? (I'm starting to grasp at straws here!)

Thanks,
Timothy

> On Sep 23, 2015, at 11:41 AM, CB <[email protected]> wrote:
>
> Hi Tim,
>
> I'm not sure if you've checked the "ulimit -n" value for the user who runs the job.
> In my experience, I had to bump the limit up much higher than the default 1024.
>
> Just my 2 cents,
> - Chansup
>
> On Wed, Sep 23, 2015 at 9:54 AM, Timothy Brown <[email protected]> wrote:
> Hi Moe and Antony,
>
> Thanks for the link. On further thinking, I think you're right in saying it's on the Linux network side. Looking at our system, we have:
>
> /proc/sys/fs/file-max: 2346778
> /proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
> /proc/sys/net/core/somaxconn: 128
>
> So we bumped somaxconn up to 2048 across the cluster.
>
> Of course, we are in production, so I had to wait my (slightly prioritized) time in the queue.
>
> However, the jobs still fail with the same error:
>
> srun: error: Task launch for 976058.0 failed on node node1505: Socket timed out on send/recv operation
>
> Out of the 500 nodes, we get this error (in this case) for 158 nodes (in case numbers are helpful).
>
> The only other idea I have is related to total TCP memory. We currently have it set to:
>
> /proc/sys/net/ipv4/tcp_mem
> 2228352 2971136 4456704
>
> Which I interpret as approximately 8G, 11G and 17G, while each node has a total of 24G of RAM. So I'm thinking these values are OK. However, looking at other clusters (Stampede), it's set to:
>
> c560-204.stampede(2)$ cat /proc/sys/net/ipv4/tcp_mem
> 16777216 16777216 16777216
>
> Which I interpret as 64G (wow!), while they have 32G per node. So am I interpreting tcp_mem incorrectly?
>
> I'm currently waiting in the queue again, but will try this 16777216 for all values in tcp_mem when I get a job running.
>
> We have a txqueuelen of 1024 for the IB interfaces, and I don't want to touch that.
>
> Just about everything else I check in /proc regarding the network seems OK.
>
> Does anybody have any further thoughts or pointers? Thanks!
>
> Timothy
>
> > On Sep 22, 2015, at 8:57 AM, Moe Jette <[email protected]> wrote:
> >
> > I suspect that you are hitting some Linux system limit, such as open files or socket backlog. For information on how to address it, see:
> > http://slurm.schedmd.com/big_sys.html
> >
> > Quoting Timothy Brown <[email protected]>:
> >
> >> Hi Moe,
> >>
> >>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
> >>>
> >>> What version of Slurm?
> >>
> >> We're currently running 14.11.7.
> >>
> >>> How many tasks/ranks in your job?
> >>
> >> I've been trying 500 nodes with 12 tasks per node, giving a total of 6000. Although after this failed, I started fiddling with less (100 nodes and ramping up: 200, 300, 400, ...). It seems anything over 300 is touch and go.
> >>
> >>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
> >>
> >> Not reliably.
> >>
> >> $ cat hostname.sh
> >> #!/bin/bash
> >> #
> >> #SBATCH --job-name=OSU_Int
> >> #SBATCH --qos=admin
> >> #SBATCH --time=00:15:00
> >> #SBATCH --nodes=500
> >> #SBATCH --ntasks-per-node=12
> >> #SBATCH --account=crcbenchmark
> >> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
> >>
> >> srun hostname
> >>
> >> $ sbatch hostname.sh
> >> Submitted batch job 976034
> >> $ wc -l hostname_976034.txt
> >> 5992 hostname_976034.txt
> >> $ grep -v ^node hostname_976034.txt
> >> srun: error: Task launch for 976034.0 failed on node node0453: Socket timed out on send/recv operation
> >> srun: error: Application launch failed: Socket timed out on send/recv operation
> >> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> >> srun: error: Timed out waiting for job step to complete
> >>
> >> Any thoughts?
> >>
> >> Thanks,
> >> Timothy
> >>
> >>> Quoting Ralph Castain <[email protected]>:
> >>>
> >>>> This sounds like something in Slurm - I don't know how srun would know to emit a message if the app was failing to open a socket between its own procs.
> >>>>
> >>>> Try starting the OMPI job with "mpirun" instead of srun and see if it has the same issue. If not, then that's pretty convincing that it's Slurm.
> >>>>
> >>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown <[email protected]> wrote:
> >>>>>
> >>>>> Hi Chris,
> >>>>>
> >>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> wrote:
> >>>>>>
> >>>>>> On 22/09/15 07:17, Timothy Brown wrote:
> >>>>>>
> >>>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
> >>>>>>
> >>>>>> Have you tried Intel MPI's native PMI start-up mode?
> >>>>>>
> >>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY to the path to the Slurm libpmi.so file, and then you should be able to use srun to launch your job instead.
> >>>>>>
> >>>>> Yeap, to the same effect. Here's what it gives:
> >>>>>
> >>>>> srun --mpi=pmi2 /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
> >>>>> srun: error: Task launch for 973564.0 failed on node node0453: Socket timed out on send/recv operation
> >>>>> srun: error: Application launch failed: Socket timed out on send/recv operation
> >>>>>
> >>>>>> More here:
> >>>>>>
> >>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
> >>>>>>
> >>>>>>> If I switch to OpenMPI the error is:
> >>>>>>
> >>>>>> Which version, and was it built with --with-slurm and (if your version is not too ancient) --with-pmi=/path/to/slurm/install?
> >>>>>
> >>>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to EasyBuild). Yes, we included PMI and the Slurm option. Our configure statement was:
> >>>>>
> >>>>> module purge
> >>>>> module load slurm/slurm
> >>>>> module load gcc/5.1.0
> >>>>> ./configure \
> >>>>>   --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
> >>>>>   --with-threads=posix \
> >>>>>   --enable-mpi-thread-multiple \
> >>>>>   --with-slurm \
> >>>>>   --with-pmi=/curc/slurm/slurm/current/ \
> >>>>>   --enable-static \
> >>>>>   --enable-wrapper-rpath \
> >>>>>   --enable-sensors \
> >>>>>   --enable-mpi-ext=all \
> >>>>>   --with-verbs
> >>>>>
> >>>>> It's got me scratching my head, as I started off thinking it was an MPI issue, and spent a while getting Intel's hydra and OpenMPI's oob to go over IB instead of gig-E. This increased the success rate, but we were still failing.
> >>>>>
> >>>>> I tried out a pure PMI (version 1) code (init, rank, size, fini), which worked a lot of the time. Which made me think it was MPI again! However, it fails often enough to say it's not MPI.
> >>>>> The PMI v2 code I wrote gives the wrong results for rank and world size, so I'm sweeping that under the rug until I understand it!
> >>>>>
> >>>>> Just wondering if anybody has seen anything like this. I'm happy to share our conf file if that helps.
> >>>>>
> >>>>> The only other thing I could possibly point a finger at (but don't believe is the cause) is that the Slurm masters (slurmctld) are only on gig-E.
> >>>>>
> >>>>> I'm half thinking of opening a TT, but was hoping to get more information first (my only other idea is to increase Slurm's logging, which I'd rather avoid).
> >>>>>
> >>>>> Thanks for your thoughts, Chris.
> >>>>>
> >>>>> Timothy
> >>>
> >>> --
> >>> Morris "Moe" Jette
> >>> CTO, SchedMD LLC
> >>> Commercial Slurm Development and Support
> >
> > --
> > Morris "Moe" Jette
> > CTO, SchedMD LLC
> > Commercial Slurm Development and Support