Hi Chansup and Trey,

Thanks, yes the slurmd init script does contain:

ulimit -n 1048576
ulimit -l unlimited
ulimit -s unlimited
However, I think we finally figured it out. I'm going to look like a fool when I explain it: it's all been a wild goose chase, and I didn't check the obvious. We mount home (and a couple of other file-systems) over NFS with the automounter. A few nodes didn't mount home and our health checker was broken, so I was assigned them. Then when my job went to run, it couldn't on those bad nodes. We are still trying to figure out why we are having NFS issues, but at this stage I'm looking at that, since the network stack on the nodes looks consistent (and good, in my opinion). If anybody is interested in our final outcome I can let you know. Otherwise, thanks for all the pointers and help.

Regards
Timothy

> On Sep 24, 2015, at 8:03 AM, CB <[email protected]> wrote:
>
> Hi Tim,
>
> I would also check whether the slurmd daemon overwrites the user limits when a job is launched by Slurm. Submit a job that runs "ulimit -a" and see what's set when the job is launched by Slurm.
>
> In other words, I would also check /proc/<slurmd_process_id>/limits and see what limits the slurmd process has. In particular, you may want to check the "Max open files" value. If this value is lower than the user's limit, the user's limit will be overwritten by slurmd.
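That /proc check can be sketched as follows. To keep the snippet self-contained it inspects the current shell's own PID (`$$`); on a compute node you would substitute slurmd's PID, e.g. from `pgrep -o slurmd`:

```shell
# Inspect a process's effective limits via /proc/<pid>/limits.
# $$ (this shell) stands in for slurmd's PID here; on a compute node,
# use something like: pid=$(pgrep -o slurmd)
pid=$$
grep -E 'Max open files|Max locked memory|Max stack size' "/proc/${pid}/limits"
```

Since job steps are forked from slurmd, whatever slurmd reports here is the ceiling a task starts from, regardless of the user's login-shell limits.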
>
> Regards,
> - Chansup
>
> On Wed, Sep 23, 2015 at 2:13 PM, Timothy Brown <[email protected]> wrote:
> Hi Chansup,
>
> Yes, that's way up there too:
>
> node0202 ~$ ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 185698
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1048576
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 2048000
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 399360
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Interestingly enough, I wrote an (ugly hack) bit of code to open sockets in C; it just calls:
> - socket()
> - bind()
> - listen()
> Nothing is ever read/written, and I am able to open 28226 sockets on a node.
>
> I dare say the software stack looks good. Could it possibly be hardware related? Bad IB or???? (I'm starting to grasp at straws here!)
>
> Thanks
> Timothy
>
> > On Sep 23, 2015, at 11:41 AM, CB <[email protected]> wrote:
> >
> > Hi Tim,
> >
> > I'm not sure if you've checked the "ulimit -n" value for the user who runs the job. In my experience, I had to bump the limit up much higher than the default 1024.
> >
> > Just my 2 cents,
> > - Chansup
> >
> > On Wed, Sep 23, 2015 at 9:54 AM, Timothy Brown <[email protected]> wrote:
> > Hi Moe and Antony,
> >
> > Thanks for the link. On further thought, I think you're right in saying it's on the Linux network side. Looking at our system we have:
> >
> > /proc/sys/fs/file-max: 2346778
> > /proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
> > /proc/sys/net/core/somaxconn: 128
> >
> > So we bumped somaxconn up to 2048 across the cluster.
> >
> > Of course we are in production, so I had to wait my (slightly prioritized) time in the queue.
> >
> > However, the jobs still fail with the same error:
> >
> > srun: error: Task launch for 976058.0 failed on node node1505: Socket timed out on send/recv operation
> >
> > Out of the 500 nodes, we get this error (in this case) for 158 nodes (in case the numbers are helpful).
> >
> > The only other idea I have is related to total TCP memory; we currently have it set to:
> >
> > /proc/sys/net/ipv4/tcp_mem
> > 2228352 2971136 4456704
> >
> > Which I interpret as approximately 8G, 11G and 17G, while each node has a total of 24G of RAM. So I'm thinking these values are OK. However, looking at other clusters (Stampede) it's set to:
> >
> > c560-204.stampede(2)$ cat /proc/sys/net/ipv4/tcp_mem
> > 16777216 16777216 16777216
> >
> > Which I interpret as 64G (wow!), while they have 32G per node. So am I interpreting tcp_mem incorrectly?
> >
> > I'm currently waiting in the queue again, but will try 16777216 for all three tcp_mem values when I get a job running.
> >
> > We have a txqueuelen of 1024 for the IB interfaces and I don't want to touch that.
> >
> > Just about everything else I check in /proc regarding the network seems OK.
> >
> > Does anybody have any further thoughts or pointers? Thanks!
> >
> > Timothy
> >
> > > On Sep 22, 2015, at 8:57 AM, Moe Jette <[email protected]> wrote:
> > >
> > > I suspect that you are hitting some Linux system limit, such as open files or socket backlog. For information on how to address this, see:
> > > http://slurm.schedmd.com/big_sys.html
> > >
> > > Quoting Timothy Brown <[email protected]>:
> > >
> > >> Hi Moe,
> > >>
> > >>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
> > >>>
> > >>> What version of Slurm?
> > >>
> > >> We're currently running 14.11.7
> > >>
> > >>> How many tasks/ranks in your job?
> > >>
> > >> I've been trying 500 nodes with 12 tasks per node, giving a total of 6000.
> > >> Although after this failed I started fiddling with less (100 nodes and ramping up: 200, 300, 400, ...). It seems anything over 300 is touch and go.
> > >>
> > >>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
> > >>
> > >> Not reliably.
> > >>
> > >> $ cat hostname.sh
> > >> #!/bin/bash
> > >> #
> > >> #SBATCH --job-name=OSU_Int
> > >> #SBATCH --qos=admin
> > >> #SBATCH --time=00:15:00
> > >> #SBATCH --nodes=500
> > >> #SBATCH --ntasks-per-node=12
> > >> #SBATCH --account=crcbenchmark
> > >> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
> > >>
> > >> srun hostname
> > >>
> > >> $ sbatch hostname.sh
> > >> Submitted batch job 976034
> > >> $ wc -l hostname_976034.txt
> > >> 5992 hostname_976034.txt
> > >> $ grep -v ^node hostname_976034.txt
> > >> srun: error: Task launch for 976034.0 failed on node node0453: Socket timed out on send/recv operation
> > >> srun: error: Application launch failed: Socket timed out on send/recv operation
> > >> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> > >> srun: error: Timed out waiting for job step to complete
> > >>
> > >> Any thoughts?
> > >>
> > >> Thanks
> > >> Timothy
> > >>
> > >>> Quoting Ralph Castain <[email protected]>:
> > >>>> This sounds like something in Slurm - I don’t know how srun would know to emit a message if the app was failing to open a socket between its own procs.
> > >>>>
> > >>>> Try starting the OMPI job with “mpirun” instead of srun and see if it has the same issue. If not, then that’s pretty convincing that it’s Slurm.
> > >>>>
> > >>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown <[email protected]> wrote:
> > >>>>>
> > >>>>> Hi Chris,
> > >>>>>
> > >>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> wrote:
> > >>>>>>
> > >>>>>> On 22/09/15 07:17, Timothy Brown wrote:
> > >>>>>>
> > >>>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
> > >>>>>>
> > >>>>>> Have you tried Intel MPI's native PMI start-up mode?
> > >>>>>>
> > >>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY to the path to the Slurm libpmi.so file, and then you should be able to use srun to launch your job instead.
> > >>>>>
> > >>>>> Yeap, to the same effect. Here's what it gives:
> > >>>>>
> > >>>>> srun --mpi=pmi2 /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
> > >>>>> srun: error: Task launch for 973564.0 failed on node node0453: Socket timed out on send/recv operation
> > >>>>> srun: error: Application launch failed: Socket timed out on send/recv operation
> > >>>>>
> > >>>>>> More here:
> > >>>>>>
> > >>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
> > >>>>>>
> > >>>>>>> If I switch to OpenMPI the error is:
> > >>>>>>
> > >>>>>> Which version, and was it built with --with-slurm and (if your version is not too ancient) --with-pmi=/path/to/slurm/install?
> > >>>>>
> > >>>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to EasyBuild). Yes, we included PMI and the Slurm option.
> > >>>>> Our configure statement was:
> > >>>>>
> > >>>>> module purge
> > >>>>> module load slurm/slurm
> > >>>>> module load gcc/5.1.0
> > >>>>> ./configure \
> > >>>>>   --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
> > >>>>>   --with-threads=posix \
> > >>>>>   --enable-mpi-thread-multiple \
> > >>>>>   --with-slurm \
> > >>>>>   --with-pmi=/curc/slurm/slurm/current/ \
> > >>>>>   --enable-static \
> > >>>>>   --enable-wrapper-rpath \
> > >>>>>   --enable-sensors \
> > >>>>>   --enable-mpi-ext=all \
> > >>>>>   --with-verbs
> > >>>>>
> > >>>>> It's got me scratching my head, as I started off thinking it was an MPI issue, and spent a while getting Intel's hydra and OpenMPI's oob to go over IB instead of gig-E. This increased the success rate, but we were still failing.
> > >>>>>
> > >>>>> I tried out a pure PMI (version 1) code (init, rank, size, fini), which worked a lot of the time. Which made me think it was MPI again! However, it fails often enough to say it's not MPI. The PMI v2 code I wrote gives the wrong results for rank and world size, so I'm sweeping that under the rug until I understand it!
> > >>>>>
> > >>>>> Just wondering if anybody has seen anything like this. I am happy to share our conf file if that helps.
> > >>>>>
> > >>>>> The only other thing I could possibly point a finger at (but don't believe it is) is that the slurm masters (slurmctld) are only on gig-E.
> > >>>>>
> > >>>>> I'm half thinking of opening a TT, but was hoping to get more information (and possibly not increase the logging of slurm, which is my only next idea).
> > >>>>>
> > >>>>> Thanks for your thoughts Chris.
> > >>>>>
> > >>>>> Timothy
> > >>>
> > >>> --
> > >>> Morris "Moe" Jette
> > >>> CTO, SchedMD LLC
> > >>> Commercial Slurm Development and Support
> > >
> > > --
> > > Morris "Moe" Jette
> > > CTO, SchedMD LLC
> > > Commercial Slurm Development and Support
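On the tcp_mem question raised earlier in the thread: the three values (low, pressure, high) are counted in kernel pages, not bytes, so the "approximately 8G, 11G and 17G" reading is correct. A quick sketch of the conversion, assuming the usual 4 KiB page size:

```shell
# tcp_mem holds page counts (low, pressure, high watermarks), not bytes.
# Convert the figures quoted in the thread to GiB, assuming 4 KiB pages
# (check with: getconf PAGESIZE).
page=4096
gib=$((1024 * 1024 * 1024))

echo "low:      $(( 2228352  * page / gib )) GiB"   # 8  (8.5, truncated)
echo "pressure: $(( 2971136  * page / gib )) GiB"   # 11
echo "high:     $(( 4456704  * page / gib )) GiB"   # 17
echo "stampede: $(( 16777216 * page / gib )) GiB"   # exactly 64
```

By this arithmetic the Stampede setting really is 64 GiB per watermark, i.e. well above physical RAM, which effectively disables TCP memory pressure handling rather than budgeting for it.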
