Hi Tim,

I would also check whether the slurmd daemon overrides the user limits when a
job is launched by Slurm.
Submit a job that runs "ulimit -a" and see what is actually set when the job
is launched by Slurm.

In other words, I would also check /proc/<slurmd_process_id>/limits and
see what limits the slurmd process has.
In particular, you may want to check the "Max open files" value. If this
value is lower than the user's limit, the user's limit will be overridden
by slurmd.
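As a sketch of that check (a Python stand-in; the column layout of /proc/<pid>/limits is the standard kernel format, and the helper names here are mine, not anything from Slurm):

```python
import re
from pathlib import Path

def max_open_files(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text."""
    m = re.search(r"Max open files\s+(\S+)\s+(\S+)", limits_text)
    return (m.group(1), m.group(2)) if m else None

def slurmd_max_open_files(pid):
    # Read the limits of a running slurmd, given its PID (e.g. from pgrep slurmd)
    return max_open_files(Path(f"/proc/{pid}/limits").read_text())
```

If the soft value reported here is lower than what users see interactively, slurmd is the one imposing it on job steps.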

Regards,
- Chansup

On Wed, Sep 23, 2015 at 2:13 PM, Timothy Brown <[email protected]> wrote:

> Hi Chansup,
>
> Yes, that's way up there too:
>
> node0202 ~$ ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 185698
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1048576
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 2048000
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 399360
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
>
> Interestingly enough, I wrote an (ugly-hack) bit of code to open sockets in
> C; it just calls:
> - socket()
> - bind()
> - listen()
> Nothing is ever read or written, and I am able to open 28226 sockets on a node.
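That kind of tester can be sketched as follows (a Python stand-in for the C program described above; the loop bound and error handling are my assumptions):

```python
import socket

def open_listening_sockets(limit):
    """Open listening sockets (socket/bind/listen, nothing read or written)
    until the OS refuses or `limit` is reached; return the open sockets."""
    socks = []
    try:
        for _ in range(limit):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("127.0.0.1", 0))  # ephemeral port
            s.listen(1)
            socks.append(s)
    except OSError:
        pass  # hit a limit, e.g. EMFILE (too many open files)
    return socks
```

Running it with a large limit and printing `len(...)` of the result reproduces the "how many sockets can this node open" measurement.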
>
> I dare say the software stack looks good. Could it possibly be hardware
> related? Bad IB, or...? (I'm starting to grasp at straws here!)
>
> Thanks
> Timothy
>
> > On Sep 23, 2015, at 11:41 AM, CB <[email protected]> wrote:
> >
> > Hi Tim,
> >
> > I'm not sure if you've checked the "ulimit -n" value for the user who runs
> the job.
> > In my experience, I had to bump the limit up much higher than the
> default 1024.
> >
> > Just my 2 cents,
> > - Chansup
> >
> > On Wed, Sep 23, 2015 at 9:54 AM, Timothy Brown <[email protected]> wrote:
> > Hi Moe and Antony,
> >
> > Thanks for the link. On further thought, I think you're right in saying
> it's on the Linux network side. Looking at our system, we have:
> > /proc/sys/fs/file-max: 2346778
> > /proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
> > /proc/sys/net/core/somaxconn: 128
> >
> > So we bumped somaxconn up to 2048 across the cluster.
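A quick way to snapshot the values involved (a hedged sketch; the /proc paths are the ones listed above):

```python
from pathlib import Path

def read_proc_values(paths):
    """Read whitespace-separated integer values from /proc-style files."""
    return {p: [int(v) for v in Path(p).read_text().split()] for p in paths}

# Example (paths as listed above):
# read_proc_values(["/proc/sys/fs/file-max",
#                   "/proc/sys/net/ipv4/tcp_max_syn_backlog",
#                   "/proc/sys/net/core/somaxconn"])
```

Handy for diffing the settings across the cluster before and after a change.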
> >
> > Of course we are in production so I had to wait my (slightly
> prioritized) time in the queue.
> >
> > However the jobs still fail with the same error:
> >
> > srun: error: Task launch for 976058.0 failed on node node1505: Socket
> timed out on send/recv operation
> >
> > Out of the 500 nodes, we get this error (in this case) for 158 nodes (in
> case numbers are helpful).
> >
> > The only other idea I have is related to total TCP memory, we currently
> have it set to:
> >
> > /proc/sys/net/ipv4/tcp_mem
> > 2228352 2971136 4456704
> >
> > I interpret these as roughly 8G, 11G and 17G, while each node has
> a total of 24G of RAM. So I'm thinking these values are OK. However, looking
> at other clusters (Stampede), it's set to:
> >
> > c560-204.stampede(2)$ cat /proc/sys/net/ipv4/tcp_mem
> > 16777216        16777216        16777216
> >
> > Which I interpret as 64G (wow!) while they have 32G per node. So am I
> interpreting tcp_mem incorrectly?
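For reference, tcp_mem is counted in memory pages (4 KiB on x86), not bytes, so the arithmetic can be checked directly (assuming a 4 KiB page size):

```python
PAGE_SIZE = 4096  # bytes per page on x86

def tcp_mem_pages_to_gib(pages):
    """Convert a tcp_mem value (in memory pages) to GiB."""
    return pages * PAGE_SIZE / 2**30
```

With that, 16777216 pages is exactly 64 GiB, and the first triple works out to about 8.5, 11.3 and 17 GiB, which matches the interpretation above.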
> >
> > I'm currently waiting in the queue again, but will try this 16777216 for
> all values in tcp_mem when I get a job running.
> >
> > We have a txqueuelen of 1024 for the IB interfaces and I don't want to
> touch that.
> >
> > Just about everything else I check in proc regarding the network seems
> ok.
> >
> > Does anybody have any further thoughts or pointers? Thanks!
> >
> > Timothy
> >
> >
> > > On Sep 22, 2015, at 8:57 AM, Moe Jette <[email protected]> wrote:
> > >
> > >
> > > I suspect that you are hitting some Linux system limit, such as open
> files or socket backlog. For information on how to address this, see:
> > > http://slurm.schedmd.com/big_sys.html
> > >
> > >
> > > Quoting Timothy Brown <[email protected]>:
> > >
> > >> Hi Moe,
> > >>
> > >>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
> > >>>
> > >>>
> > >>> What version of Slurm?
> > >>
> > >> We're currently running 14.11.7
> > >>
> > >>> How many tasks/ranks in your job?
> > >>
> > >> I've been trying 500 nodes with 12 tasks per node, giving a total of
> 6000. After this failed, I started
> > >> fiddling with fewer (100 nodes, then ramping up: 200, 300, 400...). It
> seems anything over 300 is touch and go.
> > >>
> > >>
> > >>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
> > >>
> > >> Not reliably.
> > >>
> > >> $ cat hostname.sh
> > >> #!/bin/bash
> > >> #
> > >> #SBATCH --job-name=OSU_Int
> > >> #SBATCH --qos=admin
> > >> #SBATCH --time=00:15:00
> > >> #SBATCH --nodes=500
> > >> #SBATCH --ntasks-per-node=12
> > >> #SBATCH --account=crcbenchmark
> > >> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
> > >>
> > >>
> > >> srun hostname
> > >>
> > >> $ sbatch hostname.sh
> > >> Submitted batch job 976034
> > >> $ wc -l hostname_976034.txt
> > >> 5992 hostname_976034.txt
> > >> $ grep -v ^node hostname_976034.txt
> > >> srun: error: Task launch for 976034.0 failed on node node0453: Socket
> timed out on send/recv operation
> > >> srun: error: Application launch failed: Socket timed out on send/recv
> operation
> > >> srun: Job step aborted: Waiting up to 32 seconds for job step to
> finish.
> > >> srun: error: Timed out waiting for job step to complete
> > >>
> > >> Any thoughts?
> > >>
> > >> Thanks
> > >> Timothy
> > >>
> > >>>
> > >>> Quoting Ralph Castain <[email protected]>:
> > >>>> This sounds like something in Slurm - I don’t know how srun would
> know to emit a message if the app was failing to open a socket between its
> own procs.
> > >>>>
> > >>>> Try starting the OMPI job with “mpirun” instead of srun and see if
> it has the same issue. If not, then that’s pretty convincing that it’s
> slurm.
> > >>>>
> > >>>>
> > >>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown
> <[email protected]> wrote:
> > >>>>>
> > >>>>>
> > >>>>> Hi Chris,
> > >>>>>
> > >>>>>
> > >>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <
> [email protected]> wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>> On 22/09/15 07:17, Timothy Brown wrote:
> > >>>>>>
> > >>>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
> > >>>>>>
> > >>>>>> Have you tried Intel MPI's native PMI start up mode?
> > >>>>>>
> > >>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY
> to the
> > >>>>>> path to the Slurm libpmi.so file and then you should be able to
> use srun
> > >>>>>> to launch your job instead.
> > >>>>>>
> > >>>>>
> > >>>>> Yeap, to the same effect. Here's what it gives:
> > >>>>>
> > >>>>> srun --mpi=pmi2
> /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
> > >>>>> srun: error: Task launch for 973564.0 failed on node node0453:
> Socket timed out on send/recv operation
> > >>>>> srun: error: Application launch failed: Socket timed out on
> send/recv operation
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>> More here:
> > >>>>>>
> > >>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
> > >>>>>>
> > >>>>>>> If I switch to OpenMPI the error is:
> > >>>>>>
> > >>>> Which version, and was it built with --with-slurm and (if your
> > >>>> version is not too ancient) --with-pmi=/path/to/slurm/install ?
> > >>>>>
> > >>>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to
> EasyBuild). Yes we included PMI and the Slurm option. Our configure
> statement was:
> > >>>>>
> > >>>>> module purge
> > >>>>> module load slurm/slurm
> > >>>>> module load gcc/5.1.0
> > >>>>> ./configure  \
> > >>>>> --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
> > >>>>> --with-threads=posix \
> > >>>>> --enable-mpi-thread-multiple \
> > >>>>> --with-slurm \
> > >>>>> --with-pmi=/curc/slurm/slurm/current/ \
> > >>>>> --enable-static \
> > >>>>> --enable-wrapper-rpath \
> > >>>>> --enable-sensors \
> > >>>>> --enable-mpi-ext=all \
> > >>>>> --with-verbs
> > >>>>>
> > >>>>> It's got me scratching my head. I started off thinking it was
> an MPI issue, and spent a while getting Intel's hydra and OpenMPI's oob to go
> over IB instead of gig-e. This increased the success rate, but we were
> still failing.
> > >>>>>
> > >>>>> Tried out a pure PMI (version 1) code (init, rank, size, fini),
> which worked most of the time. That made me think it was MPI again!
> However, it fails often enough to say it's not MPI. The PMI v2 code I wrote
> gives the wrong results for rank and world size, so I'm sweeping that under
> the rug until I understand it!
> > >>>>>
> > >>>>> Just wondering if anybody has seen anything like this. Am happy to
> share our conf file if that helps.
> > >>>>>
> > >>>>> The only other thing I could possibly point a finger at (but don't
> believe it is), is that the slurm masters (slurmctld) are only on gig-E.
> > >>>>>
> > >>>>> I'm half thinking of opening a TT, but was hoping to get more
> information (and possibly not increase the logging of slurm, which is my
> only next idea).
> > >>>>>
> > >>>>> Thanks for your thoughts Chris.
> > >>>>>
> > >>>>> Timothy
> > >>>
> > >>>
> > >>> --
> > >>> Morris "Moe" Jette
> > >>> CTO, SchedMD LLC
> > >>> Commercial Slurm Development and Support
> > >
> > >
> > > --
> > > Morris "Moe" Jette
> > > CTO, SchedMD LLC
> > > Commercial Slurm Development and Support
> >
> >
>
>
