Hi Moe and Antony,

Thanks for the link. On further thought, I think you're right that it's on 
the Linux network side. Looking at our system we have:
/proc/sys/fs/file-max: 2346778
/proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
/proc/sys/net/core/somaxconn: 128

So we bumped somaxconn up to 2048 across the cluster.
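
For reference, this is roughly how I checked and applied it (the /proc paths 
are the real ones above; the cluster-wide pdsh line is just a sketch, our 
actual hostlist differs):

```shell
# Snapshot the relevant limits (readable without root):
for f in /proc/sys/fs/file-max \
         /proc/sys/net/ipv4/tcp_max_syn_backlog \
         /proc/sys/net/core/somaxconn; do
    printf '%s: %s\n' "$f" "$(cat "$f")"
done

# The change itself (needs root; add to /etc/sysctl.conf to survive reboots):
# sysctl -w net.core.somaxconn=2048
#
# Cluster-wide, something like (hypothetical hostlist):
# pdsh -w node[1001-1500] 'sysctl -w net.core.somaxconn=2048'
```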

Of course we are in production so I had to wait my (slightly prioritized) time 
in the queue.

However, the jobs still fail with the same error:

srun: error: Task launch for 976058.0 failed on node node1505: Socket timed out 
on send/recv operation

Out of the 500 nodes, we get this error (in this case) on 158 of them, in case 
the numbers are helpful.

The only other idea I have relates to total TCP memory; we currently have it 
set to:

/proc/sys/net/ipv4/tcp_mem 
2228352 2971136 4456704

Which I interpret as approximately 8G, 11G and 17G, while each node has a 
total of 24G of RAM. So I'm thinking these values are OK. However, looking at 
another cluster (Stampede), it's set to:

c560-204.stampede(2)$ cat /proc/sys/net/ipv4/tcp_mem
16777216        16777216        16777216

Which I interpret as 64G (wow!), while they have 32G of RAM per node. So am I 
interpreting tcp_mem incorrectly?
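
If I'm reading the kernel docs right, tcp_mem is counted in pages (4 KiB on 
our x86_64 nodes; `getconf PAGESIZE` to confirm), so the conversion I'm doing 
above works out to:

```shell
# tcp_mem is in pages, not bytes; assuming 4096-byte pages:
echo $(( 2228352 * 4096 / 1024 / 1024 / 1024 ))    # our min       -> 8  (GiB, ~8.5 truncated)
echo $(( 2971136 * 4096 / 1024 / 1024 / 1024 ))    # our pressure  -> 11 (GiB)
echo $(( 4456704 * 4096 / 1024 / 1024 / 1024 ))    # our max       -> 17 (GiB)
echo $(( 16777216 * 4096 / 1024 / 1024 / 1024 ))   # Stampede      -> 64 (GiB)
```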

I'm currently waiting in the queue again, but will try 16777216 for all three 
tcp_mem values when I get a job running.
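
The plan, when I get a node, is something along these lines (needs root; the 
pdsh hostlist is made up, ours differs):

```shell
# Mirror Stampede: min/pressure/max all 16777216 pages (64G at 4 KiB pages).
# Runtime-only change; persist via /etc/sysctl.conf if it turns out to help.
sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"

# Cluster-wide, something like (hypothetical hostlist):
# pdsh -w node[1001-1500] 'sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"'
```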

We have a txqueuelen of 1024 on the IB interfaces and I don't want to touch 
that.

Just about everything else I've checked in /proc regarding the network looks OK.

Does anybody have any further thoughts or pointers? Thanks!

Timothy


> On Sep 22, 2015, at 8:57 AM, Moe Jette <[email protected]> wrote:
> 
> 
> I suspect that you are hitting some Linux system limit, such as open files, 
> or socket backlog. For information on how to address, see:
> http://slurm.schedmd.com/big_sys.html
> 
> 
> Quoting Timothy Brown <[email protected]>:
> 
>> Hi Moe,
>> 
>>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
>>> 
>>> 
>>> What version of Slurm?
>> 
>> We're currently running 14.11.7
>> 
>>> How many tasks/ranks in your job?
>> 
>> I've been trying 500 nodes with 12 tasks per node, giving a total of 6000. 
>> Although after this failed I started fiddling with fewer (100 nodes and 
>> ramping up: 200, 300, 400, ...). It seems anything over 300 is touch and go.
>> 
>> 
>>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
>> 
>> Not reliably.
>> 
>> $ cat hostname.sh
>> #!/bin/bash
>> #
>> #SBATCH --job-name=OSU_Int
>> #SBATCH --qos=admin
>> #SBATCH --time=00:15:00
>> #SBATCH --nodes=500
>> #SBATCH --ntasks-per-node=12
>> #SBATCH --account=crcbenchmark
>> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
>> 
>> 
>> srun hostname
>> 
>> $ sbatch hostname.sh
>> Submitted batch job 976034
>> $ wc -l hostname_976034.txt
>> 5992 hostname_976034.txt
>> $ grep -v ^node hostname_976034.txt
>> srun: error: Task launch for 976034.0 failed on node node0453: Socket timed 
>> out on send/recv operation
>> srun: error: Application launch failed: Socket timed out on send/recv 
>> operation
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>> 
>> Any thoughts?
>> 
>> Thanks
>> Timothy
>> 
>>> 
>>> Quoting Ralph Castain <[email protected]>:
>>>> This sounds like something in Slurm - I don’t know how srun would know to 
>>>> emit a message if the app was failing to open a socket between its own 
>>>> procs.
>>>> 
>>>> Try starting the OMPI job with “mpirun” instead of srun and see if it has 
>>>> the same issue. If not, then that’s pretty convincing that it’s slurm.
>>>> 
>>>> 
>>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown <[email protected]> 
>>>>> wrote:
>>>>> 
>>>>> 
>>>>> Hi Chris,
>>>>> 
>>>>> 
>>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>> 
>>>>>> On 22/09/15 07:17, Timothy Brown wrote:
>>>>>> 
>>>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
>>>>>> 
>>>>>> Have you tried Intel MPI's native PMI start up mode?
>>>>>> 
>>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY to the
>>>>>> path to the Slurm libpmi.so file and then you should be able to use srun
>>>>>> to launch your job instead.
>>>>>> 
>>>>> 
>>>>> Yeap, to the same effect. Here's what it gives:
>>>>> 
>>>>> srun --mpi=pmi2 
>>>>> /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
>>>>> srun: error: Task launch for 973564.0 failed on node node0453: Socket 
>>>>> timed out on send/recv operation
>>>>> srun: error: Application launch failed: Socket timed out on send/recv 
>>>>> operation
>>>>> 
>>>>> 
>>>>> 
>>>>>> More here:
>>>>>> 
>>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
>>>>>> 
>>>>>>> If I switch to OpenMPI the error is:
>>>>>> 
>>>>>> Which version, and was it build with --with-slurm and (if you're
>>>>>> version is not too ancient) --with-pmi=/path/to/slurm/install ?
>>>>> 
>>>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to 
>>>>> EasyBuild). Yes we included PMI and the Slurm option. Our configure 
>>>>> statement was:
>>>>> 
>>>>> module purge
>>>>> module load slurm/slurm
>>>>> module load gcc/5.1.0
>>>>> ./configure  \
>>>>> --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
>>>>> --with-threads=posix \
>>>>> --enable-mpi-thread-multiple \
>>>>> --with-slurm \
>>>>> --with-pmi=/curc/slurm/slurm/current/ \
>>>>> --enable-static \
>>>>> --enable-wrapper-rpath \
>>>>> --enable-sensors \
>>>>> --enable-mpi-ext=all \
>>>>> --with-verbs
>>>>> 
>>>>> It's got me scratching my head, as I started off thinking it was an MPI 
>>>>> issue, spent awhile getting Intel's hydra and OpenMPI's oob to go over IB 
>>>>> instead of gig-e. This increased the success rate, but we were still 
>>>>> failing.
>>>>> 
>>>>> Tried out a pure PMI (version 1) code (init, rank, size, fini), which 
>>>>> worked a lot of the times. Which made me think it was MPI again! However 
>>>>> that fails enough to say it's not MPI. The PMI v2 code I wrote, gives the 
>>>>> wrong results for rank and world size, so I'm sweeping that under the rug 
>>>>> until I understand it!
>>>>> 
>>>>> Just wondering if anybody has seen anything like this. Am happy to share 
>>>>> our conf file if that helps.
>>>>> 
>>>>> The only other thing I could possibly point a finger at (but don't 
>>>>> believe it is), is that the slurm masters (slurmctld) are only on gig-E.
>>>>> 
>>>>> I'm half thinking of opening a TT, but was hoping to get more information 
>>>>> (and possibly not increase the logging of slurm, which is my only next 
>>>>> idea).
>>>>> 
>>>>> Thanks for your thoughts Chris.
>>>>> 
>>>>> Timothy
>>> 
>>> 
>>> --
>>> Morris "Moe" Jette
>>> CTO, SchedMD LLC
>>> Commercial Slurm Development and Support
> 
> 
> -- 
> Morris "Moe" Jette
> CTO, SchedMD LLC
> Commercial Slurm Development and Support
