Hi Moe,

> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
> 
> 
> What version of Slurm?

We're currently running 14.11.7.

> How many tasks/ranks in your job?

I've been trying 500 nodes with 12 tasks per node, for a total of 6000 tasks. 
After this failed I started fiddling with fewer (100 nodes and ramping up: 
200, 300, 400 ..). It seems anything over 300 is touch and go.
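For what it's worth, the ramp-up can be scripted as a loop over node counts. This is only a hypothetical sketch mirroring the batch file below; `echo` stands in for the real `sbatch` call so you can eyeball the submissions first:

```shell
# Hypothetical ramp-up sweep; replace 'echo' with the real sbatch
# once you're happy with the generated commands.
for nodes in 100 200 300 400 500; do
  echo sbatch --nodes="$nodes" --ntasks-per-node=12 hostname.sh
done
```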


> Can you run a non-MPI job of the same size (i.e. srun hostname)?

Not reliably.

$ cat hostname.sh 
#!/bin/bash
#
#SBATCH --job-name=OSU_Int
#SBATCH --qos=admin
#SBATCH --time=00:15:00
#SBATCH --nodes=500
#SBATCH --ntasks-per-node=12
#SBATCH --account=crcbenchmark
#SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt


srun hostname

$ sbatch hostname.sh
Submitted batch job 976034
$ wc -l hostname_976034.txt
5992 hostname_976034.txt
$ grep -v ^node hostname_976034.txt
srun: error: Task launch for 976034.0 failed on node node0453: Socket timed out 
on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
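Since the file has 5992 lines instead of 6000, the failing node's tasks never reported at all. A hedged sketch for diffing the reported hostnames against the allocation (the node names below are made-up sample data; inside the job, allocated.txt would come from `scontrol show hostnames "$SLURM_JOB_NODELIST" | sort -u`):

```shell
# Sample data stands in for the real job output; in the allocation you
# would build these from hostname_976034.txt and $SLURM_JOB_NODELIST.
printf 'node0001\nnode0002\n' > reported.txt            # sort -u of srun output
printf 'node0001\nnode0002\nnode0453\n' > allocated.txt # scontrol show hostnames
# Lines only in allocated.txt = nodes whose tasks never launched.
comm -13 reported.txt allocated.txt
```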

Any thoughts?

Thanks
Timothy

> 
> Quoting Ralph Castain <[email protected]>:
>> This sounds like something in Slurm - I don’t know how srun would know to 
>> emit a message if the app was failing to open a socket between its own procs.
>> 
>> Try starting the OMPI job with “mpirun” instead of srun and see if it has 
>> the same issue. If not, then that’s pretty convincing that it’s slurm.
>> 
>> 
>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown <[email protected]> 
>>> wrote:
>>> 
>>> 
>>> Hi Chris,
>>> 
>>> 
>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> 
>>>> wrote:
>>>> 
>>>> 
>>>> On 22/09/15 07:17, Timothy Brown wrote:
>>>> 
>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
>>>> 
>>>> Have you tried Intel MPI's native PMI start up mode?
>>>> 
>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY to the
>>>> path to the Slurm libpmi.so file and then you should be able to use srun
>>>> to launch your job instead.
>>>> 
>>> 
>>> Yeap, to the same effect. Here's what it gives:
>>> 
>>> srun --mpi=pmi2 
>>> /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
>>> srun: error: Task launch for 973564.0 failed on node node0453: Socket timed 
>>> out on send/recv operation
>>> srun: error: Application launch failed: Socket timed out on send/recv 
>>> operation
>>> 
>>> 
>>> 
>>>> More here:
>>>> 
>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
>>>> 
>>>>> If I switch to OpenMPI the error is:
>>>> 
>>>> Which version, and was it built with --with-slurm and (if your
>>>> version is not too ancient) --with-pmi=/path/to/slurm/install ?
>>> 
>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to EasyBuild). 
>>> Yes, we included PMI and the Slurm option. Our configure statement was:
>>> 
>>> module purge
>>> module load slurm/slurm
>>> module load gcc/5.1.0
>>> ./configure  \
>>> --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
>>> --with-threads=posix \
>>> --enable-mpi-thread-multiple \
>>> --with-slurm \
>>> --with-pmi=/curc/slurm/slurm/current/ \
>>> --enable-static \
>>> --enable-wrapper-rpath \
>>> --enable-sensors \
>>> --enable-mpi-ext=all \
>>> --with-verbs
>>> 
>>> It's got me scratching my head: I started off thinking it was an MPI 
>>> issue, and spent a while getting Intel's Hydra and OpenMPI's OOB to go 
>>> over IB instead of gig-E. That increased the success rate, but we were 
>>> still failing.
>>> 
>>> Tried out a pure PMI (version 1) code (init, rank, size, fini), which 
>>> worked a lot of the time. That made me think it was MPI again! However, 
>>> it fails often enough to say it's not MPI. The PMI v2 code I wrote gives 
>>> the wrong results for rank and world size, so I'm sweeping that under the 
>>> rug until I understand it!
>>> 
>>> Just wondering if anybody has seen anything like this. Am happy to share 
>>> our conf file if that helps.
>>> 
>>> The only other thing I could possibly point a finger at (but don't believe 
>>> it is), is that the slurm masters (slurmctld) are only on gig-E.
>>> 
>>> I'm half thinking of opening a TT, but was hoping to get more information 
>>> first (and possibly avoid increasing Slurm's logging, which is my only 
>>> other idea).
>>> 
>>> Thanks for your thoughts Chris.
>>> 
>>> Timothy
> 
> 
> -- 
> Morris "Moe" Jette
> CTO, SchedMD LLC
> Commercial Slurm Development and Support
