I've seen similar behaviour on another system about a year ago, and it was due to socket limits. We fixed it by implementing the high-throughput cluster suggestions.
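For reference, that tuning boils down to raising a handful of kernel limits on the Slurm hosts. A sketch of the sort of settings involved (the values here are illustrative examples, not the guide's exact recommendations; see http://slurm.schedmd.com/big_sys.html):

```
# /etc/sysctl.conf -- example values only
net.core.somaxconn = 4096           # listen() backlog for slurmctld/slurmd sockets
net.ipv4.tcp_max_syn_backlog = 4096 # half-open connection backlog
fs.file-max = 262144                # system-wide open-file limit
```

Apply with `sysctl -p`, and raise the per-process open-file limit (`ulimit -n`) for the Slurm daemons as well.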
Antony

On 22 Sep 2015 15:56, "Moe Jette" <[email protected]> wrote:
>
> I suspect that you are hitting some Linux system limit, such as open
> files or socket backlog. For information on how to address this, see:
> http://slurm.schedmd.com/big_sys.html
>
> Quoting Timothy Brown <[email protected]>:
>
>> Hi Moe,
>>
>>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
>>>
>>> What version of Slurm?
>>
>> We're currently running 14.11.7
>>
>>> How many tasks/ranks in your job?
>>
>> I've been trying 500 nodes with 12 tasks per node, giving a total of
>> 6000. After this failed I started fiddling with fewer (100 nodes and
>> ramping up: 200, 300, 400 ...). It seems anything over 300 is touch
>> and go.
>>
>>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
>>
>> Not reliably.
>>
>> $ cat hostname.sh
>> #!/bin/bash
>> #
>> #SBATCH --job-name=OSU_Int
>> #SBATCH --qos=admin
>> #SBATCH --time=00:15:00
>> #SBATCH --nodes=500
>> #SBATCH --ntasks-per-node=12
>> #SBATCH --account=crcbenchmark
>> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
>>
>> srun hostname
>>
>> $ sbatch hostname.sh
>> Submitted batch job 976034
>> $ wc -l hostname_976034.txt
>> 5992 hostname_976034.txt
>> $ grep -v ^node hostname_976034.txt
>> srun: error: Task launch for 976034.0 failed on node node0453: Socket timed out on send/recv operation
>> srun: error: Application launch failed: Socket timed out on send/recv operation
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> Any thoughts?
>>
>> Thanks
>> Timothy
>>
>>> Quoting Ralph Castain <[email protected]>:
>>>
>>>> This sounds like something in Slurm - I don't know how srun would know
>>>> to emit a message if the app was failing to open a socket between its
>>>> own procs.
>>>>
>>>> Try starting the OMPI job with "mpirun" instead of srun and see if it
>>>> has the same issue.
>>>> If not, then that's pretty convincing that it's Slurm.
>>>>
>>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown <[email protected]> wrote:
>>>>>
>>>>> Hi Chris,
>>>>>
>>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> wrote:
>>>>>>
>>>>>> On 22/09/15 07:17, Timothy Brown wrote:
>>>>>>
>>>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
>>>>>>
>>>>>> Have you tried Intel MPI's native PMI start-up mode?
>>>>>>
>>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY to
>>>>>> the path to the Slurm libpmi.so file and then you should be able to
>>>>>> use srun to launch your job instead.
>>>>>
>>>>> Yeap, to the same effect. Here's what it gives:
>>>>>
>>>>> srun --mpi=pmi2 /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
>>>>> srun: error: Task launch for 973564.0 failed on node node0453: Socket timed out on send/recv operation
>>>>> srun: error: Application launch failed: Socket timed out on send/recv operation
>>>>>
>>>>>> More here:
>>>>>>
>>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
>>>>>>
>>>>>>> If I switch to OpenMPI the error is:
>>>>>>
>>>>>> Which version, and was it built with --with-slurm and (if your
>>>>>> version is not too ancient) --with-pmi=/path/to/slurm/install ?
>>>>>
>>>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to
>>>>> EasyBuild). Yes, we included PMI and the Slurm option.
>>>>> Our configure statement was:
>>>>>
>>>>> module purge
>>>>> module load slurm/slurm
>>>>> module load gcc/5.1.0
>>>>> ./configure \
>>>>>   --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
>>>>>   --with-threads=posix \
>>>>>   --enable-mpi-thread-multiple \
>>>>>   --with-slurm \
>>>>>   --with-pmi=/curc/slurm/slurm/current/ \
>>>>>   --enable-static \
>>>>>   --enable-wrapper-rpath \
>>>>>   --enable-sensors \
>>>>>   --enable-mpi-ext=all \
>>>>>   --with-verbs
>>>>>
>>>>> It's got me scratching my head, as I started off thinking it was an
>>>>> MPI issue, and spent a while getting Intel's hydra and OpenMPI's oob
>>>>> to go over IB instead of gig-E. This increased the success rate, but
>>>>> we were still failing.
>>>>>
>>>>> Tried out a pure PMI (version 1) code (init, rank, size, fini), which
>>>>> worked a lot of the time. Which made me think it was MPI again!
>>>>> However, it fails often enough to say it's not MPI. The PMI v2 code I
>>>>> wrote gives the wrong results for rank and world size, so I'm sweeping
>>>>> that under the rug until I understand it!
>>>>>
>>>>> Just wondering if anybody has seen anything like this. Am happy to
>>>>> share our conf file if that helps.
>>>>>
>>>>> The only other thing I could possibly point a finger at (but don't
>>>>> believe is the cause) is that the slurm masters (slurmctld) are only
>>>>> on gig-E.
>>>>>
>>>>> I'm half thinking of opening a TT, but was hoping to get more
>>>>> information first (and possibly avoid increasing Slurm's logging,
>>>>> which is my only next idea).
>>>>>
>>>>> Thanks for your thoughts Chris.
>>>>>
>>>>> Timothy
>>>
>>> --
>>> Morris "Moe" Jette
>>> CTO, SchedMD LLC
>>> Commercial Slurm Development and Support
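For anyone hitting the same error: before cranking up slurmd/slurmctld logging, the limits Moe points at can be read straight off the controller and a compute node. A minimal pre-flight sketch (the /proc paths are standard Linux; the script itself is an illustration, not something from big_sys.html):

```shell
#!/bin/bash
# Print the Linux limits most often implicated in
# "Socket timed out on send/recv operation" at large srun scale.
echo "somaxconn=$(cat /proc/sys/net/core/somaxconn)"                      # listen() backlog cap
echo "tcp_max_syn_backlog=$(cat /proc/sys/net/ipv4/tcp_max_syn_backlog)"  # half-open conn backlog
echo "file-max=$(cat /proc/sys/fs/file-max)"                              # system-wide fd limit
echo "nofile soft=$(ulimit -Sn) hard=$(ulimit -Hn)"                       # per-process fd limit
```

If somaxconn on the slurmctld host is still at the old kernel default of 128, a 6000-task launch storm can overflow the accept backlog and produce exactly this kind of timeout.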
