I tried a couple more things this afternoon.

A 250-node job (12 tasks per node); however, before running srun I set
PMI_TIME=4000. This is the error I received:

size = 3000, rank = 2853
size = 3000, rank = 1853
size = 3000, rank = 2353
size = 3000, rank = 853
srun: error: timeout waiting for task launch, started 2976 of 3000 tasks
srun: Job step 977359.1 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
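For reference, the launch was along these lines (binary path trimmed; as I understand it, PMI_TIME is in microseconds with a default of 500, and spreads the tasks' PMI communication out in time so srun isn't swamped at launch):

```shell
# PMI_TIME (microseconds, default 500) spreads out the PMI
# communication from the tasks so srun isn't overwhelmed at launch
export PMI_TIME=4000
srun --mpi=pmi2 ./osu_alltoall
```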


So, digging more into this, I worked out our system topology.

slurm[1,2] are on a VM, connected via gigE.
The compute nodes have both gigE and IB (their hostnames are nodeXXXX and nodeXXXXib).
Slurm allocates jobs ($SLURM_NODELIST) to the nodeXXXX names.

A tcpdump of the job startup shows it's all going over eth0, not ib0.

Could this possibly be it?

Looking at Stampede (the only other large Slurm deployment I have access to),
it seems everything there goes over IB.
Also, comparing our /proc/sys/net (sysctl) settings, we are almost identical.

Thoughts? Comments?

Thanks
Timothy

> On Sep 23, 2015, at 12:14 PM, Timothy Brown <[email protected]> 
> wrote:
> 
> Hi Chansup,
> 
> Yes, that's way up there too:
> 
> node0202 ~$ ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 185698
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1048576
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 2048000
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 399360
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
> 
> 
> Interestingly enough, I wrote an (ugly hack) bit of C code to open sockets;
> it just calls:
> - socket()
> - bind()
> - listen()
> Nothing is ever read/written and I am able to open 28226 sockets on a node. 
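(For what it's worth, 28226 is suspiciously close to the size of the default ephemeral port range, roughly 28,000 ports, so the ceiling there is likely port exhaustion from binding port 0 rather than any fd or socket limit. A quick check, assuming Linux defaults:)

```shell
# each bind() to port 0 consumes one ephemeral port; the default
# range (32768-61000) allows roughly 28k concurrent listeners
cat /proc/sys/net/ipv4/ip_local_port_range
```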
> 
> I dare say the software stack looks good. Could it possibly be hardware 
> related? Bad IB or???? (I'm starting to grasp at straws here!).
> 
> Thanks
> Timothy
> 
>> On Sep 23, 2015, at 11:41 AM, CB <[email protected]> wrote:
>> 
>> Hi Tim,
>> 
>> I'm not sure if you've checked the "ulimit -n" value for the user who runs the 
>> job.
>> In my experience, I had to bump up the limit much higher than the default 
>> 1024.
>> 
>> Just my 2 cents,
>> - Chansup
>> 
>> On Wed, Sep 23, 2015 at 9:54 AM, Timothy Brown 
>> <[email protected]> wrote:
>> Hi Moe and Antony,
>> 
>> Thanks for the link. On further thought, I think you're right in saying it's 
>> on the Linux network side. Looking at our system we have:
>> /proc/sys/fs/file-max: 2346778
>> /proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
>> /proc/sys/net/core/somaxconn: 128
>> 
>> So we bumped somaxconn up to 2048 across the cluster.
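For the record, the bump was just the standard sysctl knob:

```shell
# raise the accept() backlog cap from the default of 128
sysctl -w net.core.somaxconn=2048
# (equivalently: echo 2048 > /proc/sys/net/core/somaxconn)
```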
>> 
>> Of course we are in production so I had to wait my (slightly prioritized) 
>> time in the queue.
>> 
>> However the jobs still fail with the same error:
>> 
>> srun: error: Task launch for 976058.0 failed on node node1505: Socket timed 
>> out on send/recv operation
>> 
>> Out of the 500 nodes, we get this error for 158 of them (in case the 
>> numbers are helpful).
>> 
>> The only other idea I have is related to total TCP memory, we currently have 
>> it set to:
>> 
>> /proc/sys/net/ipv4/tcp_mem
>> 2228352 2971136 4456704
>> 
>> Which I interpret as approximately 8G, 11G and 17G, while each node has a 
>> total of 24G of RAM. So I'm thinking these values are OK. However, looking at 
>> other clusters (Stampede), it's set to:
>> 
>> c560-204.stampede(2)$ cat /proc/sys/net/ipv4/tcp_mem
>> 16777216        16777216        16777216
>> 
>> Which I interpret as 64G (wow!) while they have 32G per node. So am I 
>> interpreting tcp_mem incorrectly?
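(My arithmetic, for what it's worth: tcp_mem counts pages rather than bytes, so assuming 4 KiB pages:)

```shell
# tcp_mem thresholds are in pages; convert to GiB assuming 4 KiB pages
for pages in 2228352 2971136 4456704 16777216; do
    echo "$pages pages ~= $(( pages * 4096 / 1024 / 1024 / 1024 )) GiB"
done
```

That gives roughly 8, 11, 17 and 64 GiB, matching the reading above.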
>> 
>> I'm currently waiting in the queue again, but will try this 16777216 for all 
>> values in tcp_mem when I get a job running.
>> 
>> We have a txqueuelen of 1024 for the IB interfaces and I don't want to touch 
>> that.
>> 
>> Just about everything else I check in proc regarding the network seems ok.
>> 
>> Does anybody have any further thoughts or pointers? Thanks!
>> 
>> Timothy
>> 
>> 
>>> On Sep 22, 2015, at 8:57 AM, Moe Jette <[email protected]> wrote:
>>> 
>>> 
>>> I suspect that you are hitting some Linux system limit, such as open files, 
>>> or socket backlog. For information on how to address, see:
>>> http://slurm.schedmd.com/big_sys.html
>>> 
>>> 
>>> Quoting Timothy Brown <[email protected]>:
>>> 
>>>> Hi Moe,
>>>> 
>>>>> On Sep 21, 2015, at 10:02 PM, Moe Jette <[email protected]> wrote:
>>>>> 
>>>>> 
>>>>> What version of Slurm?
>>>> 
>>>> We're currently running 14.11.7
>>>> 
>>>>> How many tasks/ranks in your job?
>>>> 
>>>> I've been trying 500 nodes with 12 tasks per node, giving a total of 6000. 
>>>> Although after this failed I started
>>>> fiddling with fewer (100 nodes and ramping up: 200, 300, 400, ...). It seems 
>>>> anything over 300 nodes is touch and go.
>>>> 
>>>> 
>>>>> Can you run a non-MPI job of the same size (i.e. srun hostname)?
>>>> 
>>>> Not reliably.
>>>> 
>>>> $ cat hostname.sh
>>>> #!/bin/bash
>>>> #
>>>> #SBATCH --job-name=OSU_Int
>>>> #SBATCH --qos=admin
>>>> #SBATCH --time=00:15:00
>>>> #SBATCH --nodes=500
>>>> #SBATCH --ntasks-per-node=12
>>>> #SBATCH --account=crcbenchmark
>>>> #SBATCH --output=/lustre/janus_scratch/tibr1099/hostname_%A.txt
>>>> 
>>>> 
>>>> srun hostname
>>>> 
>>>> $ sbatch hostname.sh
>>>> Submitted batch job 976034
>>>> $ wc -l hostname_976034.txt
>>>> 5992 hostname_976034.txt
>>>> $ grep -v ^node hostname_976034.txt
>>>> srun: error: Task launch for 976034.0 failed on node node0453: Socket 
>>>> timed out on send/recv operation
>>>> srun: error: Application launch failed: Socket timed out on send/recv 
>>>> operation
>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>> srun: error: Timed out waiting for job step to complete
>>>> 
>>>> Any thoughts?
>>>> 
>>>> Thanks
>>>> Timothy
>>>> 
>>>>> 
>>>>> Quoting Ralph Castain <[email protected]>:
>>>>>> This sounds like something in Slurm - I don’t know how srun would know 
>>>>>> to emit a message if the app was failing to open a socket between its 
>>>>>> own procs.
>>>>>> 
>>>>>> Try starting the OMPI job with “mpirun” instead of srun and see if it 
>>>>>> has the same issue. If not, then that’s pretty convincing that it’s 
>>>>>> slurm.
>>>>>> 
>>>>>> 
>>>>>>> On Sep 21, 2015, at 7:26 PM, Timothy Brown 
>>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> Hi Chris,
>>>>>>> 
>>>>>>> 
>>>>>>>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel 
>>>>>>>> <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 22/09/15 07:17, Timothy Brown wrote:
>>>>>>>> 
>>>>>>>>> This is using mpiexec.hydra with slurm as the bootstrap.
>>>>>>>> 
>>>>>>>> Have you tried Intel MPI's native PMI start up mode?
>>>>>>>> 
>>>>>>>> You just need to set the environment variable I_MPI_PMI_LIBRARY to the
>>>>>>>> path to the Slurm libpmi.so file and then you should be able to use 
>>>>>>>> srun
>>>>>>>> to launch your job instead.
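Something along these lines (library path illustrative):

```shell
# point Intel MPI at Slurm's PMI implementation, then use srun directly
export I_MPI_PMI_LIBRARY=/path/to/slurm/lib/libpmi.so
srun ./my_mpi_app
```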
>>>>>>>> 
>>>>>>> 
>>>>>>> Yeap, to the same effect. Here's what it gives:
>>>>>>> 
>>>>>>> srun --mpi=pmi2 
>>>>>>> /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
>>>>>>> srun: error: Task launch for 973564.0 failed on node node0453: Socket 
>>>>>>> timed out on send/recv operation
>>>>>>> srun: error: Application launch failed: Socket timed out on send/recv 
>>>>>>> operation
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> More here:
>>>>>>>> 
>>>>>>>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
>>>>>>>> 
>>>>>>>>> If I switch to OpenMPI the error is:
>>>>>>>> 
>>>>>>>> Which version, and was it built with --with-slurm and (if your
>>>>>>>> version is not too ancient) --with-pmi=/path/to/slurm/install ?
>>>>>>> 
>>>>>>> Yeap. 1.8.5 (for 1.10 we're going to try and move everything to 
>>>>>>> EasyBuild). Yes we included PMI and the Slurm option. Our configure 
>>>>>>> statement was:
>>>>>>> 
>>>>>>> module purge
>>>>>>> module load slurm/slurm
>>>>>>> module load gcc/5.1.0
>>>>>>> ./configure  \
>>>>>>> --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
>>>>>>> --with-threads=posix \
>>>>>>> --enable-mpi-thread-multiple \
>>>>>>> --with-slurm \
>>>>>>> --with-pmi=/curc/slurm/slurm/current/ \
>>>>>>> --enable-static \
>>>>>>> --enable-wrapper-rpath \
>>>>>>> --enable-sensors \
>>>>>>> --enable-mpi-ext=all \
>>>>>>> --with-verbs
>>>>>>> 
>>>>>>> It's got me scratching my head. I started off thinking it was an MPI 
>>>>>>> issue and spent a while getting Intel's hydra and OpenMPI's oob to go over 
>>>>>>> IB instead of gigE. This increased the success rate, but we were still 
>>>>>>> failing.
>>>>>>> 
>>>>>>> Tried out a pure PMI (version 1) code (init, rank, size, fini), which 
>>>>>>> worked most of the time, which made me think it was MPI again! 
>>>>>>> However that fails enough to say it's not MPI. The PMI v2 code I wrote, 
>>>>>>> gives the wrong results for rank and world size, so I'm sweeping that 
>>>>>>> under the rug until I understand it!
>>>>>>> 
>>>>>>> Just wondering if anybody has seen anything like this. Am happy to 
>>>>>>> share our conf file if that helps.
>>>>>>> 
>>>>>>> The only other thing I could possibly point a finger at (but don't 
>>>>>>> believe it is), is that the slurm masters (slurmctld) are only on gig-E.
>>>>>>> 
>>>>>>> I'm half thinking of opening a TT, but was hoping to get more 
>>>>>>> information (and possibly not increase the logging of slurm, which is 
>>>>>>> my only next idea).
>>>>>>> 
>>>>>>> Thanks for your thoughts Chris.
>>>>>>> 
>>>>>>> Timothy
>>>>> 
>>>>> 
>>>>> --
>>>>> Morris "Moe" Jette
>>>>> CTO, SchedMD LLC
>>>>> Commercial Slurm Development and Support
>>> 
>>> 
>>> --
>>> Morris "Moe" Jette
>>> CTO, SchedMD LLC
>>> Commercial Slurm Development and Support
>> 
>> 
> 
