On Sep 2, 2014, at 10:48 AM, Lane, William <william.l...@cshs.org> wrote:

> Ralph,
> 
> There are at least three different permutations of CPU configurations in the 
> cluster
> involved. Some are blades that have two sockets with two cores per Intel CPU 
> (and not all
> sockets are filled). Some are IBM x3550 systems having two sockets with three 
> cores
> per Intel CPU (and not all sockets are populated). All nodes have 
> hyperthreading turned
> on as well.
> 
> I will look into getting the numactl-devel package installed.
> 
> I will try the --bind-to none switch again. For some reason the 
> --hetero-nodes switch wasn't
> recognized by mpirun. Is the --hetero-nodes switch an MCA parameter?

My bad - I forgot that you are using a very old OMPI version. I think you'll 
need to upgrade, though, as I don't believe something that old will know how to 
handle such a hybrid system. I suspect this may be at the bottom of the problem 
you are seeing.

You'll really need to get up to the 1.8 series, I'm afraid - I'm not sure even 
1.6 can handle this setup.

> 
> Thanks for your help.
> 
> -Bill Lane
> ________________________________________
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
> [r...@open-mpi.org]
> Sent: Saturday, August 30, 2014 7:15 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots 
>   (updated findings)
> 
> hwloc requires the numactl-devel package in addition to the numactl one
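> 
> On CentOS 6 that would be something along the lines of (package name taken 
> from your yum search output; assuming the x86_64 variant is the one you need):
> 
>     yum install numactl-devel
> 
> followed by a rebuild of hwloc/Open MPI so the NUMA support actually gets 
> compiled in.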
> 
> If I understand the email thread correctly, it sounds like you have at least 
> some nodes in your system that have fewer cores than others - is that correct?
> 
>>> Here are the definitions of the two parallel environments tested (with orte 
>>> always failing when
>>> more slots are requested than there are CPU cores on the first node 
>>> allocated to the job by
>>> SGE):
> 
> If that is the situation, then you need to add --hetero-nodes to your cmd 
> line so we look at the actual topology of every node. Otherwise, for 
> scalability reasons, we only look at the first node in the allocation and 
> assume all nodes are the same.
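> 
> As a rough example (the process count and executable are placeholders):
> 
>     mpirun --hetero-nodes -np 48 ./a.out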
> 
> If that isn't the case, then it sounds like we are seeing fewer cores than 
> exist on the system for some reason. You could try installing hwloc 
> independently, and then running "lstopo" to find out what it detects. Another 
> thing you could do is add "-mca plm_base_verbose 100" to your cmd line (I 
> suggest doing that with just a couple of nodes in your allocation) and that 
> will dump the detected topology to stderr.
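> 
> For instance (counts and the test command are placeholders):
> 
>     lstopo
>     mpirun -mca plm_base_verbose 100 -np 2 hostname
> 
> and compare what lstopo reports against the topology mpirun dumps to stderr.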
> 
> I'm surprised the bind-to none option didn't remove the error - it definitely 
> should as we won't be binding when that is given. However, I note that you 
> misspelled it in your reply, so maybe you just didn't type it correctly? It 
> is "--bind-to none" - note the space between the "to" and the "none". You'll 
> take a performance hit, but it should at least run.
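> 
> i.e., something like (again, the count and binary are placeholders):
> 
>     mpirun --bind-to none -np 48 ./a.out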
> 
> 
> 
> On Aug 29, 2014, at 11:29 PM, Lane, William <william.l...@cshs.org> wrote:
> 
>> The --bind-to-none switch didn't help, I'm still getting the same errors.
>> 
>> The only NUMA package installed on the nodes in this CentOS 6.2 cluster is 
>> the
>> following:
>> 
>> numactl-2.0.7-3.el6.x86_64
>> this package is described as: numactl.x86_64 : Library for tuning for Non 
>> Uniform Memory Access machines
>> 
>> Since many of these systems are NUMA systems (with memory local to each 
>> socket) could it be that the correct NUMA libraries aren't 
>> installed?
>> 
>> Here are some of the other NUMA packages available for CentOS 6.x:
>> 
>> yum search numa | less
>> 
>>               Loaded plugins: fastestmirror
>>               Loading mirror speeds from cached hostfile
>>               ============================== N/S Matched: numa 
>> ===============================
>>               numactl-devel.i686 : Development package for building 
>> Applications that use numa
>>               numactl-devel.x86_64 : Development package for building 
>> Applications that use
>>                                    : numa
>>               numad.x86_64 : NUMA user daemon
>>               numactl.i686 : Library for tuning for Non Uniform Memory 
>> Access machines
>>               numactl.x86_64 : Library for tuning for Non Uniform Memory 
>> Access machines
>> 
>> -Bill Lane
>> ________________________________________
>> From: users [users-boun...@open-mpi.org] on behalf of Reuti 
>> [re...@staff.uni-marburg.de]
>> Sent: Thursday, August 28, 2014 3:27 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots 
>> (updated findings)
>> 
>> On 28.08.2014 at 10:09, Lane, William wrote:
>> 
>>> I have some updates on these issues and some test results as well.
>>> 
>>> We upgraded OpenMPI to the latest version 1.8.2, but when submitting jobs 
>>> via the SGE orte parallel environment received
>>> errors whenever more slots are requested than there are actual cores on the 
>>> first node allocated to the job.
>> 
>> Does "-bind-to none" help? Binding is switched on by default from Open MPI 
>> 1.8 onwards.
>> 
>> 
>>> The btl tcp,self switch passed to mpirun made a significant difference in 
>>> performance, as shown below:
>>> 
>>> Even with the oversubscribe option, the memory mapping errors still 
>>> persist. On 32-core nodes, with an HPL run compiled for openmpi/1.8.2, it 
>>> reliably starts failing at 20 cores allocated. Note that I tested with 'btl 
>>> tcp,self' defined and it slows the solve down by about a factor of 2 on a 
>>> quick solve. The results on a larger solve would probably be more dramatic:
>>> - Quick HPL 16 core with SM: ~19GFlops
>>> - Quick HPL 16 core without SM: ~10GFlops
>>> 
>>> Unfortunately, a recompiled HPL did not work, but it did give us more 
>>> information (error below). Still trying a couple things.
>>> 
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>> 
>>> Bind to:     CORE
>>> Node:        csclprd3-0-7
>>> #processes:  2
>>> #cpus:       1
>>> 
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
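>>> 
>>> A hedged example of what that override might look like on the 1.8 series 
>>> (treat the exact qualifier spelling as an assumption and check `mpirun 
>>> --help` on your install; the count and binary are placeholders):
>>> 
>>>     mpirun --bind-to core:overload-allowed -np 32 ./xhpl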
>>> 
>>> When using the SGE make parallel environment to submit jobs everything 
>>> worked perfectly.
>>> I noticed when using the make PE, the number of slots allocated from each 
>>> node to the job
>>> corresponded to the number of CPUs and disregarded any additional cores 
>>> within a CPU and
>>> any hyperthreading cores.
>> 
>> For SGE the hyperthreading cores count as normal cores. In principle it's 
>> possible to have an RQS defined in SGE (`qconf -srqsl`) which will limit the 
>> number of cores for the "make" PE, or (better) limit the slot count in each 
>> exechost definition to the physically installed cores (this is what I usually 
>> set up - maybe leaving hyperthreading switched on gives some room for the 
>> kernel processes this way). A sketch of the latter is shown below.
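>> 
>> A minimal, hedged sketch of the exechost route (the host name is taken from 
>> your error output; the slot count is a placeholder for that node's physical 
>> core count):
>> 
>>     # cap schedulable slots on the host at its physical core count
>>     qconf -mattr exechost complex_values slots=8 csclprd3-0-7
>> 
>> After that, SGE will not hand out more than 8 slots on that host regardless 
>> of how many hyperthread "cores" the kernel reports.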
>> 
>> 
>>> Here are the definitions of the two parallel environments tested (with orte 
>>> always failing when
>>> more slots are requested than there are CPU cores on the first node 
>>> allocated to the job by
>>> SGE):
>>> 
>>> [root@csclprd3 ~]# qconf -sp orte
>>> pe_name            orte
>>> slots              9999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    $fill_up
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>> accounting_summary TRUE
>>> qsort_args         NONE
>>> 
>>> [root@csclprd3 ~]# qconf -sp make
>>> pe_name            make
>>> slots              999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    NONE
>>> stop_proc_args     NONE
>>> allocation_rule    $round_robin
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>> accounting_summary TRUE
>>> qsort_args         NONE
>>> 
>>> Although everything seems to work with the make PE, I'd still like
>>> to know why, because on a much older version of Open MPI, loaded
>>> on an older version of CentOS, SGE and ROCKS, using all physical
>>> cores as well as all hyperthreads was never a problem (even on NUMA
>>> nodes).
>>> 
>>> What is the recommended SGE parallel environment definition for
>>> OpenMPI 1.8.2?
>> 
>> Whether you prefer $fill_up or $round_robin is up to you - do you want all 
>> your processes packed onto the fewest machines, or spread around the 
>> cluster? If there is much communication it may be better on fewer machines, 
>> but if each process does heavy I/O to the local scratch disk, spreading them 
>> around may be the preferred choice. This makes no difference to Open MPI, 
>> as the generated $PE_HOSTFILE contains just the list of granted slots. 
>> A $fill_up allocation will of course fill the first node, including the 
>> hyperthreading cores, before moving to the next machine (`man sge_pe`); see 
>> the example below.
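>> 
>> For example, a submission of this general form (the script name is a 
>> placeholder):
>> 
>>     qsub -pe orte 32 ./run_job.sh
>> 
>> gets its 32 slots packed onto as few hosts as possible under $fill_up, 
>> whereas the same request against the make PE ($round_robin) hands out the 
>> slots one host at a time in turn.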
>> 
>> -- Reuti
>> 
>> 
>>> I apologize for the length of this, but I thought it best to provide more
>>> information than less.
>>> 
>>> Thank you in advance,
>>> 
>>> -Bill Lane
>>> 
>>> ________________________________________
>>> From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres 
>>> (jsquyres) [jsquy...@cisco.com]
>>> Sent: Friday, August 08, 2014 5:25 AM
>>> To: Open MPI User's List
>>> Subject: Re: [OMPI users] Mpirun 1.5.4  problems when request > 28 slots
>>> 
>>> On Aug 8, 2014, at 1:24 AM, Lane, William <william.l...@cshs.org> wrote:
>>> 
>>>> Using the "--mca btl tcp,self" switch to mpirun solved all the issues (in 
>>>> addition to
>>>> the requirement to include the --mca btl_tcp_if_include eth0 switch). I 
>>>> believe
>>>> the "--mca btl tcp,self" switch limits inter-process communication within 
>>>> a node to using the TCP
>>>> loopback rather than shared memory.
>>> 
>>> Correct.  You will not be using shared memory for MPI communication at all 
>>> -- just TCP.
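>>> 
>>> (For reference, the cmd line being discussed is of the general form
>>> 
>>>     mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -np 16 ./a.out
>>> 
>>> with the process count and executable as placeholders.)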
>>> 
>>>> I should also point out that all of the nodes
>>>> on this cluster feature NUMA architecture.
>>>> 
>>>> Will using the "--mca btl tcp,self" switch to mpirun result in any 
>>>> degraded performance
>>>> issues over using shared memory?
>>> 
>>> Generally yes, but it depends on your application.  If your application 
>>> does very little MPI communication, then the difference between shared 
>>> memory and TCP is likely negligible.
>>> 
>>> I'd strongly suggest two things:
>>> 
>>> - Upgrade to at least Open MPI 1.6.5 (1.8.x would be better, if possible)
>>> - Run your program through a memory-checking debugger such as Valgrind
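>>> 
>>> A typical way to do the latter under Open MPI (the binary name is a 
>>> placeholder) is:
>>> 
>>>     mpirun -np 4 valgrind --leak-check=full ./your_mpi_app
>>> 
>>> and then inspect each rank's Valgrind output for invalid reads/writes.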
>>> 
>>> Seg faults like you initially described can be caused by errors in your MPI 
>>> application itself -- the fact that using TCP only (and not shared memory) 
>>> avoids the segvs does not mean that the issue is actually fixed; it may 
>>> well mean that the error is still there, but is happening in a case that 
>>> doesn't seem to cause enough damage to cause a segv.
>>> 
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
