Ralph,

These latest issues (since 8/28/14) all occurred after we upgraded our cluster
to OpenMPI 1.8.2 on . Maybe I should have started a new thread rather than
tacking these issues onto my existing one.

-Bill Lane

________________________________________
From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
[r...@open-mpi.org]
Sent: Tuesday, September 02, 2014 11:03 AM
To: Open MPI Users
Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
(updated findings)

On Sep 2, 2014, at 10:48 AM, Lane, William <william.l...@cshs.org> wrote:

> Ralph,
>
> There are at least three different permutations of CPU configurations in the 
> cluster
> involved. Some are blades that have two sockets with two cores per Intel CPU 
> (and not all
> sockets are filled). Some are IBM x3550 systems having two sockets with three 
> cores
> per Intel CPU (and not all sockets are populated). All nodes have 
> hyperthreading turned
> on as well.
>
> I will look into getting the numactl-devel package installed.
>
> I will try the --bind-to none switch again. For some reason the
> --hetero-nodes switch wasn't recognized by mpirun. Is the --hetero-nodes
> switch an MCA parameter?

My bad - I forgot that you are using a very old OMPI version. I think you'll 
need to upgrade, though, as I don't believe something that old will know how to 
handle such a hybrid system. I suspect this may be at the bottom of the problem 
you are seeing.

You'll really need to get up to the 1.8 series, I'm afraid - I'm not sure even 
1.6 can handle this setup.
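
Once you're on 1.8, the cmd line should look something like the following - just a
sketch, with the process count and executable as placeholders:

  mpirun -np 32 --hetero-nodes --bind-to none ./a.out

The --hetero-nodes flag makes us read the actual topology of every node instead of
assuming they all match the first one, and --bind-to none (with the space) disables
binding entirely.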

>
> Thanks for your help.
>
> -Bill Lane
> ________________________________________
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
> [r...@open-mpi.org]
> Sent: Saturday, August 30, 2014 7:15 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
> (updated findings)
>
> hwloc requires the numactl-devel package in addition to the numactl one
>
> If I understand the email thread correctly, it sounds like you have at least 
> some nodes in your system that have fewer cores than others - is that correct?
>
>>> Here are the definitions of the two parallel environments tested (with orte 
>>> always failing when
>>> more slots are requested than there are CPU cores on the first node 
>>> allocated to the job by
>>> SGE):
>
> If that is the situation, then you need to add --hetero-nodes to your cmd 
> line so we look at the actual topology of every node. Otherwise, for 
> scalability reasons, we only look at the first node in the allocation and 
> assume all nodes are the same.
>
> If that isn't the case, then it sounds like we are seeing fewer cores than 
> exist on the system for some reason. You could try installing hwloc 
> independently, and then running "lstopo" to find out what it detects. Another 
> thing you could do is add "-mca plm_base_verbose 100" to your cmd line (I 
> suggest doing that with just a couple of nodes in your allocation) and that 
> will dump the detected topology to stderr.
>
> I'm surprised the bind-to none option didn't remove the error - it definitely 
> should as we won't be binding when that is given. However, I note that you 
> misspelled it in your reply, so maybe you just didn't type it correctly? It 
> is "--bind-to none" - note the space between the "to" and the "none". You'll 
> take a performance hit, but it should at least run.
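>
> For example, something along these lines on a small allocation (just a sketch - the
> process count and test program are placeholders):
>
>   lstopo
>   mpirun -np 4 -mca plm_base_verbose 100 --bind-to none hostname
>
> lstopo prints what hwloc detects on the local node, and the verbose output will show
> the topology that was detected for each node in the allocation.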
>
>
>
> On Aug 29, 2014, at 11:29 PM, Lane, William <william.l...@cshs.org> wrote:
>
>> The --bind-to-none switch didn't help, I'm still getting the same errors.
>>
>> The only NUMA package installed on the nodes in this CentOS 6.2 cluster is 
>> the
>> following:
>>
>> numactl-2.0.7-3.el6.x86_64
>> This package is described as: numactl.x86_64 : Library for tuning for Non
>> Uniform Memory Access machines
>>
>> Since many of these systems are NUMA systems (with separate memory address 
>> spaces
>> for the sockets) could it be that the correct NUMA libraries aren't 
>> installed?
>>
>> Here are some of the other NUMA packages available for CentOS 6.x:
>>
>> yum search numa | less
>>
>>               Loaded plugins: fastestmirror
>>               Loading mirror speeds from cached hostfile
>>               ============================== N/S Matched: numa ===============================
>>               numactl-devel.i686 : Development package for building Applications that use numa
>>               numactl-devel.x86_64 : Development package for building Applications that use numa
>>               numad.x86_64 : NUMA user daemon
>>               numactl.i686 : Library for tuning for Non Uniform Memory Access machines
>>               numactl.x86_64 : Library for tuning for Non Uniform Memory Access machines
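>>
>> If the devel package is what's missing, I assume installing it on each node would
>> just be something like the following (package name taken from the yum search
>> output above):
>>
>>               yum install numactl-devel.x86_64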
>>
>> -Bill Lane
>> ________________________________________
>> From: users [users-boun...@open-mpi.org] on behalf of Reuti 
>> [re...@staff.uni-marburg.de]
>> Sent: Thursday, August 28, 2014 3:27 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots 
>> (updated findings)
>>
>> On 28.08.2014 at 10:09, Lane, William wrote:
>>
>>> I have some updates on these issues and some test results as well.
>>>
>>> We upgraded OpenMPI to the latest version, 1.8.2, but when submitting jobs
>>> via the SGE orte parallel environment we received errors whenever more
>>> slots were requested than there are actual cores on the first node
>>> allocated to the job.
>>
>> Does "-bind-to none" help? Binding is switched on by default from Open MPI
>> 1.8 onwards.
>>
>>
>>> The btl tcp,self switch passed to mpirun made a significant difference in
>>> performance, as shown below:
>>>
>>> Even with the oversubscribe option, the memory mapping errors still persist.
>>> On 32-core nodes with an HPL run compiled against openmpi/1.8.2, it reliably
>>> starts failing once 20 cores are allocated. Note that I tested with 'btl
>>> tcp,self' defined and it slows a quick solve down by roughly a factor of 2;
>>> the difference on a larger solve would probably be more dramatic:
>>> - Quick HPL, 16 cores, with SM: ~19 GFlops
>>> - Quick HPL, 16 cores, without SM: ~10 GFlops
>>>
>>> Unfortunately, a recompiled HPL did not work, but it did give us more
>>> information (error below). We are still trying a couple of things.
>>>
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>>
>>> Bind to:     CORE
>>> Node:        csclprd3-0-7
>>> #processes:  2
>>> #cpus:       1
>>>
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
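>>>
>>> (If we actually wanted to allow that, my understanding is that the qualifier is
>>> appended to the binding directive, e.g.
>>>
>>>     mpirun --bind-to core:overload-allowed ...
>>>
>>> but we haven't tried that yet.)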
>>>
>>> When using the SGE make parallel environment to submit jobs, everything
>>> worked perfectly. I noticed that with the make PE, the number of slots
>>> allocated to the job from each node corresponded to the number of CPUs and
>>> disregarded any additional cores within a CPU as well as any hyperthreading
>>> cores.
>>
>> For SGE the hyperthreading cores count as normal cores. In principle it's
>> possible to have an RQS defined in SGE (`qconf -srqsl`) which will limit the
>> number of cores for the "make" PE, or (better) to limit it in each exechost
>> definition to the physically installed ones (this is what I usually set up -
>> maybe leaving hyperthreading switched on gives some room for the kernel
>> processes this way).
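>>
>> For example, to pin a 16-core exechost to its physical cores you could set the
>> slot count in its complex_values (host name and count are of course placeholders):
>>
>>     qconf -mattr exechost complex_values slots=16 node01
>>
>> or edit the host with `qconf -me node01` and set it there.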
>>
>>
>>> Here are the definitions of the two parallel environments tested (with orte 
>>> always failing when
>>> more slots are requested than there are CPU cores on the first node 
>>> allocated to the job by
>>> SGE):
>>>
>>> [root@csclprd3 ~]# qconf -sp orte
>>> pe_name            orte
>>> slots              9999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    $fill_up
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>> accounting_summary TRUE
>>> qsort_args         NONE
>>>
>>> [root@csclprd3 ~]# qconf -sp make
>>> pe_name            make
>>> slots              999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    NONE
>>> stop_proc_args     NONE
>>> allocation_rule    $round_robin
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>> accounting_summary TRUE
>>> qsort_args         NONE
>>>
>>> Although everything seems to work with the make PE, I'd still like
>>> to know why, because on a much older version of OpenMPI running on an
>>> older version of CentOS, SGE and ROCKS, using all physical cores as
>>> well as all hyperthreads was never a problem (even on NUMA nodes).
>>>
>>> What is the recommended SGE parallel environment definition for
>>> OpenMPI 1.8.2?
>>
>> Whether you prefer $fill_up or $round_robin is up to you - do you prefer all
>> your processes on the fewest machines, or spread around the cluster? If there
>> is a lot of communication it may be better to stay on fewer machines, but if
>> each process does heavy I/O to the local scratch disk, spreading them around
>> may be the preferred choice. This makes no difference to Open MPI, as the
>> generated $PE_HOSTFILE contains just the list of granted slots. Doing it
>> $fill_up style will of course fill the first node, including the
>> hyperthreading cores, before moving on to the next machine (`man sge_pe`).
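>>
>> If you want to see exactly what was granted, you can put a line like this in the
>> job script before the mpirun call:
>>
>>     cat $PE_HOSTFILE
>>
>> Each line of that file contains a host, the number of slots granted on it, the
>> queue and a processor range; Open MPI reads the host and slot columns from it.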
>>
>> -- Reuti
>>
>>
>>> I apologize for the length of this, but I thought it best to provide more
>>> information rather than less.
>>>
>>> Thank you in advance,
>>>
>>> -Bill Lane
>>>
>>> ________________________________________
>>> From: users [users-boun...@open-mpi.org] on behalf of Jeff Squyres 
>>> (jsquyres) [jsquy...@cisco.com]
>>> Sent: Friday, August 08, 2014 5:25 AM
>>> To: Open MPI User's List
>>> Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots
>>>
>>> On Aug 8, 2014, at 1:24 AM, Lane, William <william.l...@cshs.org> wrote:
>>>
>>>> Using the "--mca btl tcp,self" switch to mpirun solved all the issues (in 
>>>> addition to
>>>> the requirement to include the --mca btl_tcp_if_include eth0 switch). I 
>>>> believe
>>>> the "--mca btl tcp,self" switch limits inter-process communication within 
>>>> a node to using the TCP
>>>> loopback rather than shared memory.
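>>>>
>>>> For reference, the full command line that now works looks roughly like this (the
>>>> process count and executable are placeholders):
>>>>
>>>>     mpirun -np 28 --mca btl tcp,self --mca btl_tcp_if_include eth0 ./a.out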
>>>
>>> Correct.  You will not be using shared memory for MPI communication at all 
>>> -- just TCP.
>>>
>>>> I should also point out that all of the nodes
>>>> on this cluster feature NUMA architecture.
>>>>
>>>> Will using the "--mca btl tcp,self" switch to mpirun result in any 
>>>> degraded performance
>>>> issues over using shared memory?
>>>
>>> Generally yes, but it depends on your application.  If your application 
>>> does very little MPI communication, then the difference between shared 
>>> memory and TCP is likely negligible.
>>>
>>> I'd strongly suggest two things:
>>>
>>> - Upgrade to at least Open MPI 1.6.5 (1.8.x would be better, if possible)
>>> - Run your program through a memory-checking debugger such as Valgrind
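>>>
>>> For the Valgrind suggestion, it is typically just inserted in front of the
>>> executable on the mpirun line, e.g. (process count, log file name, and program
>>> are placeholders):
>>>
>>>     mpirun -np 4 valgrind --log-file=vg.%p.out ./a.out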
>>>
>>> Seg faults like you initially described can be caused by errors in your MPI 
>>> application itself -- the fact that using TCP only (and not shared memory) 
>>> avoids the segvs does not mean that the issue is actually fixed; it may 
>>> well mean that the error is still there, but is happening in a case that 
>>> doesn't seem to cause enough damage to cause a segv.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>>
>>
>>
>
>

