Re: [OMPI users] Q: Getting MPI-level memory use from OpenMPI?

2023-04-17 Thread Brian Dobbins via users
Hi George,

  Got it, thanks for the info - I naively hadn't considered that all the
related libraries likely have their *own* allocators.  So, for *OpenMPI*,
it sounds like I can hook the opal_[mc]alloc calls, with a new build that
turns memory debugging on, to tally up and report the total size of
OpenMPI's allocations - that seems pretty straightforward.  But I'd guess
that for a data-heavy MPI application, the majority of the memory will be
in transport-level buffers, and for me that's likely the UCX layer, so I
should look to that community / code for quantifying how large those
buffers get inside my application?
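
  (For the record, the kind of tally I have in mind is nothing fancier
than the sketch below - illustrative only, with made-up function names,
not actual Open MPI code; the real hook would live in opal/util/malloc.c
per George's note below:

/* Illustrative allocation tally - not Open MPI code; names are made up.
 * Ignores threading for the sake of the sketch.                         */
#include <stdio.h>
#include <stdlib.h>

static size_t total_bytes = 0;   /* live bytes allocated through the hook */

void *tracked_malloc(size_t size)
{
    /* over-allocate by one size_t so the block remembers its own size */
    size_t *p = malloc(size + sizeof(size_t));
    if (!p) return NULL;
    *p = size;
    total_bytes += size;
    return p + 1;
}

void tracked_free(void *ptr)
{
    if (!ptr) return;
    size_t *p = (size_t *)ptr - 1;
    total_bytes -= *p;
    free(p);
}

void report_mpi_memory(const char *where)
{
    fprintf(stderr, "[%s] MPI-layer allocations: %.2f MB\n",
            where, total_bytes / (1024.0 * 1024.0));
}

  ... and then I'd just call the report function at the same places I
already sample RSS.)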

  Thanks again, and apologies for what is surely a woeful misuse of the
correct terminology here on some of this stuff.

  - Brian


On Mon, Apr 17, 2023 at 11:05 AM George Bosilca  wrote:

> Brian,
>
> OMPI does not have an official mechanism to report how much memory OMPI
> allocates. But, there is hope:
>
> 1. We have a mechanism to help debug memory issues
> (OPAL_ENABLE_MEM_DEBUG). You could enable it and then provide your own
> flavor of memory tracking in opal/util/malloc.c
> 2. You can use a traditional malloc trapping mechanism (valgrind, malt,
> mtrace,...), and investigate the stack to detect where the allocation was
> issued and then count.
>
> The first approach would only give you the memory used by OMPI itself, not
> the other libraries we are using (PMIx, HWLOC, UCX, ...). The second might
> be a little more generic, but depends on external tools and might take a
> little time to set up.
>
> George.
>
>
> On Fri, Apr 14, 2023 at 3:31 PM Brian Dobbins via users <
> users@lists.open-mpi.org> wrote:
>
>>
>> Hi all,
>>
>>   I'm wondering if there's a simple way to get statistics from OpenMPI as
>> to how much memory the *MPI* layer in an application is taking.  For
>> example, I'm running a model and I can get the RSS size at various points
>> in the code, and that reflects the user data for the application, *plus*,
>> surely, buffers for MPI messages that are either allocated at runtime or,
>> maybe, a pool from start-up.  The memory use - which I assume is tied to
>> internal buffers? - differs considerably with *how* I run MPI - eg, TCP vs
>> UCX, and, with UCX, UD vs RC mode.
>>
>>   Here's an example of this:
>>
>> 60km (163842 columns), 2304 ranks [OpenMPI]
>> UCX Transport Changes (environment variable)
>> (No recompilation; all runs done on same nodes)
>> Showing memory after ATM-TO-MED Step
>> [RSS Memory in MB]
>>
>> Standard Decomposition
>> UCX_TLS value      ud        default     rc
>> Run 1              347.03    392.08      750.32
>> Run 2              346.96    391.86      748.39
>> Run 3              346.89    392.18      750.23
>>
>>   I'd love a way to trace how much *MPI alone* is using, since here I'm
>> still measuring the *process's* RSS.  My feeling is that if, for
>> example, I'm running on N nodes and have a 1GB dataset + (for the sake of
>> discussion) 100MB of MPI info, then at 2N, with good scaling of domain
>> memory, that's 500MB + 100MB, at 4N it's 250MB + 100MB, and eventually, at
>> 16N, the MPI memory dominates.  As a result, when we scale out, even with
>> perfect scaling of *domain* memory, at some point memory associated with
>> MPI will cause this curve to taper off, and potentially invert.  But I'm
>> admittedly *way* out of date on how modern MPI implementations allocate
>> buffers.
>>
>>   In short, any tips on ways to better characterize MPI memory use would
>> be *greatly* appreciated!  If this is purely on the UCX (or other
>> transport) level, that's good to know too.
>>
>>   Thanks,
>>   - Brian
>>
>>
>>


[OMPI users] Q: Getting MPI-level memory use from OpenMPI?

2023-04-14 Thread Brian Dobbins via users
Hi all,

  I'm wondering if there's a simple way to get statistics from OpenMPI as
to how much memory the *MPI* layer in an application is taking.  For
example, I'm running a model and I can get the RSS size at various points
in the code, and that reflects the user data for the application, *plus*,
surely, buffers for MPI messages that are either allocated at runtime or,
maybe, a pool from start-up.  The memory use - which I assume is tied to
internal buffers? - differs considerably with *how* I run MPI - eg, TCP vs
UCX, and, with UCX, UD vs RC mode.

  Here's an example of this:

60km (163842 columns), 2304 ranks [OpenMPI]
UCX Transport Changes (environment variable)
(No recompilation; all runs done on same nodes)
Showing memory after ATM-TO-MED Step
[RSS Memory in MB]

Standard Decomposition
UCX_TLS value      ud        default     rc
Run 1              347.03    392.08      750.32
Run 2              346.96    391.86      748.39
Run 3              346.89    392.18      750.23
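
  (Side note on methodology: the RSS numbers above are just the process's
resident set, sampled from inside the code after that step.  One simple way
to do that on Linux - a minimal sketch, not necessarily how any given model
does it - is to parse VmRSS out of /proc/self/status:

/* Sketch: sample the calling process's resident set size (VmRSS) on Linux. */
#include <stdio.h>
#include <string.h>

long rss_in_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f) return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            sscanf(line + 6, "%ld", &kb);   /* value is reported in kB */
            break;
        }
    }
    fclose(f);
    return kb;
}

  ... which is why I can only see the total, not the MPI share.)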

  I'd love a way to trace how much *MPI alone* is using, since here I'm
still measuring the *process's* RSS.  My feeling is that if, for example,
I'm running on N nodes and have a 1GB dataset + (for the sake of
discussion) 100MB of MPI info, then at 2N, with good scaling of domain
memory, that's 500MB + 100MB, at 4N it's 250MB + 100MB, and eventually, at
16N, the MPI memory dominates.  As a result, when we scale out, even with
perfect scaling of *domain* memory, at some point memory associated with
MPI will cause this curve to taper off, and potentially invert.  But I'm
admittedly *way* out of date on how modern MPI implementations allocate
buffers.
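
  (Said with symbols: if D is the domain data and M(N) is the per-rank MPI
footprint, then per-rank memory is roughly

    RSS(N) ≈ D/N + M(N)

so even if M stays flat, MPI is half the footprint once N ≈ D/M - around
N = 10 for the 1GB / 100MB numbers above - and the curve inverts sooner if
M grows with N.)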

  In short, any tips on ways to better characterize MPI memory use would be
*greatly* appreciated!  If this is purely on the UCX (or other transport)
level, that's good to know too.

  Thanks,
  - Brian


Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Brian Dobbins via users
Hi Ralph,

  Thanks again for this wealth of information - we've successfully run the
same container instance across multiple systems without issues, even
surpassing 'native' performance in edge cases, presumably because the
native host MPI is either older or simply tuned differently (eg, 'eager
limit' differences or medium/large message optimizations) than the one in
the container.  That's actually really useful already, and coupled with
what you've already pointed out about the launcher vs the MPI when it comes
to ABI issues, makes me pretty happy.

  But I would love to dive a little deeper on two of the more complex
things you've brought up:

  1) With regards to not including the fabric drivers in the container and
mounting in the device drivers, how many different sets of drivers
*are* there?
I know you mentioned you're not really plugged into the container
community, so maybe this is more a question for them, but I'd think if
there's a relatively small set that accounts for most systems, you might be
able to include them all, and have the dlopen facility find the correct one
at launch?  (Eg, the 'host' launcher could transmit some information as to
which drivers - like 'mlnx5' - are desired, and it looks for those inside
the container?)
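
  (To be concrete about the mechanism I'm picturing - and this is purely
illustrative, not how OMPI's actual component loader works, and with
made-up library names - I just mean the usual dlopen probe over a
candidate list:

/* Illustrative dlopen probe; library names are invented for the example. */
#include <dlfcn.h>
#include <stdio.h>

void *probe_fabric_driver(void)
{
    const char *candidates[] = { "libfabric_driver_a.so",
                                 "libfabric_driver_b.so", NULL };
    for (int i = 0; candidates[i] != NULL; i++) {
        void *handle = dlopen(candidates[i], RTLD_NOW | RTLD_GLOBAL);
        if (handle) {
            printf("loaded %s\n", candidates[i]);
            return handle;       /* first driver that resolves wins */
        }
    }
    return NULL;                 /* nothing usable on this system   */
}

  ... with the host-side launcher just telling the container which names
to try first.)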

  I definitely agree that mounting in specific drivers is beyond what most
users are comfortable with, so understanding the plans to address this
would be great.  Mind you, for most of our work, even just including the
'inbox' OFED drivers works well enough right now.

  2) With regards to launching multiple processes within one container for
shared memory access, how is this done?  Or is it automatic now with modern
launchers?  Eg, if the launch command knows it's running 96 copies of the
same container (via either 'host:96' or '-ppn 96' or something), is it
'smart' enough to do this?   This also hasn't been a problem for us, since
we're typically rate-limited by inter-node comms, not intra-node ones, but
it'd be good to ensure we're doing it 'right'.
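
  (For the other arrangement you describe below - one container per rank
sharing an external mount - my mental model is plain file-backed shared
memory, roughly like this sketch, where the path is a hypothetical
commonly-mounted directory:

/* Sketch: processes in different containers sharing memory through a file
 * on a commonly-mounted directory.  The path below is made up.            */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/shared_mount/shm_segment";   /* hypothetical mount */
    size_t len = 1 << 20;

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, len) != 0) { perror("shm file"); return 1; }

    /* MAP_SHARED mappings of the same file see the same pages */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(buf, "hello from one rank");   /* visible to every other mapper */
    munmap(buf, len);
    close(fd);
    return 0;
}

  ... so I mostly want to confirm that's the shape of it, and whether the
launchers set it up automatically these days.)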

  Thanks again,
  - Brian


On Thu, Jan 27, 2022 at 10:22 AM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> Just to complete this - there is always a lingering question regarding
> shared memory support. There are two ways to resolve that one:
>
> * run one container per physical node, launching multiple procs in each
> container. The procs can then utilize shared memory _inside_ the container.
> This is the cleanest solution (i.e., minimizes container boundary
> violations), but some users need/want per-process isolation.
>
> * run one container per MPI process, having each container then mount an
> _external_ common directory to an internal mount point. This allows each
> process to access the common shared memory location. As with the device
> drivers, you typically specify that external mount location when launching
> the container.
>
> Using those combined methods, you can certainly have a "generic" container
> that suffers no performance penalty versus bare metal. The problem has been
> that it takes a certain degree of "container savvy" to set this up and make
> it work - which is beyond what most users really want to learn. I'm sure
> the container community is working on ways to reduce that burden (I'm not
> really plugged into those efforts, but others on this list might be).
>
> Ralph
>
>
> > On Jan 27, 2022, at 7:39 AM, Ralph H Castain  wrote:
> >
> >> Fair enough Ralph! I was implicitly assuming a "build once / run
> everywhere" use case, my bad for not making my assumption clear.
> >> If the container is built to run on a specific host, there are indeed
> other options to achieve near native performances.
> >>
> >
> > Err...that isn't actually what I meant, nor what we did. You can, in
> fact, build a container that can "run everywhere" while still employing
> high-speed fabric support. What you do is:
> >
> > * configure OMPI with all the fabrics enabled (or at least all the ones
> you care about)
> >
> > * don't include the fabric drivers in your container. These can/will
> vary across deployments, especially those (like NVIDIA's) that involve
> kernel modules
> >
> > * setup your container to mount specified external device driver
> locations onto the locations where you configured OMPI to find them. Sadly,
> this does violate the container boundary - but nobody has come up with
> another solution, and at least the violation is confined to just the device
> drivers. Typically, you specify the external locations that are to be
> mounted using an envar or some other mechanism appropriate to your
> container, and then include the relevant information when launching the
> containers.
> >
> > When OMPI initializes, it will do its normal procedure of attempting to
> load each fabric's drivers, selecting the transports whose drivers it can
> load. NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to
> build without statically linking in the fabric plugins or else this
> probably 

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-26 Thread Brian Dobbins via users
Hi Ralph,

  Thanks for the explanation - in hindsight, that makes perfect sense,
since each process is operating inside the container and will of course
load up identical libraries, so data types/sizes can't be inconsistent.  I
don't know why I didn't realize that before.  I imagine the past issues I'd
experienced were just due to the PMI differences in the different MPI
implementations at the time.  I owe you a beer or something at the next
in-person SC conference!

  Cheers,
  - Brian


On Wed, Jan 26, 2022 at 4:54 PM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

> There is indeed an ABI difference. However, the _launcher_ doesn't have
> anything to do with the MPI library. All that is needed is a launcher that
> can provide the key exchange required to wireup the MPI processes. At this
> point, both MPICH and OMPI have PMIx support, so you can use the same
> launcher for both. IMPI does not, and so the IMPI launcher will only
> support PMI-1 or PMI-2 (I forget which one).
>
> You can, however, work around that problem. For example, if the host
> system is using Slurm, then you could "srun" the containers and let Slurm
> perform the wireup. Again, you'd have to ensure that OMPI was built to
> support whatever wireup protocol the Slurm installation supported (which
> might well be PMIx today). Also works on Cray/ALPS. Completely bypasses the
> IMPI issue.
>
> Another option I've seen used is to have the host system start the
> containers (using ssh or whatever), providing the containers with access to
> a "hostfile" identifying the TCP address of each container. It is then easy
> for OMPI's mpirun to launch the job across the containers. I use this every
> day on my machine (using Docker Desktop with Docker containers, but the
> container tech is irrelevant here) to test OMPI. Pretty easy to set that
> up, and I should think the sys admins could do so for their users.
>
> Finally, you could always install the PMIx Reference RTE (PRRTE) on the
> cluster as that executes at user level, and then use PRRTE to launch your
> OMPI containers. OMPI runs very well under PRRTE - in fact, PRRTE is the
> RTE embedded in OMPI starting with the v5.0 release.
>
> Regardless of your choice of method, the presence of IMPI doesn't preclude
> using OMPI containers so long as the OMPI library is fully contained in
> that container. Choice of launch method just depends on how your system is
> setup.
>
> Ralph
>
>
> On Jan 26, 2022, at 3:17 PM, Brian Dobbins  wrote:
>
>
> Hi Ralph,
>
> Afraid I don't understand. If your image has the OMPI libraries installed
>> in it, what difference does it make what is on your host? You'll never see
>> the IMPI installation.
>>
>
>> We have been supporting people running that way since Singularity was
>> originally released, without any problems. The only time you can hit an
>> issue is if you try to mount the MPI libraries from the host (i.e., violate
>> the container boundary) - so don't do that and you should be fine.
>>
>
>   Can you clarify what you mean here?  I thought there was an ABI
> difference between the various MPICH-based MPIs and OpenMPI, meaning you
> can't use a host's Intel MPI to launch a container's OpenMPI-compiled
> program.  You *can* use the internal-to-the-container OpenMPI to launch
> everything, which is easy for single-node runs but more challenging for
> multi-node ones.  Maybe my understanding is wrong or out of date though?
>
>   Thanks,
>   - Brian
>
>
>
>>
>>
>> On Jan 26, 2022, at 12:19 PM, Luis Alfredo Pires Barbosa <
>> luis_pire...@hotmail.com> wrote:
>>
>> Hi Ralph,
>>
>> My singularity image has OpenMPI, but my host doesn't (Intel MPI). And I
>> am not sure if the system would work with Intel + OpenMPI.
>>
>> Luis
>>
>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986>
>> for Windows
>>
>> *From: *Ralph Castain via users 
>> *Sent: *Wednesday, 26 January 2022 16:01
>> *To: *Open MPI Users 
>> *Cc: *Ralph Castain 
>> *Subject: *Re: [OMPI users] OpenMPI - Intel MPI
>>
>> Err...the whole point of a container is to put all the library
>> dependencies _inside_ it. So why don't you just install OMPI in your
>> singularity image?
>>
>>
>>
>> On Jan 26, 2022, at 6:42 AM, Luis Alfredo Pires Barbosa via users <
>> users@lists.open-mpi.org> wrote:
>>
>> Hello all,
>>
>> I have Intel MPI on my cluster, but I am running a singularity image of a
>> software package which uses OpenMPI.
>>
>> Since they may not be compatible, and I don't think it is possible to get
>> these two different MPIs running on the same system,
>> I wonder if there is some workaround for this issue.
>>
>> Any insight would be welcome.
>> Luis
>>
>>
>>
>


Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-26 Thread Brian Dobbins via users
Hi Ralph,

Afraid I don't understand. If your image has the OMPI libraries installed
> in it, what difference does it make what is on your host? You'll never see
> the IMPI installation.
>

> We have been supporting people running that way since Singularity was
> originally released, without any problems. The only time you can hit an
> issue is if you try to mount the MPI libraries from the host (i.e., violate
> the container boundary) - so don't do that and you should be fine.
>

  Can you clarify what you mean here?  I thought there was an ABI
difference between the various MPICH-based MPIs and OpenMPI, meaning you
can't use a host's Intel MPI to launch a container's OpenMPI-compiled
program.  You *can* use the internal-to-the-container OpenMPI to launch
everything, which is easy for single-node runs but more challenging for
multi-node ones.  Maybe my understanding is wrong or out of date though?

  Thanks,
  - Brian



>
>
> On Jan 26, 2022, at 12:19 PM, Luis Alfredo Pires Barbosa <
> luis_pire...@hotmail.com> wrote:
>
> Hi Ralph,
>
> My singularity image has OpenMPI, but my host doesn't (Intel MPI). And I am
> not sure if the system would work with Intel + OpenMPI.
>
> Luis
>
> Sent from Mail 
> for Windows
>
> *From: *Ralph Castain via users 
> *Sent: *Wednesday, 26 January 2022 16:01
> *To: *Open MPI Users 
> *Cc: *Ralph Castain 
> *Subject: *Re: [OMPI users] OpenMPI - Intel MPI
>
> Err...the whole point of a container is to put all the library
> dependencies _inside_ it. So why don't you just install OMPI in your
> singularity image?
>
>
>
> On Jan 26, 2022, at 6:42 AM, Luis Alfredo Pires Barbosa via users <
> users@lists.open-mpi.org> wrote:
>
> Hello all,
>
> I have Intel MPI on my cluster, but I am running a singularity image of a
> software package which uses OpenMPI.
>
> Since they may not be compatible, and I don't think it is possible to get
> these two different MPIs running on the same system,
> I wonder if there is some workaround for this issue.
>
> Any insight would be welcome.
> Luis
>
>
>


Re: [OMPI users] Issues with compilers

2021-01-22 Thread Brian Dobbins via users
As a work-around, but not a 'solution', it's worth pointing out that the
(new) Intel compilers are now *usable* for free - no licensing cost or
login needed.  (As are the MKL, Intel MPI, etc).

Link:
https://software.intel.com/content/www/us/en/develop/tools/oneapi/all-toolkits.html

They've got Yum/Apt repos, tgz files, even Docker images you can use to get
them.  No need for 'root' access to install with the tgz files or use via
Singularity if you have it.   Note that they're easy to install, and even
copy, but *not* legally distributable, and use implies consent to the EULA
on the site.  I've used them to good effect already.  Just download,
install, add to your path, then build OpenMPI with Intel Fortran + Intel
C/C++ and you're good to go, so long as you aren't distributing them.

Cheers,
  - Brian


On Fri, Jan 22, 2021 at 8:12 AM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> On Jan 22, 2021, at 9:49 AM, Alvaro Payero Pinto via users <
> users@lists.open-mpi.org> wrote:
> >
> > I am trying to install Open MPI with Intel compiler suite for the
> Fortran side and GNU compiler suite for the C side. For factors that don’t
> depend upon me, I’m not allowed to change the C compiler suite to the Intel
> one, since that would mean an additional license.
>
> Yoinks.  I'll say right off that this will be a challenge.
>
> > The problem arises from the fact that the installation should not
> dynamically depend on Intel libraries, so the flag “-static-intel” (or
> similar) should be passed to the Fortran compiler. I’ve seen in the FAQ
> that this problem is solved by passing an Autotools option
> “-Wc,-static-intel” to the variable LDFLAGS when invoking configure with
> Intel compilers. This works if both C/C++ and Fortran compilers are from
> Intel. However, it crashes if the compiler suite is mixed since GNU C/C++
> does not recognise the “-static-intel” option.
>
> The problem is that the same LDFLAGS value is used for all 3 languages (C,
> C++, Fortran), because they can all be compiled into a single application.
> So the Autotools don't separate out different LDFLAGS for the different
> languages.
>
> > Is there any way to bypass this crash and to indicate that such option
> should only be passed when using Fortran compiler?
>
> Keep in mind that there's also two different cases here:
>
> 1. When compiling Open MPI itself
> 2. When compiling MPI applications
>
> You can customize the behavior of the mpifort wrapper compiler by editing
> share/openmpi/mpifort-wrapper-data.txt.
>
> #1 is likely to be a bit more of a challenge.
>
> ...but the thought occurs to me that #2 may be sufficient.  You might want
> to try it and see if your MPI applications have the Intel libraries
> statically linked, and that's enough...?
>
> > Configure call to reproduce the crash is made as follows:
> >
> > ./configure --prefix=/usr/local/ --libdir=/usr/local/lib64/
> --includedir=/usr/local/include/ CC=gcc CXX=g++ 'FLAGS=-O2 -m64'
> 'CFLAGS=-O2 -m64' 'CXXFLAGS=-O2 -m64' FC=ifort 'FCFLAGS=-O2 -m64'
> LDFLAGS=-Wc,-static-intel
>
> The other, slightly more invasive mechanism you could try if #2 is not
> sufficient is to write your own wrapper compiler script that intercepts /
> strips out -Wc,-static-intel for the C and C++ compilers.  For example:
>
> ./configure CC=my_gcc_wrapper.sh CXX=my_g++_wrapper.sh ...
> LDFLAGS=-Wc,-static-intel
>
> Those two scripts are simple shell scripts that strip -Wc,-static-intel if
> they see it, but otherwise just invoke gcc/g++ with all other contents of $*.
>
> It's a gross hack, but it might work.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>


Re: [OMPI users] Q: Binding to cores on AWS?

2017-12-22 Thread Brian Dobbins
Hi Gilles,

  You're right, we no longer get warnings... and the performance disparity
still exists, though to be clear it's only in select parts of the code -
others run as we'd expect.  This is probably why I initially guessed it was
a process/memory affinity issue - the one timer I looked at is in a
memory-intensive part of the code.  Now I'm wondering if we're still
getting issues binding (I need to do a comparison with a local system), or
if it could be due to the cache size differences - the AWS C4 instances
have 25MB/socket, and we have 45MB/socket.  If we fit in cache on our
system, and don't on theirs, that could account for things.  Testing that
is next up on my list, actually.
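
  (One quick sanity check I may add to the code: have each rank print the
core it's on and the NUMA node a freshly touched buffer actually landed
on.  A rough sketch, assuming libnuma is available - e.g. built with
'mpicc check_bind.c -lnuma':

/* Rough per-rank binding/placement check; assumes libnuma is installed. */
#define _GNU_SOURCE
#include <mpi.h>
#include <numaif.h>      /* get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR */
#include <sched.h>       /* sched_getcpu */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, node = -1;
    size_t len = 16 * 1024 * 1024;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(len);
    memset(buf, 0, len);             /* first touch places the pages */

    /* ask the kernel which NUMA node holds the first page of the buffer */
    get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR);

    printf("rank %d: cpu %d, buffer on NUMA node %d\n",
           rank, sched_getcpu(), node);

    free(buf);
    MPI_Finalize();
    return 0;
}

  If ranks pile onto the same core, or buffers land on a remote node, that
would explain a lot.)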

  Cheers,
  - Brian


On Fri, Dec 22, 2017 at 7:55 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Brian,
>
> i have no doubt this was enough to get rid of the warning messages.
>
> out of curiosity, are you now able to experience performances close to
> native runs ?
> if i understand correctly, the linux kernel allocates memory on the
> closest NUMA domain (e.g. socket if i oversimplify), and since
> MPI tasks are bound by orted/mpirun before they are execv'ed, i have
> a hard time understanding how not binding MPI tasks to
> memory can have a significant impact on performance as long as they
> are bound to cores.
>
> Cheers,
>
> Gilles
>
>
> On Sat, Dec 23, 2017 at 7:27 AM, Brian Dobbins  wrote:
> >
> > Hi Ralph,
> >
> >   Well, this gets chalked up to user error - the default AMI images come
> > without the NUMA-dev libraries, so OpenMPI didn't get built with it (and in
> > my haste, I hadn't checked).  Oops.  Things seem to be working correctly
> > now.
> >
> >   Thanks again for your help,
> >   - Brian
> >
> >
> > On Fri, Dec 22, 2017 at 2:14 PM, r...@open-mpi.org  wrote:
> >>
> >> I honestly don’t know - will have to defer to Brian, who is likely out for
> >> at least the extended weekend. I’ll point this one to him when he returns.
> >>
> >>
> >> On Dec 22, 2017, at 1:08 PM, Brian Dobbins  wrote:
> >>
> >>
> >>   Hi Ralph,
> >>
> >>   OK, that certainly makes sense - so the next question is, what prevents
> >> binding memory to be local to particular cores?  Is this possible in a
> >> virtualized environment like AWS HVM instances?
> >>
> >>   And does this apply only to dynamic allocations within an instance, or
> >> static as well?  I'm pretty unfamiliar with how the hypervisor (KVM-based, I
> >> believe) maps out 'real' hardware, including memory, to particular
> >> instances.  We've seen some parts of the code (bandwidth heavy) run ~10x
> >> faster on bare-metal hardware, though, presumably from memory locality, so
> >> it certainly has a big impact.
> >>
> >>   Thanks again, and merry Christmas!
> >>   - Brian
> >>
> >>
> >> On Fri, Dec 22, 2017 at 1:53 PM, r...@open-mpi.org 
> >> wrote:
> >>>
> >>> Actually, that message is telling you that binding to core is available,
> >>> but that we cannot bind memory to be local to that core. You can verify the
> >>> binding pattern by adding --report-bindings to your cmd line.
> >>>
> >>>
> >>> On Dec 22, 2017, at 11:58 AM, Brian Dobbins  wrote:
> >>>
> >>>
> >>> Hi all,
> >>>
> >>>   We're testing a model on AWS using C4/C5 nodes and some of our timers,
> >>> in a part of the code with no communication, show really poor performance
> >>> compared to native runs.  We think this is because we're not binding to a
> >>> core properly and thus not caching, and a quick 'mpirun --bind-to core
> >>> hostname' does suggest issues with this on AWS:
> >>>
> >>> [bdobbins@head run]$ mpirun --bind-to core hostname
> >>>
> >>> --
> >>> WARNING: a request was made to bind a process. While the system
> >>> supports binding the process itself, at least one node does NOT
> >>> support binding memory to the process location.
> >>>
> >>>   Node:  head
> >>>
> >>> Open MPI uses the "hwloc" library to perform process and memory
> >>> binding. This error messag

Re: [OMPI users] Q: Binding to cores on AWS?

2017-12-22 Thread Brian Dobbins
Hi Ralph,

  Well, this gets chalked up to user error - the default AMI images come
without the NUMA-dev libraries, so OpenMPI didn't get built with it (and in
my haste, I hadn't checked).  Oops.  Things seem to be working correctly
now.

  Thanks again for your help,
  - Brian


On Fri, Dec 22, 2017 at 2:14 PM, r...@open-mpi.org  wrote:

> I honestly don’t know - will have to defer to Brian, who is likely out for
> at least the extended weekend. I’ll point this one to him when he returns.
>
>
> On Dec 22, 2017, at 1:08 PM, Brian Dobbins  wrote:
>
>
>   Hi Ralph,
>
>   OK, that certainly makes sense - so the next question is, what prevents
> binding memory to be local to particular cores?  Is this possible in a
> virtualized environment like AWS HVM instances?
>
>   And does this apply only to dynamic allocations within an instance, or
> static as well?  I'm pretty unfamiliar with how the hypervisor (KVM-based,
> I believe) maps out 'real' hardware, including memory, to particular
> instances.  We've seen *some* parts of the code (bandwidth heavy) run
> ~10x faster on bare-metal hardware, though, *presumably* from memory
> locality, so it certainly has a big impact.
>
>   Thanks again, and merry Christmas!
>   - Brian
>
>
> On Fri, Dec 22, 2017 at 1:53 PM, r...@open-mpi.org 
> wrote:
>
>> Actually, that message is telling you that binding to core is available,
>> but that we cannot bind memory to be local to that core. You can verify the
>> binding pattern by adding --report-bindings to your cmd line.
>>
>>
>> On Dec 22, 2017, at 11:58 AM, Brian Dobbins  wrote:
>>
>>
>> Hi all,
>>
>>   We're testing a model on AWS using C4/C5 nodes and some of our timers,
>> in a part of the code with no communication, show really poor performance
>> compared to native runs.  We think this is because we're not binding to a
>> core properly and thus not caching, and a quick 'mpirun --bind-to core
>> hostname' does suggest issues with this on AWS:
>>
>> *[bdobbins@head run]$ mpirun --bind-to core hostname*
>>
>> *--*
>> *WARNING: a request was made to bind a process. While the system*
>> *supports binding the process itself, at least one node does NOT*
>> *support binding memory to the process location.*
>>
>> *  Node:  head*
>>
>> *Open MPI uses the "hwloc" library to perform process and memory*
>> *binding. This error message means that hwloc has indicated that*
>> *processor binding support is not available on this machine.*
>>
>>   (It also happens on compute nodes, and with real executables.)
>>
>>   Does anyone know how to enforce binding to cores on AWS instances?  Any
>> insight would be great.
>>
>>   Thanks,
>>   - Brian
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Q: Binding to cores on AWS?

2017-12-22 Thread Brian Dobbins
  Hi Ralph,

  OK, that certainly makes sense - so the next question is, what prevents
binding memory to be local to particular cores?  Is this possible in a
virtualized environment like AWS HVM instances?

  And does this apply only to dynamic allocations within an instance, or
static as well?  I'm pretty unfamiliar with how the hypervisor (KVM-based,
I believe) maps out 'real' hardware, including memory, to particular
instances.  We've seen *some* parts of the code (bandwidth heavy) run ~10x
faster on bare-metal hardware, though, *presumably* from memory locality,
so it certainly has a big impact.

  Thanks again, and merry Christmas!
  - Brian


On Fri, Dec 22, 2017 at 1:53 PM, r...@open-mpi.org  wrote:

> Actually, that message is telling you that binding to core is available,
> but that we cannot bind memory to be local to that core. You can verify the
> binding pattern by adding --report-bindings to your cmd line.
>
>
> On Dec 22, 2017, at 11:58 AM, Brian Dobbins  wrote:
>
>
> Hi all,
>
>   We're testing a model on AWS using C4/C5 nodes and some of our timers,
> in a part of the code with no communication, show really poor performance
> compared to native runs.  We think this is because we're not binding to a
> core properly and thus not caching, and a quick 'mpirun --bind-to core
> hostname' does suggest issues with this on AWS:
>
> *[bdobbins@head run]$ mpirun --bind-to core hostname*
>
> *--*
> *WARNING: a request was made to bind a process. While the system*
> *supports binding the process itself, at least one node does NOT*
> *support binding memory to the process location.*
>
> *  Node:  head*
>
> *Open MPI uses the "hwloc" library to perform process and memory*
> *binding. This error message means that hwloc has indicated that*
> *processor binding support is not available on this machine.*
>
>   (It also happens on compute nodes, and with real executables.)
>
>   Does anyone know how to enforce binding to cores on AWS instances?  Any
> insight would be great.
>
>   Thanks,
>   - Brian
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Q: Binding to cores on AWS?

2017-12-22 Thread Brian Dobbins
Hi all,

  We're testing a model on AWS using C4/C5 nodes and some of our timers, in
a part of the code with no communication, show really poor performance
compared to native runs.  We think this is because we're not binding to a
core properly and thus not caching, and a quick 'mpirun --bind-to core
hostname' does suggest issues with this on AWS:

*[bdobbins@head run]$ mpirun --bind-to core hostname*
*--*
*WARNING: a request was made to bind a process. While the system*
*supports binding the process itself, at least one node does NOT*
*support binding memory to the process location.*

*  Node:  head*

*Open MPI uses the "hwloc" library to perform process and memory*
*binding. This error message means that hwloc has indicated that*
*processor binding support is not available on this machine.*

  (It also happens on compute nodes, and with real executables.)

  Does anyone know how to enforce binding to cores on AWS instances?  Any
insight would be great.

  Thanks,
  - Brian
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Q: Fortran, MPI_VERSION and #defines

2016-03-21 Thread Brian Dobbins
Hi Dave,

With which compiler, and even optimized?
>
>   $ `mpif90 --showme` --version | head -n1
>   GNU Fortran (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)
>   $ cat a.f90
>   use mpi
>   if (mpi_version == 3) call undefined()
>   print *, mpi_version
>   end
>   $ mpif90 a.f90 && ./a.out
>  2
>

No, optimized works, actually - unoptimized is the issue.  I should've
added that in the beginning.  Since MPI_VERSION is a parameter, the
optimizer *knows* the code path won't be used, and thus it doesn't include
it in the binary and life is good.  Compiling with -O0 results in an issue,
however, at least with Intel 15/16 compilers, and I'd guess gfortran
too(?).

So, once again, maybe I'm trying too hard to find a problem.  ;-)

If I don't ever need to build with -O0, *or* simply request the user to
provide the -DMPI3 (or maybe -DMPI2, since that's the less-common one now)
flag, I have no issues.

Yes, not using cmake is definitely better -- people like me should be
> able to build the result!  With autoconf, you could run the above to get
> mpi_version, or adapt it to produce a link-time error for a specific
> version if cross-compiling.  However, you should test for the routines
> you need in the normal autoconf way.  Presumably checking one is enough
> for a particular standard level.
>

Yes, this is an option too.  It's more a mild nuisance stemming from the
difference that C/C++ can, by default, process the #defines via the
preprocessor, whereas Fortran code doesn't have the same flexibility.
Autoconf providing the -D flag is an option, indeed.


> I'd hope you could modularize things and select the right modules to
> link using a configure test and possibly automake.  That's probably
> easier for Fortran than the sort of C I encounter.
>

I'm only responsible for a tiny fraction of the code, none of which is the
build process.  I think I'm attempting to over-simplify an already
fairly-simple-but-not-quite-perfect reality.  ;)

  Thanks for the feedback and ideas!

  - Brian


Re: [OMPI users] Q: Fortran, MPI_VERSION and #defines

2016-03-21 Thread Brian Dobbins
Hi Jeff,

On Mon, Mar 21, 2016 at 2:18 PM, Jeff Hammond 
wrote:

> You can consult http://meetings.mpi-forum.org/mpi3-impl-status-Mar15.pdf
> to see the status of all implementations w.r.t. MPI-3 as of one year ago.
>

Thank you - that's something I was curious about, and it's incredibly
helpful.  Some places seem to not update their libraries terribly often,
perhaps for stability/reproducibility reasons, and one of the primary
systems I'm using still has an MPI2 library as the default.  I suspected,
but hadn't known, that MPI3 versions were already widely available.  Anyone
else still have an MPI2 library as the default on their systems?

Calling from C code is another workable but less-than-elegant solution,
since not everyone knows C, even the basics of it, plus it adds a bit of
complexity.  I think maybe I'll just plan on 'expecting' MPI3 and using
macros to tackle the edge-case of MPI2/2.1.
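
(For the archives, by 'calling from C' I mean a thin shim of roughly this
shape - a hypothetical sketch, with the Fortran side calling it through
ISO_C_BINDING - since the C preprocessor *can* see MPI_VERSION:

#include <mpi.h>

/* Hypothetical shim: the C preprocessor sees MPI_VERSION, so the MPI-3
 * symbol is only referenced when the library provides it.  Fortran
 * handles arrive as integers and are converted with the f2c routines.  */
int neighbor_alltoall_or_fallback(void *sbuf, void *rbuf, int count,
                                  MPI_Fint ftype, MPI_Fint fcomm)
{
#if MPI_VERSION >= 3
    MPI_Datatype type = MPI_Type_f2c(ftype);
    MPI_Comm     comm = MPI_Comm_f2c(fcomm);
    return MPI_Neighbor_alltoall(sbuf, count, type, rbuf, count, type, comm);
#else
    (void)ftype; (void)fcomm;
    return -1;   /* "not available" - caller falls back to point-to-point */
#endif
}

... but it's still an extra layer that most Fortran-only folks won't love.)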

Still, I wish there was an automatic -DMPI_VERSION=30 flag (or something
similar) added implicitly by the MPI compiler wrappers.  Maybe, since
MPI_VERSION and MPI_SUBVERSION are taken, an MPI_FEATURES one (eg,
-DMPI_FEATURES=30, combining version and subversion)?  I guess it's rarely
needed except in situations where you have new codes on older systems,
though.

Any other perspectives on this?

Cheers,
  - Brian


[OMPI users] Q: Fortran, MPI_VERSION and #defines

2016-03-21 Thread Brian Dobbins
Hi everyone,

  This isn't really a problem, per se, but rather a search for a more
elegant solution.  It also isn't specific to OpenMPI, but I figure the
experience and knowledge of people here made it a suitable place to ask:

  I'm working on some code that'll be used and downloaded by others on
multiple systems, and this code is using some MPI3 features (neighborhood
collectives), but not everyone has the latest MPI libraries, many people
will be running the code on systems without these functions.

  If this were a C/C++ code, it'd be quite easy to deal with this as
'mpi.h' has MPI_VERSION as a #define, so I can use a preprocessor check to
conditionally compile either the neighbor routines or the old
point-to-point routines.  However, Fortran obviously doesn't use #define
natively, and so the mpif.h (or MPI module) simply defines MPI_VERSION as a
parameter - I can use it in the code, but not at the preprocessor level.
So, putting the MPI3 neighborhood collective in the code, even in a
non-executed codepath, results in an error when linking with an MPI2
library since the routine isn't found.
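
  (For contrast, here's the C-side sketch of what I mean, with
MPI_Neighbor_alltoall standing in for the neighborhood collective and a
hypothetical point-to-point helper as the fallback:

#include <mpi.h>

/* hypothetical point-to-point fallback, assumed to live elsewhere */
int do_point_to_point_exchange(void *sbuf, void *rbuf, int n,
                               MPI_Datatype t, MPI_Comm comm);

/* mpi.h defines MPI_VERSION as a preprocessor macro, so the MPI-3 call
 * below simply isn't compiled - or linked - when building against MPI-2. */
int exchange_halo(void *sbuf, void *rbuf, int n, MPI_Datatype t, MPI_Comm comm)
{
#if MPI_VERSION >= 3
    return MPI_Neighbor_alltoall(sbuf, n, t, rbuf, n, t, comm);
#else
    return do_point_to_point_exchange(sbuf, rbuf, n, t, comm);
#endif
}

  ... whereas the Fortran side has to get an equivalent flag from outside
the language.)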

  Obviously I can just have the user supply the -DMPI_VERSION=3 flag (or a
different one, since this *is* a parameter name) *if they know* their MPI
is version 3, and I intend to submit a patch to Cmake's FindMPI command to
detect this automatically, but is there a *better* way to do this for
projects that aren't using Cmake?  Scientists don't typically know what
version of MPI they're running, so the more that can be detected and
handled automatically the better.  (Providing stub versions that link
*after* the main library (and thus don't get chosen, I think) also seems
less than elegant.)

  To make it slightly more broad - if new MPI versions outpace library
obsolescence on existing systems, what's the ideal way to write portable
Fortran code that uses the most recent features?  Is there a way to use or
change MPI_VERSION and MPI_SUBVERSION such that they can be used to
conditionally compile code in Fortran built by standard 'Make' processes?
Is 'recommending' that the mpif90/mpif77 commands provide them a terrible,
terrible idea?

Or any other suggestions?

  Thanks,
  - Brian


Re: [OMPI users] MPI-IO Inconsistency over Lustre using OMPI 1.3

2009-03-03 Thread Brian Dobbins
Hi Nathan,

  I just ran your code here and it worked fine - CentOS 5 on dual Xeons w/
IB network, and the kernel is 2.6.18-53.1.14.el5_lustre.1.6.5smp.  I used an
OpenMPI 1.3.0 install compiled with Intel 11.0.081 and, independently, one
with GCC 4.1.2.  I tried a few different times with varying numbers of
processors.

  (Both executables were compiled with -O2)

  I'm sure the main OpenMPI guys will have better ideas, but in the meantime
what kernel, OS and compilers are you using?  And does it happen when you
write to a single OST?  Make a directory and try setting the stripe count to
1 (eg, 'lfs setstripe  1048576 0 1' will give you, I think, a
1MB stripe size starting at OST 0 and a stripe count of 1.)  I'm just wondering
whether it's something with your hardware, maybe a particular OST, since it
seems to work for me.

  ... Sorry I can't be of more help, but I imagine the regular experts will
chime in shortly.

  Cheers,
  - Brian


On Tue, Mar 3, 2009 at 12:51 PM, Nathan Baca  wrote:

> Hello,
>
> I am seeing inconsistent mpi-io behavior when writing to a Lustre file
> system using open mpi 1.3 with romio. What follows is a simple reproducer
> and output. Essentially one or more of the running processes does not read
> or write the correct amount of data to its part of a file residing on a
> Lustre (parallel) file system.
>
> Any help figuring out what is happening is greatly appreciated. Thanks,
> Nate
>
> program gcrm_test_io
>   implicit none
>   include "mpif.h"
>
>   integer X_SIZE
>
>   integer w_me, w_nprocs
>   integer  my_info
>
>   integer i
>   integer (kind=4) :: ierr
>   integer (kind=4) :: fileID
>
>   integer (kind=MPI_OFFSET_KIND):: mylen
>   integer (kind=MPI_OFFSET_KIND):: offset
>   integer status(MPI_STATUS_SIZE)
>   integer count
>   integer ncells
>   real (kind=4), allocatable, dimension (:) :: array2
>   logical sync
>
>   call mpi_init(ierr)
>   call MPI_COMM_SIZE(MPI_COMM_WORLD,w_nprocs,ierr)
>   call MPI_COMM_RANK(MPI_COMM_WORLD,w_me,ierr)
>
>   call mpi_info_create(my_info, ierr)
> ! optional ways to set things in mpi-io
> ! call mpi_info_set   (my_info, "romio_ds_read" , "enable"   , ierr)
> ! call mpi_info_set   (my_info, "romio_ds_write", "enable"   , ierr)
> ! call mpi_info_set   (my_info, "romio_cb_write", "enable", ierr)
>
>   x_size = 410011  ! A 'big' number, with bigger numbers it is more
> likely to fail
>   sync = .true.  ! Extra file synchronization
>
>   ncells = (X_SIZE * w_nprocs)
>
> !  Use node zero to fill it with nines
>   if (w_me .eq. 0) then
>   call MPI_FILE_OPEN  (MPI_COMM_SELF, "output.dat",
> MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
>   allocate (array2(ncells))
>   array2(:) = 9.0
>   mylen = ncells
>   offset = 0 * 4
>   call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL,
> "native",MPI_INFO_NULL,ierr)
>   call MPI_File_write(fileID, array2, mylen , MPI_REAL,
> status,ierr)
>   call MPI_Get_count(status,MPI_INTEGER, count, ierr)
>   if (count .ne. mylen) print*, "Wrong initial write count:",
> count,mylen
>   deallocate(array2)
>   if (sync) call MPI_FILE_SYNC (fileID,ierr)
>   call MPI_FILE_CLOSE (fileID,ierr)
>   endif
>
> !  All nodes now fill their area with ones
>   call MPI_BARRIER(MPI_COMM_WORLD,ierr)
>   allocate (array2( X_SIZE))
>   array2(:) = 1.0
>   offset = (w_me * X_SIZE) * 4 ! multiply by four, since it is real*4
>   mylen = X_SIZE
>   call MPI_FILE_OPEN  (MPI_COMM_WORLD,"output.dat",MPI_MODE_WRONLY,
> my_info, fileID, ierr)
>   print*,"node",w_me,"starting",(offset/4) +
> 1,"ending",(offset/4)+mylen
>   call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL,
> "native",MPI_INFO_NULL,ierr)
>   call MPI_File_write(fileID, array2, mylen , MPI_REAL, status,ierr)
>   call MPI_Get_count(status,MPI_INTEGER, count, ierr)
>   if (count .ne. mylen) print*, "Wrong write count:", count,mylen,w_me
>   deallocate(array2)
>   if (sync) call MPI_FILE_SYNC (fileID,ierr)
>   call MPI_FILE_CLOSE (fileID,ierr)
>
> !  Read it back on node zero to see if it is ok data
>   if (w_me .eq. 0) then
>   call MPI_FILE_OPEN  (MPI_COMM_SELF, "output.dat",
> MPI_MODE_RDONLY, my_info, fileID, ierr)
>   mylen = ncells
>   allocate (array2(ncells))
>   call MPI_File_read(fileID, array2, mylen , MPI_REAL, status,ierr)
>
>   call MPI_Get_count(status,MPI_INTEGER, count, ierr)
>   if (count .ne. mylen) print*, "Wrong read count:", count,mylen
>   do i=1,ncells
>if (array2(i) .ne. 1) then
>   print*, "ERROR", i,array2(i), ((i-1)*4),
> ((i-1)*4)/(1024d0*1024d0) ! Index, value, # of good bytes,MB
>   goto 999
>end if
>   end do
>   print*, "All done 

Re: [OMPI users] Problem with feupdateenv

2008-12-07 Thread Brian Dobbins
Hi Sangamesh,

  I think the problem is that you're loading a different version of OpenMPI
at runtime:

*[master:17781] [ 1] /usr/lib64/openmpi/libmpi.so.0 [0x34b19544b8]*

  .. The path there is to '/usr/lib64/openmpi', which is probably a
system-installed, GCC-built version.  You want to use your version in:

* /opt/openmpi_intel/1.2.8/*

  You probably just need to re-set your LD_LIBRARY_PATH environment variable
to reflect this new path, such as:
(for bash)
export LD_LIBRARY_PATH=/opt/openmpi_intel/1.2.8/lib:${LD_LIBRARY_PATH}

  ... By doing this, it should find the proper library files (assuming
that's the directory they're in - check your install!).  You may also wish to
remove the old version of OpenMPI that came with the system - a yum 'list'
command should show you the package, and then just remove it.  The
'feupdateenv' thing is more of a red herring, I think... this happens (I
think!) because the system uses a Linux version of the library instead of an
Intel one.  You can add the flag '-shared-intel' to your compile flags or
command line and that should get rid of that, if it bugs you.  Someone else
can, I'm sure, explain in far more detail what the issue there is.

  Hope that helps.. if not, post the output of 'ldd hellompi' here, as well
as an 'ls /opt/openmpi_intel/1.2.8/'

  Cheers!
  - Brian



On Sun, Dec 7, 2008 at 9:50 AM, Sangamesh B  wrote:

> Hello all,
>
> Installed Open MPI 1.2.8 with Intel C++compilers on Cent OS 4.5 based
> Rocks 4.3 linux cluster (& Voltaire infiniband). Installation was
> smooth.
>
> The following error occurred during compilation:
>
> # mpicc hellompi.c -o hellompi
> /opt/intel/cce/10.1.018/lib/libimf.so: warning: warning: feupdateenv
> is not implemented and will always fail
>
> It produced the executable. But during execution it failed with
> Segmentation fault:
>
>  # which mpirun
> /opt/openmpi_intel/1.2.8/bin/mpirun
> # mpirun -np 2 ./hellompi
> ./hellompi: Symbol `ompi_mpi_comm_world' has different size in shared
> object, consider re-linking
> ./hellompi: Symbol `ompi_mpi_comm_world' has different size in shared
> object, consider re-linking
> [master:17781] *** Process received signal ***
> [master:17781] Signal: Segmentation fault (11)
> [master:17781] Signal code: Address not mapped (1)
> [master:17781] Failing at address: 0x10
> [master:17781] [ 0] /lib64/tls/libpthread.so.0 [0x34b150c4f0]
> [master:17781] [ 1] /usr/lib64/openmpi/libmpi.so.0 [0x34b19544b8]
> [master:17781] [ 2]
> /usr/lib64/openmpi/libmpi.so.0(ompi_proc_init+0x14d) [0x34b1954cfd]
> [master:17781] [ 3] /usr/lib64/openmpi/libmpi.so.0(ompi_mpi_init+0xba)
> [0x34b19567da]
> [master:17781] [ 4] /usr/lib64/openmpi/libmpi.so.0(MPI_Init+0x94)
> [0x34b1977ab4]
> [master:17781] [ 5] ./hellompi(main+0x44) [0x401c0c]
> [master:17781] [ 6] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
> [0x34b0e1c3fb]
> [master:17781] [ 7] ./hellompi [0x401b3a]
> [master:17781] *** End of error message ***
> [master:17778] [0,0,0]-[0,1,1] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> mpirun noticed that job rank 0 with PID 17781 on node master exited on
> signal 11 (Segmentation fault).
> 1 additional process aborted (not shown)
>
> But this is not the case, during non-mpi c code compilation or execution.
>
> # icc sample.c -o sample
> # ./sample
>
> Compiler is working
> #
>
> What might be the reason for this & how it can be resolved?
>
> Thanks,
> Sangamesh
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Performance: MPICH2 vs OpenMPI

2008-10-10 Thread Brian Dobbins
Hi guys,

On Fri, Oct 10, 2008 at 12:57 PM, Brock Palen  wrote:

> Actually I had a much differnt results,
>
> gromacs-3.3.1  one node dual core dual socket opt2218  openmpi-1.2.7
>  pgi/7.2
> mpich2 gcc
>

   For some reason, the difference in minutes didn't come through, it seems,
but I would guess that if it's a medium-large difference, then it has its
roots in PGI7.2 vs. GCC rather than MPICH2 vs. OpenMPI.  Though, to be fair,
I find GCC vs. PGI (for C code) is often a toss-up - one may beat the other
handily on one code, and then lose just as badly on another.

I think my install of mpich2 may be bad, I have never installed it before,
>  only mpich1, OpenMPI and LAM. So take my mpich2 numbers with salt, Lots of
> salt.


  I think the biggest difference in performance between various MPICH2
installs comes from differences in the 'channel' used.  I tend to make sure that I
use the 'nemesis' channel, which may or may not be the default these days.
If not, though, most people would probably want it.  I think it has issues
with threading (or did ages ago?), but I seem to recall it being
considerably faster than even the 'ssm' channel.

  Sangamesh:  My advice to you would be to recompile Gromacs and specify, in
the *Gromacs* compile / configure, to use the same CFLAGS you used with
MPICH2.  Eg, "-O2 -m64", whatever.  If you do that, I bet the times between
MPICH2 and OpenMPI will be pretty comparable for your benchmark case -
especially when run on a single processor.

  Cheers,
  - Brian


Re: [OMPI users] Performance: MPICH2 vs OpenMPI

2008-10-09 Thread Brian Dobbins
On Thu, Oct 9, 2008 at 10:13 AM, Jeff Squyres  wrote:

> On Oct 9, 2008, at 8:06 AM, Sangamesh B wrote:
>
>> OpenMPI : 120m 6s
>> MPICH2 :  67m 44s
>>
>
> That seems to indicate that something else is going on -- with -np 1, there
> should be no MPI communication, right?  I wonder if the memory allocator
> performance is coming into play here.


  I'd be more inclined to double-check how the Gromacs app is being compiled
in the first place - I wouldn't think the OpenMPI memory allocator would
make anywhere near that much difference.  Sangamesh, do you know what
command line was used to compile both of these?  Someone correct me if I'm
wrong, but *if* MPICH2 embeds optimization flags in the 'mpicc' command and
OpenMPI does not, then if he's not specifying any optimization flags in the
compilation of Gromacs, MPICH2 will pass its embedded ones on to the Gromacs
compile and be faster.  I'm rusty on my GCC, too, though - does it default
to an O2 level, or does it default to no optimizations?

  Since the benchmark is readily available, I'll try running it later
today.. didn't get a chance last night.

  Cheers,
  - Brian


Re: [OMPI users] Performance: MPICH2 vs OpenMPI

2008-10-08 Thread Brian Dobbins
Hi guys,

[From Eugene Loh:]

> OpenMPI - 25 m 39 s.
>> MPICH2  -  15 m 53 s.
>>
> With regards to your issue, do you have any indication when you get that
> 25m39s timing if there is a grotesque amount of time being spent in MPI
> calls?  Or, is the slowdown due to non-MPI portions?


  Just to add my two cents: if this job *can* be run on fewer than 8
processors (ideally, even on just 1), then I'd recommend doing so.  That is,
run it with OpenMPI and with MPICH2 on 1, 2 and 4 processors as well.  If
the single-processor jobs still give vastly different timings, then perhaps
Eugene is on the right track and it comes down to various computational
optimizations, and not so much the message-passing, making the difference.
Timings from 2 and 4 process runs might be interesting as well to see how
this difference changes with process counts.

  I've seen differences between various MPI libraries before, but nothing
quite this severe either.  If I get the time, maybe I'll try to set up
Gromacs tonight -- I've got both MPICH2 and OpenMPI installed here and can
try to duplicate the runs.   Sangamesh, is this a standard benchmark case
that anyone can download and run?

  Cheers,
  - Brian


Brian Dobbins
Yale Engineering HPC


[OMPI users] Q: OpenMPI's use of /tmp and hanging apps via FS problems?

2008-08-16 Thread Brian Dobbins
Hi guys,

  I was hoping someone here could shed some light on OpenMPI's use of /tmp
(or, I guess, TMPDIR) and save me from diving into the source.. ;)

  The background is that I'm trying to run some applications on a system
which has a flaky parallel file system which TMPDIR is mapped to - so, on
start up, OpenMPI creates its 'openmpi-sessions-' directory there,
and under that, a few files.  Sometimes I see 1 subdirectory (usually a 0),
sometimes a 0 and a 1, etc.  In one of these, sometimes I see files such as
'shared_memory_pool.', and 'shared_memory_module.'.

  My questions are, one, what are the various numbers / files for?  (If
there's a write-up on this somewhere, just point me towards it!)

  And two, the real question, are these 'files' used during runtime, or only
upon startup / shutdown?  I'm having issues with various codes, especially
those heavy on messages and I/O, failing to complete a run, and haven't
resorted to sifting through strace's output yet.  This doesn't happen all
the time, but I've seen it happen reliably now with one particular code -
its success rate (it DOES succeed sometimes) is about 25% right now.  My
best guess is that this is because the file system is overloaded, thus not
allowing timely I/O or access to OpenMPI's files, but I wanted to get a
quick understanding of how these files are used by OpenMPI and whether the
FS does indeed seem a likely culprit before going with that theory for sure.

  Thanks very much,
  - Brian


Brian Dobbins
Yale Engineering HPC


Re: [OMPI users] Problem with WRF and pgi-7.2

2008-07-23 Thread Brian Dobbins
Hi Brock,

  Just to add my two cents now, I finally got around to building WRF with
PGI 7.2 as well.  I noticed that in the configure script there isn't an
option specifically for PGI (Fortran) + PGI (C), and when I try that
combination I do get the same error you have - I'm doing this on RHEL5.2,
with PGI 7.2-2.  There *is* a 7.2-3 out that I haven't tried yet, but they
don't mention anything about this particular error in the fixes section of
their documentation, so I'm guessing they haven't come across it yet.

  .. In the meantime, you *can* successfully build WRF with a PGI (Fortran)
+ GCC (C) OpenMPI install.  I just did that, and tested it with one case,
using OpenMPI-1.2.6, PGI 7.2-2 and GCC 4.1.2 on the same RHEL 5.2 system.

  In a nutshell, I'm pretty sure it's a PGI problem.  If you want to post it
in their forums, they're generally pretty responsive. (And if you don't, I
will, since it'd be nice to see it work without a hybrid MPI installation!)

  Cheers,
  - Brian


Brian Dobbins
Yale Engineering HPC


On Wed, Jul 23, 2008 at 12:09 PM, Brock Palen  wrote:

> Not yet, if you have no ideas I will open a ticket.
>
> Brock Palen
> www.umich.edu/~brockp <http://www.umich.edu/%7Ebrockp>
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
>
>
> On Jul 23, 2008, at 12:05 PM, Jeff Squyres wrote:
>
>> Hmm; I haven't seen this kind of problem before.  Have you contacted PGI?
>>
>>
>> On Jul 21, 2008, at 2:08 PM, Brock Palen wrote:
>>
>>  Hi, When compiling WRF with PGI-7.2-1  with openmpi-1.2.6
>>> The file buf_for_proc.c  fails.  Nothing specail about this file sticks
>>> out to me.  But older versions of PGI like it just fine.  The errors PGI
>>> complains about has to do with mpi.h though:
>>>
>>> [brockp@nyx-login1 RSL_LITE]$ mpicc  -DFSEEKO64_OK  -w -O3 -DDM_PARALLEL
>>>   -c buf_for_proc.c
>>> PGC-S-0036-Syntax error: Recovery attempted by inserting identifier
>>> .Z before '(' (/home/software/rhel4/openmpi-1.2.6/pgi-7.0/include/mpi.h:
>>> 823)
>>> PGC-S-0082-Function returning array not allowed
>>> (/home/software/rhel4/openmpi-1.2.6/pgi-7.0/include/mpi.h: 823)
>>> PGC-S-0043-Redefinition of symbol, MPI_Comm
>>> (/home/software/rhel4/openmpi-1.2.6/pgi-7.0/include/mpi.h: 837)
>>> PGC/x86-64 Linux 7.2-1: compilation completed with severe errors
>>>
>>> Has anyone else seen that kind of problem with mpi.h  and pgi?  Do I need
>>> to use -c89  ?  I know PGI changed the default with this a while back, but
>>> it does not appear to help.
>>>
>>> Thanks!
>>>
>>>
>>> Brock Palen
>>> www.umich.edu/~brockp <http://www.umich.edu/%7Ebrockp>
>>> Center for Advanced Computing
>>> bro...@umich.edu
>>> (734)936-1985
>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Bug in oob_tcp_[in|ex]clude?

2007-12-17 Thread Brian Dobbins
Hi Marco and Jeff,

  My own knowledge of OpenMPI's internals is limited, but I thought I'd add
my less-than-two-cents...

> I've found only a way in order to have tcp connections binded only to
> > the eth1 interface, using both the following MCA directives in the
> > command line:
> >
> > mpirun  --mca oob_tcp_include eth1 --mca oob_tcp_include
> > lo,eth0,ib0,ib1 .
> >
> > This sounds me as bug.
>
> Yes, it does.  Specifying the MCA same param twice on the command line
> results in undefined behavior -- it will only take one of them, and I
> assume it'll take the first (but I'd have to check the code to be sure).


  I *think* that Marco intended to write:
  mpirun  --mca oob_tcp_include eth1 --mca oob_tcp_exclude
lo,eth0,ib0,ib1 ...

  Is this correct?  So you're not specifying include twice, you're
specifying include *and* exclude, so each interface is explicitly stated in
one list or the other.  I remember encountering this behaviour as well, in a
slightly different format, but I can't seem to reproduce it now either.
That said, with these options, won't the MPI traffic (as opposed to the OOB
traffic) still use the eth1,ib0 and ib1 interfaces?  You'd need to add '-mca
btl_tcp_include eth1' in order to say it should only go over that NIC, I
think.

  As for the 'connection errors', two things are worth checking.  The first
is that all of your nodes using eth1 actually have correct /etc/hosts mappings
to the other nodes.  One system I ran on had this problem because different
nodes listed different IP addresses for node002.  This should be easy enough
to check by trying to run on one node first, then on two nodes that you're
sure have the correct addresses.
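For example (the addresses here are made up), every node's /etc/hosts should
agree on entries like:

    192.168.1.2   node002
    192.168.1.3   node003

rather than having node002 resolve to one address on some nodes and to a
different address on others.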

  .. The second applies if you're launching an MPMD program.  Here, you
need to use '-gmca' instead of '-mca'.

  Hope some of that is at least a tad useful.  :)

  Cheers,
  - Brian


Re: [OMPI users] Q: Problems launching MPMD applications? ('mca_oob_tcp_peer_try_connect' error 103)

2007-12-05 Thread Brian Dobbins
Hi Josh,

> I believe the problem is that you are only applying the MCA
> parameters to the first app context instead of all of them:


  Thank you *very* much.. applying the parameters with -gmca works fine with
the test case (and I'll try the actual one soon).  However (and this is
minor, since it works with method (1))...


> There are two main ways of doing this:

> 2) Alternatively you can duplicate the MCA parameters for each app context:


  .. This actually doesn't work.  I had thought of that and tried it, and I
still get the same connection problems.  I just rechecked this again to be
sure.
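For anyone who finds this thread later, the form that works (method 1) is
roughly the command from my original post with -gmca in place of -mca, so the
parameters propagate to every app context:

    mpiexec -machinefile $PBS_NODEFILE -gmca oob_tcp_if_exclude eth0 \
            -gmca pls_rsh_agent ssh -np 6 ./hwc.exe : -np 2 ./hwc.exe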

  Again, many thanks for the help!

  With best wishes,
  - Brian


Brian Dobbins
Yale University HPC


Re: [OMPI users] Q: Problems launching MPMD applications? ('mca_oob_tcp_peer_try_connect' error 103)

2007-12-05 Thread Brian Dobbins
  As a quick follow-up to my own post, I just tried this on a few other
systems:

1) On one system, where the nodes have only one ethernet device, running the
code with the split "-np" arguments works fine.
2) On another system, which uses IB links by default, the code runs fine.
3) On two very similar systems, each with two ethernet devices per node
(hence the mca parameters), the code does *not* work, giving the connection
errors shown earlier.

  I'll try a few more things tomorrow, but I have to imagine other people
have seen this, or I'm just missing a crucial mca parameter?

  Thanks very much,
  - Brian


Brian Dobbins
Yale University HPC


[OMPI users] Q: Problems launching MPMD applications? ('mca_oob_tcp_peer_try_connect' error 103)

2007-12-05 Thread Brian Dobbins
Hi guys,

  I seem to have encountered an error while trying to run an MPMD executable
through Open MPI's '-app' option, and I'm wondering if anyone else has seen
this or can verify this?

Backing up to a simple example: running a "hello world" executable (hwc.exe)
works fine when run as follows (using an interactive PBS session with -l
nodes=2:ppn=4):
 mpiexec -v -d  -machinefile $PBS_NODEFILE -mca oob_tcp_if_exclude eth0 -mca
pls_rsh_agent ssh -np 8 ./hwc.exe

But when I run what should be the same thing via an '--app' file (or the
implied command line), like the following, it fails:
 mpiexec -v -d  -machinefile $PBS_NODEFILE -mca oob_tcp_if_exclude eth0 -mca
pls_rsh_agent ssh  -np 6 ./hwc.exe : -np 2 ./hwc.exe
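(For reference, the '--app' file form I mentioned holds one app context per
line, roughly:

    -np 6 ./hwc.exe
    -np 2 ./hwc.exe

and is launched with something like 'mpiexec --app my_appfile'; the file name
here is just an example.)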

  My understanding is that these are equivalent, no?  But the latter example
fails with multiple "Software caused connection abort (103)" errors, such as
the following:
[xxx:13978] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
xx.x.2.81:34103 failed: Software caused connection abort (103)

  Any thoughts?  I haven't dug around the source yet since this could be a
weird problem with the system I'm using.  For the record, this is with
OpenMPI 1.2.4 compiled with PGI 7.1-2.

  As an aside, the '-app' syntax DOES work fine when all copies are running
on the same node.. for example, having requested 4 CPUs per node, if I run
"-np 2 ./hwc.exe : -np 2 ./hwc.exe", it works fine.  And I did also try
duplicating the mca parameters after the colon since I figured they might
not propagate, thus perhaps it was trying to use the wrong interface, but
that didn't help either.

  Thanks very much,
  - Brian


Brian Dobbins
Yale University HPC


Re: [OMPI users] OpenIB problems

2007-11-21 Thread Brian Dobbins


Hi Andrew, Brock, and everyone else,

Andrew Friedley wrote:

> If this is what I think it is, try using this MCA parameter:
> -mca btl_openib_ib_timeout 20

 Just FYI, in addition to the above, I retried using the gigabit links 
('--mca btl tcp,self', right?) and that failed too, so at least in /my/ 
case, it isn't a problem related to the IB fabrics.  I'm recompiling 
OpenMPI-1.2.4 with PGI-6.2-5 right now, and recompiling CCSM with this 
will take an hour or two, but I'll send a status update after that.  I'm 
98% certain that that configuration has worked before on a 32-bit Xeon 
with gigabit links, so while there are still lots of variables, it 
should help me narrow things down.
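For the record, the two kinds of runs being compared look roughly like this
(the executable name and rank count are just placeholders):

    # over the gigabit links only, bypassing the IB fabric entirely
    mpirun --mca btl tcp,self -np 64 ./ccsm.exe

    # over IB, with the longer timeout suggested above
    mpirun --mca btl_openib_ib_timeout 20 -np 64 ./ccsm.exe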


 Cheers,
 - Brian


Brian Dobbins
Yale University HPC



Re: [OMPI users] OpenIB problems

2007-11-21 Thread Brian Dobbins


Hi Brock,

> We have a user whose code keeps failing at a similar point in the
> code.  The errors (below) would make me think it's a fabric problem,
> but ibcheckerrors is not returning any issues.  He is using
> openmpi-1.2.0 with OFED on RHEL4.

 Strangely enough, I hit this exact problem about half an hour ago... 
what compilers is he using for the code / OpenMPI?  I haven't narrowed 
down the cause yet because the system I'm on is a tad, uh, disheveled, 
but it'd be good to find any commonality.  I'm using PGI-7.1-2 
(pgf77/pgf90) with OpenMPI-1.2.4.  The system also happens to be RHEL 4 
(Update 3).


 .. Also, the code I'm running is CCSM, and it gave an error message 
about being unable to read a file correctly right before my 
synchronization.  This code has worked on other systems in the past 
(non-IB, non-IBRIX), but something as basic as a file write shouldn't be 
adversely affected by such things, hence I'm going to try backing the 
compiler down to a 'known-good' one first, since perhaps that's my 
problem.  I don't suppose you saw any messages of that sort?   I did 
already try setting the retry count parameter up to 20 (from 7), but 
that didn't fix it.


 Cheers,
 - Brian


Brian Dobbins
Yale University HPC



Re: [OMPI users] [Fwd: MPI question/problem] including code attachments

2007-06-27 Thread Brian Dobbins
Hi guys,

  I just came across this thread while googling when I faced a similar
problem with a certain code - after scratching my head for a bit, it
turns out the solution is pretty simple. My guess is that Jeff's code
has its own copy of 'mpif.h' in its source directory, and in all
likelihood it's an MPICH version of mpif.h.  Delete it and recompile
(OpenMPI by default will look for mpif.h in the $(install)/include
directory), and you should be able to run fine.
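A quick way to confirm that guess (just a sketch, assuming a Unix shell and
the usual Open MPI compiler wrappers):

    # any local copies of MPI headers shadowing Open MPI's own?
    find . -name mpif.h -o -name mpi.h

    # show the include path the wrapper actually adds
    mpif90 -showme:compile

If a local mpif.h turns up in the source tree, that's almost certainly the one
being picked up instead of the file under $(install)/include.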

  Cheers,
  - Brian


Yale Engineering HPC


   -- original message follows --

Hello All,
This will probably turn out to be my fault as I haven't used MPI in a
few years.

I am attempting to use an MPI implementation of a "nxtval" (see the MPI
book). I am using the client-server scenario. The MPI book specifies the
three functions required. Two are collective and one is not. Only the
two collectives are tested in the supplied code. All three of the MPI
functions are reproduced in the attached code, however. I wrote a tiny
application to create and free a counter object and it fails.

I need to know whether this is a bug in the MPI book or a misunderstanding
on my part.

The complete code is attached. I was using openMPI/intel to compile and
run.

The error I get is:

> [compute-0-1.local:22637] *** An error occurred in MPI_Comm_rank
> [compute-0-1.local:22637] *** on communicator MPI_COMM_WORLD
> [compute-0-1.local:22637] *** MPI_ERR_COMM: invalid communicator
> [compute-0-1.local:22637] *** MPI_ERRORS_ARE_FATAL (goodbye)
> mpirun noticed that job rank 0 with PID 22635 on node
> "compute-0-1.local" exited on signal 15.

I've attempted to google my way to understanding but with little
success. If someone could point me to
a sample application that actually uses these functions, I would
appreciate it.

Sorry if this is the wrong list, it is not an MPICH question and I
wasn't sure where to turn.
Thanks,
--jeff