Re: [OMPI users] ucx configuration

2023-01-11 Thread Dave Love via users
Gilles Gouaillardet via users  writes:

> Dave,
>
> If there is a bug you would like to report, please open an issue at
> https://github.com/open-mpi/ompi/issues and provide all the required
> information
> (in this case, it should also include the UCX library you are using and how
> it was obtained or built).

There are hundreds of failures I was interested in resolving with the
latest versions, though I think somewhat fewer than with previous UCX
versions.

I'd like to know how it's recommended I should build to ensure I'm
starting from the right place for any investigation.  Possible interplay
between OMPI and UCX options seems worth understanding specifically, and
I think it's reasonable to ask how to configure things to work together
generally, when there are so many options without much explanation.

I have tried raising issues previously without much luck but, given the
number of failures, something is fundamentally wrong, and I doubt you
want the output from the whole set.

Perhaps the MPICH test set in a "portable" configuration is expected to
fail with OMPI for some reason, and someone can comment on that.
However, it's the only comprehensive set I know is available, and
originally even IMB crashed, so I'm not inclined to blame the tests
initially, and wonder how this stuff is tested.

[OMPI users] ucx configuration

2023-01-05 Thread Dave Love via users
I see assorted problems with OMPI 4.1 on IB, including failing many of
the mpich tests (non-mpich-specific ones) particularly with RMA.  Now I
wonder if UCX build options could have anything to do with it, but I
haven't found any relevant information.

What configure options would be recommended with CUDA and ConnectX-5 IB?
(This is on POWER, but I presume that's irrelevant.)  I assume they
should be at least

--enable-cma --enable-mt --with-cuda --with-gdrcopy --with-verbs --with-mlx5-dv

but for a start I don't know what the relationship is between the cuda,
shared memory, and multi-threading options in OMPI and UCX.
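
Concretely, the sort of recipe I have in mind is below; the prefixes and
the CUDA path are placeholders, and whether this option set is actually
right is exactly what I'm asking:

  # UCX (options as above)
  ./configure --prefix=$UCX_PREFIX --enable-cma --enable-mt \
      --with-cuda=$CUDA_HOME --with-gdrcopy --with-verbs --with-mlx5-dv
  # Open MPI built against that UCX
  ./configure --prefix=$OMPI_PREFIX --with-ucx=$UCX_PREFIX --with-cuda=$CUDA_HOME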

Thanks for any enlightenment.

Re: [OMPI users] vectorized reductions

2021-07-20 Thread Dave Love via users
Gilles Gouaillardet via users  writes:

> One motivation is packaging: a single Open MPI implementation has to be
> built, that can run on older x86 processors (supporting only SSE) and the
> latest ones (supporting AVX512).

I take dispatch on micro-architecture for granted, but it doesn't
require an assembler/intrinsics implementation.  See the level-1
routines in recent BLIS, for example (an instance where GCC was supposed
to fail).  That works for all relevant architectures, though I don't
think the aarch64 and ppc64le dispatch was ever included.  Presumably
it's less prone to errors than low-level code.

> The op/avx component will select at
> runtime the most efficient implementation for vectorized reductions.

It will select the micro-architecture with the most features, which may
or may not be the most efficient.  Is the avx512 version actually faster
than avx2?

Anyway, if this is important at scale, which I can't test, please at
least vectorize op_base_functions.c for aarch64 and ppc64le.  With GCC,
and probably other compilers -- at least clang, I think -- it doesn't
even need changes to cc flags.  With GCC and recent glibc, target clones
cover micro-arches with practically no effort.  Otherwise you probably
need similar infrastructure to what's there now, but not to devote the
effort to using intrinsics as far as I can see.
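
To make the target-clones point concrete, here's the sort of thing I
mean (my own sketch, not OMPI code; the function name is just for
illustration).  GCC 6+ with glibc's ifunc support builds one variant per
named ISA and dispatches through the resolver, so the C stays generic:

  #include <stddef.h>

  /* per-ISA clones of an elementwise sum; the best variant for the
     running CPU is picked via the ifunc resolver */
  __attribute__((target_clones("default", "avx2", "avx512f")))
  void op_sum_float(float *inout, const float *in, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          inout[i] += in[i];
  }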


[OMPI users] vectorized reductions

2021-07-19 Thread Dave Love via users
I meant to ask a while ago about vectorized reductions after I saw a
paper that I can't now find.  I didn't understand what was behind it.

Can someone explain why you need to hand-code the avx implementations of
the reduction operations now used on x86_64?  As far as I remember, the
paper didn't justify the effort past alluding to a compiler being unable
to vectorize reductions.  I wonder which compiler(s); the recent ones
I'm familiar with certainly can if you allow them (or don't stop them --
icc, sigh).  I've been assured before that GCC can't, but that's
probably due to using the default correct FP compilation and/or not
restricting function arguments.  So I wonder what's the problem just
using C and a tolerably recent GCC if necessary -- is there something
else behind this?
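
To be explicit about the compiler side (a minimal sketch, not the OMPI
source): the reduction op functions are elementwise, so there's no
loop-carried FP dependence; restrict-qualify the arguments to settle the
aliasing question and gcc -O3 vectorizes it as is.  It's only a scalar
accumulation -- a "real" reduction -- that needs reassociation flags
like -ffast-math:

  #include <stddef.h>

  /* elementwise MPI_SUM-style op: nothing here needs relaxed FP */
  void op_sum_double(double *restrict inout,
                     const double *restrict in, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          inout[i] += in[i];
  }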

Since only x86 is supported, I had a go on ppc64le and with minimal
effort saw GCC vectorizing more of the base implementation functions
than are included in the avx version.  Similarly for x86
micro-architectures.  (I'd need convincing that avx512 is worth the
frequency reduction.)  It would doubtless be the same on aarch64, say,
but I only have the POWER.

Thanks for any info.


Re: [OMPI users] 4.1 mpi-io test failures on lustre

2021-01-18 Thread Dave Love via users
"Gabriel, Edgar via users"  writes:

>> How should we know that's expected to fail?  It at least shouldn't fail like 
>> that; set_atomicity doesn't return an error (which the test is prepared for 
>> on a filesystem like pvfs2).  
>> I assume doing nothing, but appearing to, can lead to corrupt data, and I'm 
>> surprised that isn't being seen already.
>> HDF5 requires atomicity -- at least to pass its tests -- so presumably 
>> anyone like us who needs it should use something mpich-based with recent or 
>> old romio, and that sounds like most general HPC systems.  
>> Am I missing something?
>> With the current romio everything I tried worked, but we don't get that 
>> option with openmpi.
>
> First of all, it is mentioned on the FAQ sites of Open MPI, although
> admittedly it is not entirely up to date (it also lists external32 support
> as missing, which is however now available since 4.1).

Yes, the FAQ was full of confusing obsolete material when I last looked.
Anyway, users can't be expected to check whether any particular
operation is expected to fail silently.  I should have said that
MPI_File_set_atomicity(3) explicitly says the default is true for
multiple nodes, and doesn't say the call is a no-op with the default
implementation.  I don't know whether the MPI spec allows not
implementing it, but I at least expect an error return if it doesn't.
As far as I remember, that's what romio does on a filesystem like pvfs2
(or lustre when people know better than implementers and insist on
noflock); I mis-remembered from before, thinking that ompio would be
changed to do the same.  From that thread, I did think atomicity was on
its way.

Presumably an application requests atomicity for good reason, and can
take appropriate action if the status indicates it's not available on
that filesystem.
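
That is, roughly this much on the application side (a sketch of what I'd
expect to be possible, not of what ompio does; the helper name is just
for illustration):

  #include <mpi.h>
  #include <stdio.h>

  /* file error handlers default to MPI_ERRORS_RETURN, so the return
     code is meaningful without installing anything special */
  static void require_atomicity(MPI_File fh)
  {
      int err = MPI_File_set_atomicity(fh, 1);
      int flag = 0;
      if (err == MPI_SUCCESS)
          MPI_File_get_atomicity(fh, &flag);
      if (err != MPI_SUCCESS || !flag) {
          fprintf(stderr, "atomic mode unavailable on this filesystem\n");
          MPI_Abort(MPI_COMM_WORLD, 1);   /* or fall back appropriately */
      }
  }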

> You don't need atomicity for the HDF5 tests, we are passing all of them to 
> the best my knowledge, and this is one of the testsuites that we do run 
> regularly as part of our standard testing process.

I guess we're just better at breaking things.

> I am aware that they have an atomicity test -  which we pass for whatever 
> reason. This also highlights, btw, the issue(s) that I am having with the 
> atomicity option in MPI I/O. 

I don't know what the application is of atomicity in HDF5.  Maybe it
isn't required for typical operations, but I assume it's not used
blithely.  However, I'd have thought HDF5 should be prepared for
something like pvfs2, and at least not abort the test at that stage.

I've learned to be wary of declaring concurrent systems working after a
few tests.  In fact, the phdf5 test failed for me like this when I tried
across four lustre client nodes with 4.1's defaults.  (I'm confused
about the striping involved, because I thought I set it to four, and now
it shows as one on that directory.)

  ...
  Testing  -- dataset atomic updates (atomicity) 
  Proc 9: *** Parallel ERRProc 54: *** Parallel ERROR ***
  VRFY (H5Sset_hyperslab succeeded) failed at line 4293 in t_dset.c
  aborting MPI proceProc 53: *** Parallel ERROR ***

Unfortunately I hadn't turned on backtracing, and I wouldn't get another
job through for a while.

> The entire infrastructure to enforce atomicity is actually in place in ompio, 
> and I can give you the option on how to enforce strict atomic behavior for 
> all files in ompio (just not on a per file basis), just be aware that the 
> performance will nose-dive. This is not just the case with ompio, but also in 
> romio, you can read up on various discussion boards on that topic, look 
> at NFS related posts (where you need the atomicity for correctness in 
> basically all scenarios).

I'm fairly sure I accidentally ran tests successfully on NFS4, at least
single-node.  I never found a good discussion of the topic, and what I
have seen about "NFS" was probably specific to NFS3 and non-POSIX
compliance, though I don't actually care about parallel i/o on NFS.  The
information we got about lustre was direct from Rob Latham, as nothing
showed up online.

I don't like fast-but-wrong, so I think there should be the option of
correctness, especially as it's the documented default.

> Just as another data point, in the 8+ years that ompio has been available, 
> there was not one issue reported related to correctness due to missing the 
> atomicity option.

Yes, I forget some history over the years, like that one on a local
filesystem:
.

> That being said, if you feel more comfortable using romio, it is completely 
> up to you. Open MPI offers this option, and it is incredibly easy to set the 
> default parameters on a  platform for all users such that romio is being used.

Unfortunately that option fails the tests.

> We are doing with our limited resources the best we can, and while ompio is 
> by no means perfect, we try to be responsive to issues reported by users and 
> value constructive feedback and disc

Re: [OMPI users] 4.1 mpi-io test failures on lustre

2021-01-15 Thread Dave Love via users
"Gabriel, Edgar via users"  writes:

> I will have a look at those tests. The recent fixes were not
> correctness, but performance fixes.
> Nevertheless, we used to pass the mpich tests, but I admit that it is
> not a testsuite that we run regularly, I will have a look at them. The
> atomicity tests are expected to fail, since this the one chapter of
> MPI I/O that is not implemented in ompio.

How should we know that's expected to fail?  It at least shouldn't fail
like that; set_atomicity doesn't return an error (which the test is
prepared for on a filesystem like pvfs2).  I assume doing nothing, but
appearing to, can lead to corrupt data, and I'm surprised that isn't
being seen already.

HDF5 requires atomicity -- at least to pass its tests -- so presumably
anyone like us who needs it should use something mpich-based with recent
or old romio, and that sounds like most general HPC systems.  Am I
missing something?

With the current romio everything I tried worked, but we don't get that
option with openmpi.


Re: [OMPI users] bad defaults with ucx

2021-01-14 Thread Dave Love via users
"Jeff Squyres (jsquyres)"  writes:

> Good question.  I've filed
> https://github.com/open-mpi/ompi/issues/8379 so that we can track
> this.

For the benefit of the list:  I mis-remembered that osc=ucx was general
advice.  The UCX docs just say you need to avoid the uct btl, which can
cause memory corruption, but OMPI 4.1 still builds and uses it by
default.  (The UCX doc also suggests other changes to parameters, but
for performance rather than correctness.)

Anyway, I can get at least IMB-RMA to run on this Summit-like hardware
just with --mca btl ^uct (though there are failures with other tests
which seem to be specific to UCX on ppc64le, and not to OMPI).



[OMPI users] bad defaults with ucx

2021-01-14 Thread Dave Love via users
Why does 4.1 still not use the right defaults with UCX?

Without specifying osc=ucx, IMB-RMA crashes like 4.0.5.  I haven't
checked what else UCX says you must set for openmpi to avoid memory
corruption, at least, but I guess that won't be right either.
Users surely shouldn't have to explore notes for a fundamental library
to be able to run even IMB.
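
For concreteness, the difference between crashing and running is a
runtime incantation along the lines of

  mpirun -np 2 --mca osc ucx ./IMB-RMA

rather than plain mpirun, which is not something users should have to
discover from UCX's notes.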


[OMPI users] 4.1 mpi-io test failures on lustre

2021-01-14 Thread Dave Love via users
I tried mpi-io tests from mpich 3.4 with openmpi 4.1 on the ac922 system
that I understand was used to fix ompio problems on lustre.  I'm puzzled
that I still see failures.

I don't know why there are disjoint sets in mpich's test/mpi/io and
src/mpi/romio/test, but I ran all the non-Fortran ones with MCA io
defaults across two nodes.  In src/mpi/romio/test, atomicity failed
(ignoring error and syshints); in test/mpi/io, the failures were
setviewcur, tst_fileview, external32_derived_dtype, i_bigtype, and
i_setviewcur.  tst_fileview was probably killed by the 100s timeout.
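
(To compare the two io components on the same test, forcing each
explicitly works, e.g. as below; romio321 is the component name I'd
expect in this release, but check ompi_info for the exact one.)

  mpirun --mca io ompio    ./atomicity
  mpirun --mca io romio321 ./atomicity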

It may be that some are only appropriate for romio, but no-one said so
before and they presumably shouldn't segv or report libc errors.

I built against ucx 1.9 with cuda support.  I realize that has problems
on ppc64le, with no action on the issue, but there's a limit to what I
can do.  cuda looks relevant since one test crashes while apparently
trying to register cuda memory; that's presumably not ompio's fault, but
we need cuda.


Re: [OMPI users] [EXTERNAL] RMA breakage

2020-12-11 Thread Dave Love via users
"Pritchard Jr., Howard"  writes:

> Hello Dave,
>
> There's an issue opened about this -
>
> https://github.com/open-mpi/ompi/issues/8252

Thanks.  I don't know why I didn't find that, unless I searched before
it appeared.  Obviously I was wrong to think, without time to
investigate, that it didn't look system-specific.

Is anything being done to avoid having to set the MCA parameters that
are known to be needed with UCX?  It seems to be a continual source of
problems, but we need it for GPUs as far as I know.


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-12-07 Thread Dave Love via users
Ralph Castain via users  writes:

> Just a point to consider. OMPI does _not_ want to get in the mode of
> modifying imported software packages. That is a blackhole of effort we
> simply cannot afford.

It's already done that, even in flatten.c.  Otherwise updating to the
current version would be trivial.  I'll eventually make suggestions for
some changes in MPICH for standalone builds if I can verify that they
don't break things outside of OMPI.

Meanwhile we don't have a recent version that will even pass the tests
recommended here, and we've long been asking about MPI-IO on lustre.  We
should probably move to some sort of MPICH for MPI-IO on what is likely
the most common parallel filesystem, as well as for RMA on the most
common fabric.

> The correct thing to do would be to flag Rob Latham on that PR and ask
> that he upstream the fix into ROMIO so we can absorb it. We shouldn't
> be committing such things directly into OMPI itself.

It's already fixed differently in mpich, but the simple patch is useful
if there's nothing else broken.  I approve of sending fixes to MPICH,
but that will only do any good if OMPI's version gets updated from
there, which doesn't seem to happen.

> It's called "working with the community" as opposed to taking a
> point-solution approach :-)

The community has already done work to fix this properly.  It's a pity
that will be wasted.  This bit of the community is grateful for the
patch, which is reasonable to carry in packaging for now, unlike a whole
new romio.


[OMPI users] RMA breakage

2020-12-07 Thread Dave Love via users
After seeing several failures with RMA with the change needed to get
4.0.5 through IMB, I looked for simple tests.  So, I built the mpich
3.4b1 tests -- or the ones that would build, and I haven't checked why
some fail -- and ran the rma set.

Three out of 180 passed.  Many (most?) aborted in ucx, like I saw with
production code, with a backtrace like below; others at least reported
an MPI error.  This was on two nodes of a ppc64le RHEL7 IB system with
4.0.5, ucx 1.9, and MCA parameters from the ucx FAQ (though I got the
same result without those parameters).  I haven't tried to reproduce it
on x86_64, but it seems unlikely to be CPU-specific.

Is there anything we can do to run RMA without just moving to mpich?  Do
releases actually get tested on run-of-the-mill IB+Lustre systems?

+ mpirun -n 2 winname
[gpu005:50906:0:50906]  ucp_worker.c:183  Fatal: failed to set active message 
handler id 1: Invalid parameter
 backtrace (tid:  50906) 
 0 0x0005453c ucs_debug_print_backtrace()  .../src/ucs/debug/debug.c:656
 1 0x00028218 ucp_worker_set_am_handlers()  
.../src/ucp/core/ucp_worker.c:182
 2 0x00029ae0 ucp_worker_iface_deactivate()  
.../src/ucp/core/ucp_worker.c:816
 3 0x00029ae0 ucp_worker_iface_check_events()  
.../src/ucp/core/ucp_worker.c:766
 4 0x00029ae0 ucp_worker_iface_deactivate()  
.../src/ucp/core/ucp_worker.c:819
 5 0x00029ae0 ucp_worker_iface_unprogress_ep()  
.../src/ucp/core/ucp_worker.c:841
 6 0x000582a8 ucp_wireup_ep_t_cleanup()  
.../src/ucp/wireup/wireup_ep.c:381
 7 0x00068124 ucs_class_call_cleanup_chain()  
.../src/ucs/type/class.c:56
 8 0x00057420 ucp_wireup_ep_t_delete()  
.../src/ucp/wireup/wireup_ep.c:28
 9 0x00013de8 uct_ep_destroy()  .../src/uct/base/uct_iface.c:546
10 0x000252f4 ucp_proxy_ep_replace()  
.../src/ucp/core/ucp_proxy_ep.c:236
11 0x00057b88 ucp_wireup_ep_progress()  
.../src/ucp/wireup/wireup_ep.c:89
12 0x00049820 ucs_callbackq_slow_proxy()  
.../src/ucs/datastruct/callbackq.c:400
13 0x0002ca04 ucs_callbackq_dispatch()  
.../src/ucs/datastruct/callbackq.h:211
14 0x0002ca04 uct_worker_progress()  .../src/uct/api/uct.h:2346
15 0x0002ca04 ucp_worker_progress()  .../src/ucp/core/ucp_worker.c:2040
16 0xc144 progress_callback()  osc_ucx_component.c:0
17 0x000374ac opal_progress()  ???:0
18 0x0006cc74 ompi_request_default_wait()  ???:0
19 0x000e6fcc ompi_coll_base_sendrecv_actual()  ???:0
20 0x000e5530 ompi_coll_base_allgather_intra_two_procs()  ???:0
21 0x6c44 ompi_coll_tuned_allgather_intra_dec_fixed()  ???:0
22 0xdc20 component_select()  osc_ucx_component.c:0
23 0x00115b90 ompi_osc_base_select()  ???:0
24 0x00075264 ompi_win_create()  ???:0
25 0x000cb4e8 PMPI_Win_create()  ???:0
26 0x10006ecc MTestGetWin()  .../mpich-3.4b1/test/mpi/util/mtest.c:1173
27 0x10002e40 main()  .../mpich-3.4b1/test/mpi/rma/winname.c:25
28 0x00025200 generic_start_main.isra.0()  libc-start.c:0
29 0x000253f4 __libc_start_main()  ???:0

followed by the abort backtrace


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-12-02 Thread Dave Love via users
Mark Allen via users  writes:

> At least for the topic of why romio fails with HDF5, I believe this is the
> fix we need (has to do with how romio processes the MPI datatypes in its
> flatten routine).  I made a different fix a long time ago in SMPI for that,
> then somewhat more recently it was re-broke it and I had to re-fix it.  So
> the below takes a little more aggressive approach, not totally redesigning
> the flatten function, but taking over how the array size counter is handled.
> https://github.com/open-mpi/ompi/pull/3975
>
> Mark Allen

Thanks.  (As it happens, the system we're struggling on is an IBM one.)

In the meantime I've hacked in romio from mpich-3.4b1 without really
understanding what I'm doing; I think it needs some tidying up on both
the mpich and ompi sides.  That passed make check in testpar, assuming
the complaints from testpflush are the expected ones.  (I've not had
access to a filesystem with flock to run this previously.)

Perhaps it's time to update romio anyway.  It may only be relevant to
lustre, but I guess that's what most people have.


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-30 Thread Dave Love via users
As a check of mpiP, I ran HDF5 testpar/t_bigio under it.  This was on
one node with four ranks (interactively) on lustre with its default of
one 1MB stripe, ompi-4.0.5 + ucx-1.9, hdf5-1.10.7, MCA defaults.

I don't know how useful it is, but here's the summary:

romio:

  @--- Aggregate Time (top twenty, descending, milliseconds) ---
  Call               Site  Time       App%   MPI%   Count  COV
  File_write_at_all    26  2.58e+04   47.50  50.24     16  0.00
  File_read_at_all     14  2.42e+04   44.47  47.03     16  0.00
  File_set_view        29  515         0.95   1.00     16  0.09
  File_set_view         3  382         0.70   0.74     16  0.00

ompio:

  @--- Aggregate Time (top twenty, descending, milliseconds) ---
  Call               Site  Time       App%   MPI%   Count  COV
  File_read_at_all     14  3.32e+06   82.83  82.90     16  0.00
  File_write_at_all    26  6.72e+05   16.77  16.78     16  0.02
  File_set_view        11  1.14e+04    0.28   0.28     16  0.91
  File_set_view        29  340         0.01   0.01     16  0.35

with call sites

   ID Lev File/Address  Line Parent_Funct     MPI_Call
   11   0 H5FDmpio.c    1651 H5FD_mpio_write  File_set_view
   14   0 H5FDmpio.c    1436 H5FD_mpio_read   File_read_at_all
   26   0 H5FDmpio.c    1636 H5FD_mpio_write  File_write_at_all

I also looked at the romio hang in testphdf5.  In the absence of a
parallel debugger, strace and kill show an endless loop of read(...,"",0)
under this:

  [login2:115045] [ 2] 
.../mca_io_romio321.so(ADIOI_LUSTRE_ReadContig+0xa8)[0x20003d1cab88]
  [login2:115045] [ 3] 
.../mca_io_romio321.so(ADIOI_GEN_ReadStrided+0x528)[0x20003d1e4f08]
  [login2:115045] [ 4] 
.../mca_io_romio321.so(ADIOI_GEN_ReadStridedColl+0x1084)[0x20003d1e4514]
  [login2:115045] [ 5] 
.../mca_io_romio321.so(MPIOI_File_read_all+0x124)[0x20003d1c37c4]
  [login2:115045] [ 6] 
.../mca_io_romio321.so(mca_io_romio_dist_MPI_File_read_at_all+0x34)[0x20003d1c41d4]
  [login2:115045] [ 7] 
.../mca_io_romio321.so(mca_io_romio321_file_read_at_all+0x3c)[0x20003d1bdabc]
  [login2:115045] [ 8] 
.../libmpi.so.40(PMPI_File_read_at_all+0x13c)[0x2078de4c]


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-27 Thread Dave Love via users
Mark Dixon via users  writes:

> But remember that IMB-IO doesn't cover everything.

I don't know what useful operations it omits, but it was the obvious
thing to run, that should show up pathology, with simple things first.
It does at least run, which was the first concern.

> For example, hdf5's
> t_bigio parallel test appears to be a pathological case and OMPIO is 2
> orders of magnitude slower on a Lustre filesystem:
>
> - OMPI's default MPI-IO implementation on Lustre (ROMIO): 21 seconds
> - OMPI's alternative MPI-IO implementation on Lustre (OMPIO): 2554 seconds

It's less dramatic in the case I ran, but there's clearly something
badly wrong which needs profiling.  It's probably useful to know how
many ranks that's with, and whether it's the default striping.  (I
assume with default ompio fs parameters.)

> End users seem to have the choice of:
>
> - use openmpi 4.x and have some things broken (romio)
> - use openmpi 4.x and have some things slow (ompio)
> - use openmpi 3.x and everything works

I can have a look with the current or older romio, unless someone else
is going to; we should sort this.

> My concern is that openmpi 3.x is near, or at, end of life.

'Twas ever thus, but if it works?

[Posted in case it's useful, rather than discussing more locally.]


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-25 Thread Dave Love via users
I wrote: 

> The perf test says romio performs a bit better.  Also -- from overall
> time -- it's faster on IMB-IO (which I haven't looked at in detail, and
> ran with suboptimal striping).

I take that back.  I can't reproduce a significant difference for total
IMB-IO runtime, with both run in parallel on 16 ranks, using either the
system default of a single 1MB stripe or using eight stripes.  I haven't
teased out figures for different operations yet.  That must have been
done elsewhere, but I've never seen figures.
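
For anyone repeating this, the striping is per directory and trivial to
vary before creating the test files; with $TESTDIR standing for wherever
they go, something like:

  lfs setstripe -c 8 -S 1m $TESTDIR   # eight 1MB stripes
  lfs getstripe $TESTDIR              # confirm what's actually in effect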


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-23 Thread Dave Love via users
Mark Dixon via users  writes:

> Surely I cannot be the only one who cares about using a recent openmpi
> with hdf5 on lustre?

I generally have similar concerns.  I dug out the romio tests, assuming
something more basic is useful.  I ran them with ompi 4.0.5+ucx on
Mark's lustre system (similar to a few nodes of Summit, apart from the
filesystem, but with quad-rail IB which doesn't give the bandwidth I
expected).

The perf test says romio performs a bit better.  Also -- from overall
time -- it's faster on IMB-IO (which I haven't looked at in detail, and
ran with suboptimal striping).

  Test: perf
  romio321
  Access size per process = 4194304 bytes, ntimes = 5
  Write bandwidth without file sync = 19317.372354 Mbytes/sec
  Read bandwidth without prior file sync = 35033.325451 Mbytes/sec
  Write bandwidth including file sync = 1081.096713 Mbytes/sec
  Read bandwidth after file sync = 47135.349155 Mbytes/sec
  ompio
  Access size per process = 4194304 bytes, ntimes = 5
  Write bandwidth without file sync = 18442.698536 Mbytes/sec
  Read bandwidth without prior file sync = 31958.198676 Mbytes/sec
  Write bandwidth including file sync = 1081.058583 Mbytes/sec
  Read bandwidth after file sync = 31506.854710 Mbytes/sec

However, romio coll_perf fails as follows, and ompio runs.  Isn't there
mpi-io regression testing?

  [gpu025:89063:0:89063] Caught signal 11 (Segmentation fault: address not 
mapped to object at address 0x1fffbc10)
   backtrace (tid:  89063) 
   0 0x0005453c ucs_debug_print_backtrace()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucs/debug/debug.c:656
   1 0x00041b04 ucp_rndv_pack_data()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1335
   2 0x0001c814 uct_self_ep_am_bcopy()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:278
   3 0x0003f7ac uct_ep_am_bcopy()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2561
   4 0x0003f7ac ucp_do_am_bcopy_multi()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.inl:79
   5 0x0003f7ac ucp_rndv_progress_am_bcopy()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1352
   6 0x00041cb8 ucp_request_try_send()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
   7 0x00041cb8 ucp_request_send()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
   8 0x00041cb8 ucp_rndv_rtr_handler()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1754
   9 0x0001c984 uct_iface_invoke_am()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
  10 0x0001c984 uct_self_iface_sendrecv_am()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
  11 0x0001c984 uct_self_ep_am_short()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
  12 0x0002ee30 uct_ep_am_short()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
  13 0x0002ee30 ucp_do_am_single()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
  14 0x00042908 ucp_proto_progress_rndv_rtr()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:172
  15 0x0003f4c4 ucp_request_try_send()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
  16 0x0003f4c4 ucp_request_send()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
  17 0x0003f4c4 ucp_rndv_req_send_rtr()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:423
  18 0x00045214 ucp_rndv_matched()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1262
  19 0x00046158 ucp_rndv_process_rts()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1280
  20 0x00046268 ucp_rndv_rts_handler()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1304
  21 0x0001c984 uct_iface_invoke_am()  
/tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcr

[OMPI users] experience on POWER?

2020-10-24 Thread Dave Love via users
Can anyone report experience with recent OMPI on POWER (ppc64le)
hardware, e.g. Summit?  When I tried on similar nodes to Summit's (but
fewer!), the IMB-RMA benchmark SEGVs early on.  Before I try to debug
it, I'd be interested to know if anyone else has investigated that or
had better luck and, if so, how.
See 


Re: [OMPI users] relocating an installation

2019-04-10 Thread Dave Love
In fact, setting OPAL_PREFIX doesn't work for a relocated tree (with
OMPI 1.10 or 3.0).  You also need $OPAL_PREFIX/lib and
$OPAL_PREFIX/lib/openmpi on LD_LIBRARY_PATH (assuming $MPI_LIB=$MPI_HOME/lib):

  $ OPAL_PREFIX=$(pwd)/usr/lib64/openmpi3 ./usr/lib64/openmpi3/bin/mpirun true
  ./usr/lib64/openmpi3/bin/mpirun: error while loading shared libraries: libopen-rte.so.40: cannot open shared object file: No such file or directory
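
whereas with the extra directories on LD_LIBRARY_PATH, as described
above, it does run:

  $ export OPAL_PREFIX=$(pwd)/usr/lib64/openmpi3
  $ export LD_LIBRARY_PATH=$OPAL_PREFIX/lib:$OPAL_PREFIX/lib/openmpi:$LD_LIBRARY_PATH
  $ $OPAL_PREFIX/bin/mpirun true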


Re: [OMPI users] relocating an installation

2019-04-10 Thread Dave Love
"Jeff Squyres (jsquyres) via users"  writes:

> Reuti's right.
>
> Sorry about the potentially misleading use of "--prefix" -- we
> basically inherited that CLI option from a different MPI
> implementation (i.e., people asked for it).  So we were locked into
> that meaning for the "--prefix" CLI options.

Reading more closely, I see it's only for remote hosts anyhow, but I'd
be surprised if only binaries and libraries were relocated from one
tree, and only hosts that somehow were distinguished from the one
invoking mpirun (et al).

Regardless, is there some problem just documenting the relevant
variables in the obvious places?


Re: [OMPI users] relocating an installation

2019-04-10 Thread Dave Love
Reuti  writes:

>> It should be documented.
>
> There is this FAQ entry:
>
> https://www.open-mpi.org/faq/?category=building#installdirs

For what it's worth, I looked under "running" in the FAQ, as I was after
a runtime switch.  I expect FAQs to point to the actual documentation,
though, and an environment variables section of the man pages seems the
right place.  [I know you're only providing info, thanks.]


Re: [OMPI users] relocating an installation

2019-04-09 Thread Dave Love
Reuti  writes:

> export OPAL_PREFIX=
>
> to point it to the new location of installation before you start `mpiexec`.

Thanks; that's now familiar, and I don't know how I missed it with
strings.

It should be documented.  I'd have expected --prefix to have the same
effect, and for there to be an MCA variable.  Would there be some
problem with either of those?


[OMPI users] relocating an installation

2019-04-09 Thread Dave Love
Is it possible to use the environment or mpirun flags to run an OMPI
that's been relocated from where it was configured/installed?  (Say
you've unpacked a system package that expects to be under /usr and want
to run it from home without containers etc.)  I thought that was
possible, but I haven't found a way that works.  Using --prefix doesn't
find help files, at least.


Re: [OMPI users] filesystem-dependent failure building Fortran interfaces

2018-12-11 Thread Dave Love
Jeff Hammond  writes:

> Preprocessor is fine in Fortran compilers. We’ve used in NWChem for many
> years, and NWChem supports “all the compilers”.
>
> Caveats:
> - Cray dislikes recursive preprocessing logic that other compilers handle.
> You won’t use this so please ignore.
> - IBM XLF requires -WF,-DFOO=BAR instead of -DFOO=BAR but this is strictly
> a build system issue so you can ignore.
> - GCC Fortran supports classic/legacy version of CPP so you can’t use
> certain things that work in C/C++ but again, this shouldn't affect you.

Off-topic, but my experience of cpp/Fortran gotchas is wider, partly as
a GNU Fortran maintainer, though doubtless it's less of an issue in a
GNU/Linux monoculture.

-- 
The ultimate computing slogan is "Your Mileage May Vary".  — Richard O'Keefe

Re: [OMPI users] filesystem-dependent failure building Fortran interfaces

2018-12-05 Thread Dave Love
"Jeff Squyres (jsquyres) via users"  writes:

> Hi Dave; thanks for reporting.
>
> Yes, we've fixed this -- it should be included in 4.0.1.
>
> https://github.com/open-mpi/ompi/pull/6121

Good, but I'm confused; I checked the repo before reporting it.
[I wince at processing Fortran with cpp, though I don't know how
robust it is these days in GNU Fortran.]


[OMPI users] filesystem-dependent failure building Fortran interfaces

2018-12-04 Thread Dave Love
If you try to build somewhere out of tree, not in a subdir of the
source, the Fortran build is likely to fail because mpi-ext-module.F90
does

   include 
'/openmpi-4.0.0/ompi/mpiext/pcollreq/mpif-h/mpiext_pcollreq_mpifh.h'

and can exceed the fixed line length.  It either needs to add (the
compiler's equivalent of gfortran's) -ffixed-line-length-none to FFLAGS
or, I guess, set the include path; the latter may be more robust.
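
For gfortran that amounts to configuring along these lines (a sketch:
$SRCDIR stands for the out-of-tree source location, and I haven't
checked whether FFLAGS or FCFLAGS is the one the failing compile picks
up):

  $SRCDIR/configure FFLAGS=-ffixed-line-length-none \
                    FCFLAGS=-ffixed-line-length-none ...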

[The situation arises, for instance, if the source location is
read-only.  I haven't checked, but I think this was OK in v3.]


Re: [OMPI users] ompio on Lustre

2018-10-16 Thread Dave Love
"Gabriel, Edgar"  writes:

> a) if we detect a Lustre file system without flock support, we can
> printout an error message. Completely disabling MPI I/O is on the
> ompio architecture not possible at the moment, since the Lustre
> component can disqualify itself, but the generic Unix FS component
> would kick in in that case, and still continue execution. To be more
> precise, the query function of the Lustre component has no way to
> return anything than "I am interested to run" or "I am not interested
> to run"
>
> b) I can add an MCA parameter that would allow the Lustre component to
> abort execution of the job entirely. While this parameter would
> probably be by default set to 'false', a system administrator could
> configure it to be set to 'true' on a particular platform.

Assuming the operations which didn't fail for me are actually OK with
noflock (and maybe they're not in other circumstances), can't you just
do the same as ROMIO and fail with an explanation on just the ones that
will fail without flock?  That seems the best from a user's point of
view if there's an advantage to using OMPIO rather than ROMIO.

I guess it might be clear which operations are problematic if I
understood what in fs/lustre requires flock mounts and what the full
semantics of the option are, which seem to be more than what's
documented.

Thanks for looking into it, anyhow.


Re: [OMPI users] ompio on Lustre

2018-10-16 Thread Dave Love
"Latham, Robert J."  writes:

> it's hard to implement fcntl-lock-free versions of Atomic mode and
> Shared file pointer so file systems like PVFS don't support those modes
> (and return an error indicating such at open time).

Ah.  For some reason I thought PVFS had the support to pass the tests
somehow, but it's been quite a while since I used it.

> You can run lock-free for noncontiguous writes, though at a significant
> performance cost.  In ROMIO we can disable data sieving write by
> setting the hint "romio_ds_write" to "disable", which will fall back to
> piece-wise operations.  Could be OK if you know your noncontiguous
> accesses are only a little bit noncontiguous.

Does that mean it could actually support more operations (without
failing due to missing flock)?
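
For reference, that hint gets passed at open time, roughly like this
sketch (the wrapper name is just for illustration):

  #include <mpi.h>

  /* open a file with ROMIO's data-sieving writes disabled */
  MPI_File open_no_ds_write(MPI_Comm comm, const char *path)
  {
      MPI_Info info;
      MPI_File fh;
      MPI_Info_create(&info);
      MPI_Info_set(info, "romio_ds_write", "disable");
      MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
      MPI_Info_free(&info);
      return fh;
  }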

Of course, I realize one should just use flock mounts with Lustre, as I
used to.  I don't remember this stuff being written down explicitly
anywhere, though -- is it somewhere?

Thanks for the info.


Re: [OMPI users] ompio on Lustre

2018-10-15 Thread Dave Love
For what it's worth, I found the following from running ROMIO's tests
with OMPIO on Lustre mounted without flock (or localflock).  I used 48
processes on two nodes with Lustre for tests which don't require a
specific number.

OMPIO fails tests atomicity, misc, and error on ext4; it additionally
fails noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock.

On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
shared_fp, ordered_fp, and error.

Please can OMPIO be changed to fail in the same way as ROMIO (with a
clear message) for the operations it can't support without flock.
Otherwise it looks as if you can potentially get invalid data, or at
least waste time debugging other errors.

I'd debug the common failure on the "error" test, but ptrace is disabled
on the system.

In case anyone else is in the same boat and can't get mounts changed, I
suggested staging data to and from a PVFS2^WOrangeFS ephemeral
filesystem on jobs' TMPDIR local mounts if they will fit.  Of course
other libraries will potentially corrupt data on nolock mounts.


Re: [OMPI users] no openmpi over IB on new CentOS 7 system

2018-10-10 Thread Dave Love
RDMA was just broken in the last-but-one(?) RHEL7 kernel release, in
case that's the problem.  (Fixed in 3.10.0-862.14.4.)


Re: [OMPI users] ompio on Lustre

2018-10-10 Thread Dave Love
"Gabriel, Edgar"  writes:

> Ok, thanks. I usually run these test with 4 or 8, but the major item
> is that atomicity is one of the areas that are not well supported in
> ompio (along with data representations), so a failure in those tests
> is not entirely surprising.

If it's not expected to work, could it be made to return a helpful
error, rather than just not working properly?


Re: [OMPI users] ompio on Lustre

2018-10-09 Thread Dave Love
"Gabriel, Edgar"  writes:

> Hm, thanks for the report, I will look into this. I did not run the
> romio tests, but the hdf5 tests are run regularly and with 3.1.2 you
> should not have any problems on a regular unix fs. How many processes
> did you use, and which tests did you run specifically? The main tests
> that I execute from their parallel testsuite are testphdf5 and
> t_shapesame.

Using OMPI 3.1.2, in the hdf5 testpar directory I ran this as a 24-core
SMP job (so 24 processes), where $TMPDIR is on ext4:

  export HDF5_PARAPREFIX=$TMPDIR
  make check RUNPARALLEL='mpirun'

It stopped after testphdf5 spewed "Atomicity Test Failed" errors.


Re: [OMPI users] ompio on Lustre

2018-10-08 Thread Dave Love
I said I'd report back about trying ompio on lustre mounted without flock.

I couldn't immediately figure out how to run MTT.  I tried the parallel
hdf5 tests from the hdf5 1.10.3, but I got errors with that even with
the relevant environment variable to put the files on (local) /tmp.
Then it occurred to me rather late that romio would have tests.  Using
the "runtests" script modified to use "--mca io ompio" in the romio/test
directory from ompi 3.1.2 on no-flock-mounted Lustre, after building the
tests with an installed ompi-3.1.2, it did this and apparently hung at
the end:

   Testing simple.c 
   No Errors
   Testing async.c 
   No Errors
   Testing async-multiple.c 
   No Errors
   Testing atomicity.c 
  Process 3: readbuf[118] is 0, should be 10
  Process 2: readbuf[65] is 0, should be 10
  --
  MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
  with errorcode 1.
  
  NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
  You may or may not see output from other processes, depending on
  exactly when Open MPI kills them.
  --
  Process 1: readbuf[145] is 0, should be 10
   Testing coll_test.c 
   No Errors
   Testing excl.c 
  error opening file test
  error opening file test
  error opening file test

Then I ran on local /tmp as a sanity check and still got errors:

   Testing I/O functions 
   Testing simple.c 
   No Errors
   Testing async.c 
   No Errors
   Testing async-multiple.c 
   No Errors
   Testing atomicity.c 
  Process 2: readbuf[155] is 0, should be 10
  Process 1: readbuf[128] is 0, should be 10
  Process 3: readbuf[128] is 0, should be 10
  --
  MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
  with errorcode 1.
  
  NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
  You may or may not see output from other processes, depending on
  exactly when Open MPI kills them.
  --
   Testing coll_test.c 
   No Errors
   Testing excl.c 
   No Errors
   Testing file_info.c 
   No Errors
   Testing i_noncontig.c 
   No Errors
   Testing noncontig.c 
   No Errors
   Testing noncontig_coll.c 
   No Errors
   Testing noncontig_coll2.c 
   No Errors
   Testing aggregation1 
   No Errors
   Testing aggregation2 
   No Errors
   Testing hindexed 
   No Errors
   Testing misc.c 
  file pointer posn = 265, should be 10
  
  byte offset = 3020, should be 1080
  
  file pointer posn = 265, should be 10
  
  byte offset = 3020, should be 1080
  
  file pointer posn = 265, should be 10
  
  byte offset = 3020, should be 1080
  
  file pointer posn in bytes = 3280, should be 1000
  
  file pointer posn = 265, should be 10
  
  byte offset = 3020, should be 1080
  
  file pointer posn in bytes = 3280, should be 1000
  
  file pointer posn in bytes = 3280, should be 1000
  
  file pointer posn in bytes = 3280, should be 1000
  
  Found 12 errors
   Testing shared_fp.c 
   No Errors
   Testing ordered_fp.c 
   No Errors
   Testing split_coll.c 
   No Errors
   Testing psimple.c 
   No Errors
   Testing error.c 
  File set view did not return an error
   Found 1 errors
   Testing status.c 
   No Errors
   Testing types_with_zeros 
   No Errors
   Testing darray_read 
   No Errors

I even got an error with romio on /tmp (modifying the script to use
mpirun --mca io romio314):

   Testing error.c 
  Unexpected error message MPI_ERR_ARG: invalid argument of some other kind
   Found 1 errors


Re: [OMPI users] ompio on Lustre

2018-10-05 Thread Dave Love
"Gabriel, Edgar"  writes:

> It was originally for performance reasons, but this should be fixed at
> this point. I am not aware of correctness problems.
>
> However, let me try to clarify your question about: What do you
> precisely mean by "MPI I/O on Lustre mounts without flock"? Was the
> Lustre filesystem mounted without flock?

No, it wasn't (and romio complains).

> If yes, that could lead to
> some problems, we had that on our Lustre installation for a while, but
> problems were even occurring without MPI I/O in that case (although I
> do not recall all details, just that we had to change the mount
> options).

Yes, without at least localflock you might expect problems with things
like bdb and sqlite, but I couldn't see any file locking calls in the
Lustre component.  If it is a problem, shouldn't the component fail
without it, like romio does?

I have suggested ephemeral PVFS^WOrangeFS but I doubt that will be
thought useful.

> Maybe just take a testsuite (either ours or HDF5), make sure
> to run it in a multi-node configuration and see whether it works
> correctly.

For some reason I didn't think MTT, if that's what you mean, was
available, but I see it is; I'll see if I can drive it when I have a
chance.  Tests from HDF5 might be easiest, thanks for the suggestion.
I'd tried with ANL's "testmpio", which was the only thing I found
immediately, but it threw up errors even on a local filesystem, at which
stage I thought it was best to ask...  I'll report back if I get useful
results.


[OMPI users] ompio on Lustre

2018-10-05 Thread Dave Love
Is romio preferred over ompio on Lustre for performance or correctness?
If it's relevant, the context is MPI-IO on Lustre mounts without flock,
which ompio doesn't seem to require.
Thanks.


Re: [OMPI users] Old version openmpi 1.2 support infiniband?

2018-03-29 Thread Dave Love
Kaiming Ouyang  writes:

> Hi Jeff,
> Thank you for your advice. I will contact the author for some suggestions.
> I also notice I may port this old version library to new openmpi 3.0. I
> will work on this soon. Thank you.

I haven't used them, but at least the profiling part, and possibly
control, should be covered by plugins at .
(Score-P is the replacement for the vampirtrace instrumentation included
with openmpi until recently; I think the vampirtrace plugin interface is
compatible with score-p's.)


Re: [OMPI users] openib/mpi_alloc_mem pathology [#20160912-1315]

2017-10-20 Thread Dave Love
Paul Kapinos  writes:

> Hi all,
> sorry for the long long latency - this message was buried in my mailbox for
> months
>
>
>
> On 03/16/2017 10:35 AM, Alfio Lazzaro wrote:
>> Hello Dave and others,
>> we jump in the discussion as CP2K developers.
>> We would like to ask you which version of CP2K you are using in your tests 
> version 4.1 (release)
>
>> and
>> if you can share with us your input file and output log.
>
> The input file is the property of Mathias Schumacher (CC:) and we need
> permission from him to provide it.

I lost track of this, but the problem went away using libfabric instead
of openib, so I left it at that, though libfabric hurt IMB pingpong
latency compared with openib.

I seem to remember there's a workaround in the cp2k development source,
but that obviously doesn't solve the general problem.


Re: [OMPI users] Question concerning compatibility of languages used with building OpenMPI and languages OpenMPI uses to build MPI binaries.

2017-09-20 Thread Dave Love
Jeff Hammond  writes:

> Intel compilers support GOMP runtime interoperability, although I don't
> believe it is the default. You can use the Intel/LLVM OpenMP runtime with
> GCC such that all three OpenMP compilers work together.

For what it's worth, it's trivial to make a shim with a compatible
soname (packaged for Fedora but hasn't got past review).

> Fortran is a legit problem, although if somebody builds a standalone
> Fortran 2015 implementation of the MPI interface, it would be decoupled
> from the MPI library compilation.

Unfortunately gfortran isn't even compatible with itself, as Jeff
doubtless knows -- the module file format is tied to the compiler major
version and the libgfortran soname has been changed recently (rather
than versioning symbols, similarly to openmpi).


Re: [OMPI users] Question concerning compatibility of languages used with building OpenMPI and languages OpenMPI uses to build MPI binaries.

2017-09-20 Thread Dave Love
Jeff Hammond  writes:

> Please separate C and C++ here. C has a standard ABI.  C++ doesn't.
>
> Jeff

[For some value of "standard".]  I've said the same about C++, but the
current GCC manual says its C++ ABI is "industry standard", and at least
Intel document compatibility with recent GCC on GNU/Linux.  It's
standard enough to have changed for C++11 (?), with resulting grief in
package repos, for instance.

-- 
Fifty years of programming language research, and we end up with C++ ???
  — Richard O'Keefe

Re: [OMPI users] built-in memchecker support

2017-08-24 Thread Dave Love
Gilles Gouaillardet  writes:

> Dave,
>
> the builtin memchecker can detect MPI usage errors such as modifying
> the buffer passed to MPI_Isend() before the request completes

OK, thanks.  The implementation looks rather different, and it's not
clear without checking the code in detail how it differs from the
preload library (which does claim to check at least some correctness) or
why that sort of check has to be built in.

> all the extra work is protected
> if ( running_under_valgrind() ) {
>extra_checks();
> }
>
> so if you are not running under valgrind, the overhead should be unnoticeable

Thanks.  Is there a good reason not to enable it by default, then?
(Apologies that I've just found and checked the FAQ entry, and it does
actually say that, in contradiction to the paper it references.  I
assume the implementation has changed since then.)

A deficiency of the preload library I just realized is that it says it's
only MPI-2.


Re: [OMPI users] built-in memchecker support

2017-08-24 Thread Dave Love
Christoph Niethammer  writes:

> Hi Dave,
>
> The memchecker interface is an addition which allows other tools to be
> used as well.

Do you mean it allows other things to be hooked in other than through
PMPI?

> A more recent one is memPin [1].

Thanks, but Pin is proprietary, so it's no use as an alternative in this
case.


[OMPI users] built-in memchecker support

2017-08-24 Thread Dave Love
Apropos configuration parameters for packaging:

Is there a significant benefit to configuring built-in memchecker
support, rather than using the valgrind preload library?  I doubt being
able to use another PMPI tool directly at the same time counts.

Also, are there measurements of the performance impact of configuring,
but not using, it with recent hardware and software?  I don't know how
relevant the results in https://www.open-mpi.org/papers/parco-2007/
would be now, especially on a low-latency network.
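
For reference, the packaging choice I'm weighing is roughly between
configuring it in and relying on valgrind's own preload wrapper at run
time; the valgrind prefix and the wrapper's file name below are
illustrative, not exact:

  # built into the Open MPI package:
  ./configure --enable-memchecker --with-valgrind=$VALGRIND_PREFIX ...
  # versus the wrapper shipped with valgrind:
  mpirun -np 2 -x LD_PRELOAD=$VALGRIND_PREFIX/lib/valgrind/libmpiwrap-amd64-linux.so \
      valgrind ./a.out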


Re: [OMPI users] Questions about integration with resource distribution systems

2017-08-02 Thread Dave Love
Reuti  writes:

>> I should qualify that by noting that ENABLE_ADDGRP_KILL has apparently
>> never propagated through remote startup,
>
> Isn't it a setting inside SGE which the sge_execd is aware of? I never
> exported any environment variable for this purpose.

Yes, but this is surely off-topic, even though
 mentions openmpi.


Re: [OMPI users] --enable-builtin-atomics

2017-08-02 Thread Dave Love
"Barrett, Brian via users"  writes:

> Well, if you’re trying to get Open MPI running on a platform for which
> we don’t have atomics support, built-in atomics solves a problem for
> you…

That's not an issue in this case, I think.  (I'd expect it to default to
intrinsic if extrinsic support is missing.)

Re: [OMPI users] --enable-builtin-atomics

2017-08-02 Thread Dave Love
Nathan Hjelm  writes:

> So far only cons. The gcc and sync builtin atomic provide slower
> performance on x86-64 (and possible other platforms). I plan to
> investigate this as part of the investigation into requiring C11
> atomics from the C compiler.

Thanks.  Is that a gcc deficiency, or do the intrinsics just do
something different (more extensive)?


Re: [OMPI users] Questions about integration with resource distribution systems

2017-08-01 Thread Dave Love
Gilles Gouaillardet  writes:

> Dave,
>
>
> unless you are doing direct launch (for example, use 'srun' instead of
> 'mpirun' under SLURM),
>
> this is the way Open MPI is working : mpirun will use whatever the
> resource manager provides
>
> in order to spawn the remote orted (tm with PBS, qrsh with SGE, srun
> with SLURM, ...).
>
>
> then mpirun/orted will fork&exec the MPI tasks.

I know quite well how SGE works with openmpi, which isn't special --
I've done enough work on it.  SGE tracks the process tree under orted
just like under bash, even if things daemonize.  The OP was correct.

I should qualify that by noting that ENABLE_ADDGRP_KILL has apparently
never propagated through remote startup, so killing those orphans after
VASP crashes may fail, though resource reporting works.  (I never
installed a fix for want of a test system, but it's not needed with
Linux cpusets.)


[OMPI users] --enable-builtin-atomics

2017-08-01 Thread Dave Love
What are the pros and cons of configuring with --enable-builtin-atomics?
I haven't spotted any discussion of the option.


[OMPI users] absolute paths printed by info programs

2017-08-01 Thread Dave Love
ompi_info et al print absolute compiler paths for some reason.  What
would they ever be used for, and are they intended to refer to the OMPI
build or application building?  They're an issue for packaging in Guix,
at least.  Similarly, what's io_romio_complete_configure_params intended
to be used for?


Re: [OMPI users] NUMA interaction with Open MPI

2017-07-27 Thread Dave Love
Gilles Gouaillardet  writes:

> Adam,
>
> keep in mind that by default, recent Open MPI bind MPI tasks
> - to cores if -np 2
> - to NUMA domain otherwise

Not according to ompi_info from the latest release; it says socket.

> (which is a socket in most cases, unless
> you are running on a Xeon Phi)

[There have been multiple NUMA nodes per socket on x86 since Magny
Cours, and it's also relevant for POWER.  That's a reason things had to
switch to hwloc from whatever the predecessor was called.]

> so unless you specifically asked mpirun to do a binding consistent
> with your needs, you might simply try to ask no binding at all
> mpirun --bind-to none ...

Why would you want to turn off core binding?  The resource manager is
likely to supply a binding anyhow if incomplete nodes are allocated.

> i am not sure whether you can direclty ask Open MPI to do the memory
> binding you expect from the command line.

You can't control memory binding as far as I can tell.  That's
specifically important on KNL, which was brought up here some time ago.
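
The nearest thing I know of is interposing numactl (or a per-rank
wrapper script) under mpirun, e.g.

  # same memory policy for every rank on the node -- crude, but it seems
  # to be all you get from the command line
  mpirun -np 4 --bind-to core numactl --localalloc ./app

which obviously can't give different ranks different membinds without a
wrapper.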


Re: [OMPI users] Questions about integration with resource distribution systems

2017-07-27 Thread Dave Love
"r...@open-mpi.org"  writes:

> Oh no, that's not right. Mpirun launches daemons using qrsh and those
> daemons spawn the app's procs. SGE has no visibility of the app at all

Oh no, that's not right.

The whole point of tight integration with remote startup using qrsh is
to report resource usage and provide control over the job.  I'm somewhat
familiar with this.


Re: [OMPI users] openib/mpi_alloc_mem pathology

2017-03-21 Thread Dave Love
I wrote: 

> But it works OK with libfabric (ofi mtl).  Is there a problem with
> libfabric?

Apparently there is, or at least with ompi 1.10.  I've now realized IMB
pingpong latency on a QDR IB system with ompi 1.10.6+libfabric is
~2.5μs, which it isn't with ompi 1.6 openib.
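
For reference, the sort of runs behind those numbers (a sketch; each
stack was measured with its own OMPI installation, and the option
spellings below are the 1.10 ones):

  # ofi mtl over libfabric
  mpirun -np 2 --map-by node --mca pml cm --mca mtl ofi IMB-MPI1 PingPong
  # openib btl
  mpirun -np 2 --map-by node --mca pml ob1 --mca btl openib,sm,self IMB-MPI1 PingPong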

Re: [OMPI users] openib/mpi_alloc_mem pathology

2017-03-15 Thread Dave Love
Paul Kapinos  writes:

> Nathan,
> unfortunately '--mca memory_linux_disable 1' does not help on this
> issue - it does not change the behaviour at all.
>  Note that the pathological behaviour is present in Open MPI 2.0.2 as
> well as in /1.10.x, and Intel OmniPath (OPA) network-capable nodes are
> affected only.

[I guess that should have been "too" rather than "only".  It's loading
the openib btl that is the problem.]

> The known workaround is to disable InfiniBand failback by '--mca btl
> ^tcp,openib' on nodes with OPA network. (On IB nodes, the same tweak
> lead to 5% performance improvement on single-node jobs;

It was a lot more than that in my cp2k test.

> but obviously
> disabling IB on nodes connected via IB is not a solution for
> multi-node jobs, huh).

But it works OK with libfabric (ofi mtl).  Is there a problem with
libfabric?

Has anyone reported this issue to the cp2k people?  I know it's not
their problem, but I assume they'd like to know for users' sake,
particularly if it's not going to be addressed.  I wonder what else
might be affected.


Re: [OMPI users] openib/mpi_alloc_mem pathology

2017-03-09 Thread Dave Love
Nathan Hjelm  writes:

> If this is with 1.10.x or older run with --mca memory_linux_disable
> 1. There is a bad interaction between ptmalloc2 and psm2 support. This
> problem is not present in v2.0.x and newer.

Is that applicable to openib too?


Re: [OMPI users] openib/mpi_alloc_mem pathology

2017-03-09 Thread Dave Love
Paul Kapinos  writes:

> Hi Dave,
>
>
> On 03/06/17 18:09, Dave Love wrote:
>> I've been looking at a new version of an application (cp2k, for what
>> it's worth) which is calling mpi_alloc_mem/mpi_free_mem, and I don't
>
> Welcome to the club! :o)
> In our measures we see some 70% of time in 'mpi_free_mem'... and 15x
> performance loss if using Open MPI vs. Intel MPI. So it goes.
>
> https://www.mail-archive.com/users@lists.open-mpi.org//msg30593.html

Ah, that didn't match my search terms.

Did cp2k's own profile not show the site of the slowdown (MP_Mem, if I
recall correctly)?  Maybe it's a different issue, especially if IMPI
surprisingly wins so much over IB -- even if it isn't subject to the
same pathology and is using better collective algorithms.  For a
previous version of cp2k, my all-free software build was reported faster
than an all-Intel build on a similar system with faster processors.

OPA performance would be interesting if you could report it, say, for a
reasonably large cp2k quickstep run, especially if IB+libfabric results
were available on the same system.  (The two people I know who were
measuring OPA were NDA'd when I last knew.)

>> think it did so the previous version I looked at.  I found on an
>> IB-based system it's spending about half its time in those allocation
>> routines (according to its own profiling) -- a tad surprising.
>>
>> It turns out that's due to some pathological interaction with openib,
>> and just having openib loaded.  It shows up on a single-node run iff I
>> don't suppress the openib btl, and doesn't for multi-node PSM runs iff I
>> suppress openib (on a mixed Mellanox/Infinipath system).
>
> we're lucky - our issue is on Intel OmniPath (OPA) network (and we
> will junk IB hardware in near future, I think) - so we disabled the IB
> transport failback,
> --mca btl ^tcp,openib

That's what I did, but could still run with IB under OMPI 1.10 using the
ofi mtl.

> For single-node jobs this will also help on plain IB nodes,
> likely. (you can disable IB if you do not use it)

Yes, I guess I wasn't clear.

I'd still like to know the basic reason for this, and whether it's
OMPI-specific, if someone can say.


[OMPI users] openib/mpi_alloc_mem pathology

2017-03-06 Thread Dave Love
I've been looking at a new version of an application (cp2k, for what
it's worth) which is calling mpi_alloc_mem/mpi_free_mem, and I don't
think it did so the previous version I looked at.  I found on an
IB-based system it's spending about half its time in those allocation
routines (according to its own profiling) -- a tad surprising.

It turns out that's due to some pathological interaction with openib,
and just having openib loaded.  It shows up on a single-node run iff I
don't suppress the openib btl, and doesn't for multi-node PSM runs iff I
suppress openib (on a mixed Mellanox/Infinipath system).

Can anyone say why, and whether there's a workaround?  (I can't easily
diagnose what it's up to as ptrace is turned off on the system
concerned, and I can't find anything relevant in archives.)

I had the idea to try libfabric instead for multi-node jobs, and that
doesn't show the pathological behaviour iff openib is suppressed.
However, it requires ompi 1.10, not 1.8, which I was trying to use.


Re: [OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

2016-12-12 Thread Dave Love
Andreas Schäfer  writes:

>> Yes, as root, and there are N different systems to at least provide
>> unprivileged read access on HPC systems, but that's a bit different, I
>> think.
>
> LIKWID[1] uses a daemon to provide limited RW access to MSRs for
> applications. I wouldn't wonder if support for this was added to
> LIKWID by RRZE.

Yes, that's one of the N I had in mind; others provide Linux modules.

From a system manager's point of view it's not clear what are the
implications of the unprivileged access, or even how much it really
helps.  I've seen enough setups suggested for HPC systems in areas I
understand (and used by vendors) which allow privilege escalation more
or less trivially, maybe without any real operational advantage.  If
it's clearly safe and helpful then great, but I couldn't assess that.


[OMPI users] MPI+OpenMP core binding redux

2016-12-08 Thread Dave Love
I think there was a suggestion that the SC16 material would explain how
to get appropriate core binding for MPI+OpenMP (i.e. OMP_NUM_THREADS
cores/process), but it doesn't as far as I can see.

Could someone please say how you're supposed to do that in recent
versions (without relying on bound DRM slots), and provide a working
example in the documentation?  It seems a fairly important case that
should be clear.  Thanks.
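
For concreteness, this is the sort of thing I currently use, which may
or may not be the intended way (a sketch with 1.8/1.10 syntax;
./hybrid_app stands for the real program):

  export OMP_NUM_THREADS=4
  mpirun -np 8 --map-by slot:PE=4 --bind-to core --report-bindings ./hybrid_app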


Re: [OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

2016-12-08 Thread Dave Love
Jeff Hammond  writes:

>>
>>
>> > Note that MPI implementations may be interested in taking advantage of
>> > https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.
>>
>> Is that really useful if it's KNL-specific and MSR-based, with a setup
>> that implementations couldn't assume?
>>
>>
> Why wouldn't it be useful in the context of a parallel runtime system like
> MPI?  MPI implementations take advantage of all sorts of stuff that needs
> to be queried with configuration, during compilation or at runtime.

I probably should have said "useful in practice".  The difference from
other things I can think of is that access to MSRs is privileged, and
it's not clear to me what the implications are of changing it or to what
extent you can assume people will.

> TSX requires that one check the CPUID bits for it, and plenty of folks are
> happily using MSRs (e.g.
> http://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html).

Yes, as root, and there are N different systems to at least provide
unprivileged read access on HPC systems, but that's a bit different, I
think.


Re: [OMPI users] An old code compatibility

2016-11-15 Thread Dave Love
Mahmood Naderan  writes:

> Hi,
> The following mpifort command fails with a syntax error. It seems that the
> code is compatible with old gfortran, but I am not aware of that. Any idea
> about that?
>
> mpifort -ffree-form -ffree-line-length-0 -ff2c -fno-second-underscore
> -I/opt/fftw-3.3.5/include  -O3  -c xml.f90
> xml.F:641.46:
>
>CALL XML_TAG("set", comment="spin "
>   1
> Error: Syntax error in argument list at (1)
>
>
>
>
> In the source code, that line is
>
> CALL XML_TAG("set", comment="spin "//TRIM(ADJUSTL(strcounter)))

Apparently that mpifort is running cpp on .f90 files, and without
--traditional.  I've no idea how it could be set up to do that; gfortran
itself won't do it.
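
One way to see what's actually being run is to ask the wrapper (a
sketch; --showme just prints the underlying compiler command):

  mpifort --showme -ffree-form -c xml.f90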


Re: [OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

2016-11-09 Thread Dave Love
Jeff Hammond  writes:

>> I see sleeping for ‘0s’ typically taking ≳50μs on Linux (measured on
>> RHEL 6 or 7, without specific tuning, on recent Intel).  It doesn't look
>> like something you want in paths that should be low latency, but maybe
>> there's something you can do to improve that?  (sched_yield takes <1μs.)
>
> I demonstrated a bunch of different implementations with the instruction to
> "pick one of these...", where establishing the relationship between
> implementation and performance was left as an exercise for the reader :-)

The point was that only the one seemed available on RHEL6 to this
exercised reader.  No complaints about the useful list of possibilities.

> Note that MPI implementations may be interested in taking advantage of
> https://software.intel.com/en-us/blogs/2016/10/06/intel-xeon-phi-product-family-x200-knl-user-mode-ring-3-monitor-and-mwait.

Is that really useful if it's KNL-specific and MSR-based, with a setup
that implementations couldn't assume?

>> Is cpu_relax available to userland?  (GCC has an x86-specific intrinsic
>> __builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
>> gcc-4.4.)
>
> The pause instruction is available in ring3.  Just use that if cpu_relax
> wrapper is not implemented.

[OK; I meant in a userland library.]

Are there published measurements of the typical effects of spinning and
ameliorations on some sort of "representative" system?

Re: [OMPI users] mpi4py+OpenMPI: Qs about submitting bugs and examples

2016-11-07 Thread Dave Love
"r...@open-mpi.org"  writes:

>> Is this mailing list a good spot to submit bugs for OpenMPI? Or do I
>> use github?
>
> You can use either - I would encourage the use of github “issues” when
> you have a specific bug, and the mailing list for general questions

I was told not to do that, and to send here instead; README was even
changed to say so.  It doesn't seem a good way of getting issues
addressed.

Re: [OMPI users] Redusing libmpi.so size....

2016-11-07 Thread Dave Love
Mahesh Nanavalla  writes:

> Hi all,
>
> I am using openmpi-1.10.3.
>
> openmpi-1.10.3 compiled for  arm(cross compiled on X86_64 for openWRT
> linux)  libmpi.so.12.0.3 size is 2.4MB,but if i compiled on X86_64 (linux)
> libmpi.so.12.0.3 size is 990.2KB.
>
> can anyone tell how to reduce the size of libmpi.so.12.0.3 compiled for
>  arm.

Do what Debian does for armel?

  du -h lib/openmpi/lib/libmpi.so.20.0.1
  804K  lib/openmpi/lib/libmpi.so.20.0.1

[What's ompi useful for on an openWRT system?]
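
One thing worth checking -- just a guess -- is whether the cross build
has simply not been stripped; the cross strip tool name here is only
illustrative:

  ${CROSS_PREFIX}strip --strip-unneeded libmpi.so.12.0.3
  du -h libmpi.so.12.0.3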


Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-11-07 Thread Dave Love
"r...@open-mpi.org"  writes:

> Yes, I’ve been hearing a growing number of complaints about cgroups for that 
> reason. Our mapping/ranking/binding options will work with the cgroup 
> envelope, but it generally winds up with a result that isn’t what the user 
> wanted or expected.

How?  I don't understand as an implementor why there's a difference from
just resource manager core binding, assuming the programs don't try to
escape the binding.  (I'm not saying there's nothing wrong with cgroups
in general...)

> We always post the OMPI BoF slides on our web site, and we’ll do the same 
> this year. I may try to record webcast on it and post that as well since I 
> know it can be confusing given all the flexibility we expose.
>
> In case you haven’t read it yet, here is the relevant section from “man 
> mpirun”:

I'm afraid I read that, and various versions of the code at different
times, and I've worked on resource manager core binding.  I still had to
experiment to find a way to run mpi+openmp jobs correctly, in multiple
ompi versions.  NEWS usually doesn't help, nor conference talks for
people who aren't there and don't know they should search beyond the
documentation.  We don't even seem to be able to make reliable bug
reports as they may or may not get picked up here.

Regardless, I can't see how binding to socket can be a good default.

Re: [OMPI users] How to yield CPU more when not computing (was curious behavior during wait for broadcast: 100% cpu)

2016-11-07 Thread Dave Love
[Some time ago]
Jeff Hammond  writes:

> If you want to keep long-waiting MPI processes from clogging your CPU
> pipeline and heating up your machines, you can turn blocking MPI
> collectives into nicer ones by implementing them in terms of MPI-3
> nonblocking collectives using something like the following.

I see sleeping for ‘0s’ typically taking ≳50μs on Linux (measured on
RHEL 6 or 7, without specific tuning, on recent Intel).  It doesn't look
like something you want in paths that should be low latency, but maybe
there's something you can do to improve that?  (sched_yield takes <1μs.)

> I typed this code straight into this email, so you should validate it
> carefully.

...

> #elif USE_CPU_RELAX
> cpu_relax(); /*
> http://linux-kernel.2935.n7.nabble.com/x86-cpu-relax-why-nop-vs-pause-td398656.html
> */

Is cpu_relax available to userland?  (GCC has an x86-specific intrinsic
__builtin_ia32_pause in fairly recent versions, but it's not in RHEL6's
gcc-4.4.)

Re: [OMPI users] Using Open MPI with multiple versions of GCC and G++

2016-10-11 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> Especially with C++, the Open MPI team strongly recommends you
> building Open MPI with the target versions of the compilers that you
> want to use.  Unexpected things can happen when you start mixing
> versions of compilers (particularly across major versions of a
> compiler).  To be clear: compilers are *supposed* to be compatible
> across multiple versions (i.e., compile a library with one version of
> the compiler, and then use that library with an application compiled
> by a different version of the compiler), but a) there's other issues,
> such as C++ ABI issues and other run-time bootstrapping that can
> complicate things, and b) bugs in forward and backward compatibility
> happen.

Is that actually observed in GNU/Linux systems?  I'd expect it either to
work or just fail to link.  For instance, the RHEL 6 devtoolset-4 (gcc
5) uses the system libstdc++, and the system compiler is gcc 4.4.

> The short answer is in this FAQ item:
> https://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0.
> Substituting the gcc 5 compiler may work just fine.

For what it's worth, not for GNU Fortran, which unfortunately changes
the module format incompatibly with each release, or at least most
releases.
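
For reference, the override the FAQ means is just environment variables
for the wrappers, e.g. (the compiler names here are only examples):

  OMPI_CC=gcc-5 mpicc -c foo.c
  OMPI_CXX=g++-5 mpicxx -c bar.cc

but, per the above, don't expect the Fortran equivalent (OMPI_FC) to
work across gfortran versions.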


Re: [OMPI users] Launching hybrid MPI/OpenMP jobs on a cluster: correct OpenMPI flags?

2016-10-11 Thread Dave Love
Wirawan Purwanto  writes:

> Instead of the scenario above, I was trying to get the MPI processes
> side-by-side (more like "fill_up" policy in SGE scheduler), i.e. fill
> node 0 first, then fill node 1, and so on. How do I do this properly?
>
> I tried a few attempts that fail:
>
> $ export OMP_NUM_THREADS=2
> $ mpirun -np 16 -map-by core:PE=2 ./EXECUTABLE

...

> Clearly I am not understanding how this map-by works. Could somebody
> help me? There was a wiki article partially written:
>
> https://github.com/open-mpi/ompi/wiki/ProcessPlacement
>
> but unfortunately it is also not clear to me.

Me neither; this stuff has traditionally been quite unclear and really
needs documenting/explaining properly.

This sort of thing from my local instructions for OMPI 1.8 probably does
what you want for OMP_NUM_THREADS=2 (where the qrsh options just get me
a couple of small nodes):

  $ qrsh -pe mpi 24 -l num_proc=12 \
 mpirun -n 12 --map-by slot:PE=2 --bind-to core --report-bindings true |&
 sort -k 4 -n
  [comp544:03093] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
  [comp544:03093] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
  [comp544:03093] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
  [comp544:03093] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
  [comp544:03093] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././.][././B/B/./.]
  [comp544:03093] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././././B/B]
  [comp527:03056] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././.][./././././.]
  [comp527:03056] MCW rank 7 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./.][./././././.]
  [comp527:03056] MCW rank 8 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B][./././././.]
  [comp527:03056] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././././.][B/B/./././.]
  [comp527:03056] MCW rank 10 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././.][././B/B/./.]
  [comp527:03056] MCW rank 11 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][././././B/B]

I don't remember how I found that out.


Re: [OMPI users] what was the rationale behind rank mapping by socket?

2016-10-11 Thread Dave Love
Gilles Gouaillardet  writes:

> Bennet,
>
>
> my guess is mapping/binding to sockets was deemed the best compromise
> from an
>
> "out of the box" performance point of view.
>
>
> iirc, we did fix some bugs that occured when running under asymmetric
> cpusets/cgroups.
>
> if you still have some issues with the latest Open MPI version (2.0.1)
> and the default policy,
>
> could you please describe them ?

I also don't understand why binding to sockets is the right thing to do.
Binding to cores seems the right default to me, and I set that locally,
with instructions about running OpenMP.  (Isn't that what other
implementations do, which makes them look better?)

I think at least numa should be used, rather than socket.  Knights
Landing, for instance, is single-socket, so gets no actual binding by
default.
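
For what it's worth, the local default I mentioned is just a couple of
lines in the system-wide MCA parameter file (a sketch with the 1.8/1.10
parameter names):

  # etc/openmpi-mca-params.conf under the installation prefix
  rmaps_base_mapping_policy = core
  hwloc_base_binding_policy = core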


[OMPI users] specifying memory affinity

2016-09-20 Thread Dave Love
I don't think it's possible, but just to check:  can you specify memory
affinity distinct from core binding somehow with OMPI (i.e. not with
hwloc-bind as a shim under mpirun)?

It seems to be relevant in Knights Landing "hybrid" mode with separate
MCDRAM NUMA nodes as I assume you still want core binding -- or is that
not so?
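
In other words I'd like to avoid having to launch something like the
following (a sketch; node 1 stands for the MCDRAM NUMA node, and
hwloc-bind would do instead of numactl):

  mpirun -np 16 --bind-to core numactl --preferred=1 ./app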


Re: [OMPI users] Compilation without NVML support

2016-09-20 Thread Dave Love
Brice Goglin  writes:

> Hello
> Assuming this NVML detection is actually done by hwloc, I guess there's
> nothing in OMPI to disable it. It's not the first time we get such an
> issue with OMPI not having all hwloc's --disable-foo options, but I
> don't think we actually want to propagate all of them.

I'd build against the system libhwloc, if only for consistency with
other things using hwloc.

However, don't --disable-... options get passed to sub-configures
(possibly with a warning at higher levels)?  I guess that's what Gilles
meant.

> Maybe we should just force several enable_foo=no when OMPI invokes
> hwloc's configury. At least nvml, gl, opencl, libudev are likely useless
> for OMPI.
> Brice

For what it's worth, I've found the nvidia bits harmful in another
situation.  After much head scratching, I found that mysterious bus
errors crashing SGE daemons seemed connected to them, and the crashes
went away when I rebuilt the hwloc library without the stuff.  [I didn't
think I could make a useful bug report.]


Re: [OMPI users] MPI libraries

2016-09-13 Thread Dave Love
I wrote: 

> Gilles Gouaillardet  writes:
>
>> Mahmood,
>>
>> mpi_siesta is a siesta library, not an Open MPI library.
>>
>> fwiw, you might want to try again from scratch with
>> MPI_INTERFACE=libmpi_f90.a
>> DEFS_MPI=-DMPI
>> in your arch.make
>>
>> i do not think libmpi_f90.a is related to an OpenMPI library.

For completeness/accuracy:  Apologies -- it turns out that's right (as
well as the below, of course).  The link step has

  ... libmpi_f90.a ... -libmpi_f90 ...

so there's at least some excuse for confusion if you haven't looked
twice at the end of the build output.

> libmpi_f90 is the Fortran 90 library in OMPI 1.6, but presumably you
> want the shared, system version.


Re: [OMPI users] MPI libraries

2016-09-12 Thread Dave Love
Gilles Gouaillardet  writes:

> Mahmood,
>
> mpi_siesta is a siesta library, not an Open MPI library.
>
> fwiw, you might want to try again from scratch with
> MPI_INTERFACE=libmpi_f90.a
> DEFS_MPI=-DMPI
> in your arch.make
>
> i do not think libmpi_f90.a is related to an OpenMPI library.

libmpi_f90 is the Fortran 90 library in OMPI 1.6, but presumably you
want the shared, system version.

> if you need some more support, please refer to the siesta doc and/or ask on
> a siesta mailing list

I used the system MPI (which is OMPI 1.6 for historical reasons) and it
seems siesta 4.0 just built on RHEL6 with the rpm spec fragment below,
but I'm sure it would also work with 1.8.  (However, it needs cleaning
up significantly for the intended Fedora packaging.)

  %global _configure ../Src/configure
  cd Obj
  ../Src/obj_setup.sh
  %_openmpi_load
  %configure --enable-mpi
  make # not smp-safe

(%_openmpi_load just does "module load openmpi_x86_64" in this case.)


[OMPI users] mpi4py/fc20 (was: users Digest, Vol 3592, Issue 1)

2016-09-01 Thread Dave Love
"Mahdi, Sam"  writes:

> To dave, from the installation guide I found, it seemed I couldnt just
> directly download it from the package list, but rather Id need to use the
> mpicc wrapper to compile and install.

That makes no sense to a maintainer of some openmpi Fedora packages, and
I actually have mpi4py-openmpi installed and working from EPEL6.

> I also wanted to see if I could build
> it from the installation guide, sorta learn how the whole process worked.

Well, the spec file tells you how to build on the relevant version of
Fedora, including the dependencies.

> To guilles, do I need to download open mpi directly from the site to obtain
> the mpicc and to get the current version?

You said you already have the openmpi-devel package, which is what
provides it.

I really wouldn't run f20 on a typical HPC system, though.


Re: [OMPI users] Certain files for mpi missing when building mpi4py

2016-08-31 Thread Dave Love
"Mahdi, Sam"  writes:

> HI everyone,
>
> I am using a linux fedora. I downloaded/installed
> openmpi-1.7.3-1.fc20(64-bit) and openmpi-devel-1.7.3-1.fc20(64-bit). As
> well as pypar-openmpi-2.1.5_108-3.fc20(64-bit) and
> python3-mpi4py-openmpi-1.3.1-1.fc20(64-bit). The problem I am having is
> building mpi4py using the mpicc wrapper.

Why build it when you have the package?

If you do need to rebuild it for some reason, get the source rpm and
look at the recipe in the .spec file, or edit the .spec and just use
rpmbuild.

[I assume there's a good reason for F20, but it's three versions
obsolete.]
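
To spell out the rpmbuild route (a sketch; I'm assuming the Fedora
source package is just called mpi4py):

  yumdownloader --source mpi4py
  rpm -i mpi4py-*.src.rpm
  # edit ~/rpmbuild/SPECS/mpi4py.spec if need be, then
  rpmbuild -ba ~/rpmbuild/SPECS/mpi4py.spec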



Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-18 Thread Dave Love
"Audet, Martin"  writes:

> Hi Josh,
>
> Thanks for your reply. I did try setting MXM_RDMA_PORTS=mlx4_0:1 for all my 
> MPI processes
> and it did improve performance but the performance I obtain isn't completely 
> satisfying.

I raised the issue of MXM hurting p2p latency here a while ago, but
don't have a solution.  Mellanox were here last week and promised to
address that, but I haven't heard back.  I get the impression this stuff
isn't widely used, and since it's proprietary, unlike PSM, we can't
really investigate.
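
For what it's worth, the way I'd pass the variable is mpirun's -x,
rather than relying on the calling environment, e.g. (a sketch):

  mpirun -x MXM_RDMA_PORTS=mlx4_0:1 -np 2 --map-by node IMB-MPI1 PingPong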


Re: [OMPI users] SGE integration broken in 2.0.0

2016-08-18 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> On Aug 16, 2016, at 3:07 PM, Reuti  wrote:
>> 
>> Thx a bunch - that was it. Despite searching for a solution I found
>> only hints that didn't solve the issue.
>
> FWIW, we talk about this in the HACKING file, but I admit that's not
> necessarily the easiest place to find:
>
> https://github.com/open-mpi/ompi/blob/master/HACKING#L126-L129

autogen.pl tries to check versions of the tools, as one might hope, so
the question is why it fails.  The check works for me on RHEL6 if I
reverse the order of the autoconf and libtool checks.

A related question I should have asked long ago:  I don't suppose it
would have helped to catch this, but why is it necessary to configure
gridengine support specifically?  It doesn't need library support and
seems harmless at run time if you're not using gridengine, as it just
needs environment variables which are unlikely to be wrongly set --
other things rely on them to distinguish resource managers.  Similarly
for any other resource manager that works like that.


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-22 Thread Dave Love
Rob Nagler  writes:

> Thanks, John. I sometimes wonder if I'm the only one out there with this
> particular problem.
>
> Ralph, thanks for sticking with me. :) Using a pool of uids doesn't really
> work due to the way cgroups/containers works. It also would require
> changing the permissions of all of the user's files, which would create
> issues for Jupyter/Hub's access to the files, which is used for in situ
> monitoring.

Skimming back at this, like Ralph I really don't understand it as a
maintainer of a resource manager (at a level above Ralph's) and as
someone who formerly had the "pleasure" of HEP requirements which
attempted to defeat essentially any reasonable management policy.  (It
seems off-topic here.)

Amongst reasons for not running Docker, a major one that I didn't notice
raised is that containers are not started by the resource manager, but
by a privileged daemon, so the resource manager can't directly control
or monitor them.

From a brief look at Jupyter when it came up a while ago, I wouldn't
want to run it, and I wasn't alone.  (I've been lectured about the lack
of problems with such things by people on whose clusters I could
trivially run jobs as any normal user and sometimes as root.)

+1 for what Ralph said about singularity in particular.  While there's
work to be done, you could even convert docker images on the fly in a
resource manager prolog.  I'm awaiting enlightenment on the on-topic
issue of running MPI jobs with it, though.


Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-22 Thread Dave Love
"Llolsten Kaonga"  writes:

> Hello Grigory,
>
> I am not sure what Redhat does exactly but when you install the OS, there is
> always an InfiniBand Support module during the installation process. We
> never check/install that module when we do OS installations because it is
> usually several versions of OFED behind (almost obsolete).

In addition to what Peter Kjellström said:  Do you have evidence of
actual significant problems with RH's IB support?  It was an improvement
to throw out our vendor's OFED offering.  Also why run RHEL if you're
going to use things which will presumably prevent you getting support in
important areas?  (At least two OFED components were maintained by a Red
Hat employee last I knew.)


[OMPI users] 2.0 documentation

2016-06-22 Thread Dave Love
I know it's not traditional, but is there any chance of complete
documentation of the important changes in v2.0?  Currently NEWS mentions
things like minor build issues, but there's nothing, for instance, on
the addition and removal of whole frameworks, one of which I've been
trying to understand.


Re: [OMPI users] users Digest, Vol 3510, Issue 2

2016-05-25 Thread Dave Love
I wrote: 

> You could wrap one (set of) program(s) in a script to set the
> appropriate environment before invoking the real program.  

I realize I should have said something like "program invocations",
i.e. if you have no control over something invoking mpirun for programs
using different MPIs, then an mpirun wrapper needs to check what it's
being asked to run.


Re: [OMPI users] users Digest, Vol 3510, Issue 2

2016-05-24 Thread Dave Love
Megdich Islem  writes:

> Yes, Empire does the fluid structure coupling. It couples OpenFoam (fluid 
> analysis) and Abaqus (structural analysis).
> Does all the software need to have the same MPI architecture in order to 
> communicate ?

I doubt it's doing that, and presumably you have no control over abaqus,
which is a major source of pain here.

You could wrap one (set of) program(s) in a script to set the
appropriate environment before invoking the real program.  That might be
a bit painful if you need many of the OF components, but it should be
straightforward to put scripts somewhere on PATH ahead of the real
versions.
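
i.e. something like this, placed on PATH ahead of the real binary (a
sketch; the paths are only placeholders):

  #!/bin/sh
  # give the OpenFOAM side the MPI it was built against
  export PATH=/opt/openmpi/bin:$PATH
  export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
  exec /opt/openfoam/bin/simpleFoam "$@"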

On the other hand, it never ceases to amaze how difficult proprietary
engineering applications make life on HPC systems; I could believe
there's a catch.  Also you (or systems people) normally want programs to
use the system MPI, assuming that's been set up appropriately.


Re: [OMPI users] wtime implementation in 1.10

2016-05-24 Thread Dave Love
Ralph Castain  writes:

> Nobody ever filed a PR to update the branch with the patch - looks
> like you never responded to confirm that George’s proposed patch was
> acceptable.

I've never seen anything asking me about it, but I'm not an OMPI
developer in a position to review backports or even put things in a bug
tracker.

1.10 isn't used here, and I just subvert gettimeofday whenever I'm
running something that might use it for timing short intervals.

> I’ll create the PR and copy you for review
>
>
>> On May 23, 2016, at 9:17 AM, Dave Love  wrote:
>> 
>> I thought the 1.10 branch had been fixed to use clock_gettime for
>> MPI_Wtime where it's available, a la
>> https://www.open-mpi.org/community/lists/users/2016/04/28899.php -- and
>> have been telling people so!  However, I realize it hasn't, and it looks
>> as if 1.10 is still being maintained.
>> 
>> Is there a good reason for that, or could it be fixed?


[OMPI users] wtime implementation in 1.10

2016-05-23 Thread Dave Love
I thought the 1.10 branch had been fixed to use clock_gettime for
MPI_Wtime where it's available, a la
https://www.open-mpi.org/community/lists/users/2016/04/28899.php -- and
have been telling people so!  However, I realize it hasn't, and it looks
as if 1.10 is still being maintained.

Is there a good reason for that, or could it be fixed?


Re: [OMPI users] OpenMPI 1.6.5 on CentOS 7.1, silence ib-locked-pages?

2016-05-20 Thread Dave Love
Ryan Novosielski  writes:

> I’m pretty sure this is no longer relevant (having read Roland’s
> messages about it from a couple of years ago now). Can you please
> confirm that for me, and then let me know if there is any way that I
> can silence this old copy of OpenMPI that I need to use with some
> software that depends on it for some reason? It is causing my users to
> report it as an issue pretty regularly.

Does following the FAQ not have any effect?  I don't see it would do
much harm anyway.

[For what it's worth, the warning still occurs here on a very large
memory system with the recommended settings.]
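
The settings I mean are the FAQ's locked-memory ones, i.e. roughly this
in /etc/security/limits.conf (or the limits.d equivalent), in addition
to the mlx4 module parameters it mentions:

  * soft memlock unlimited
  * hard memlock unlimited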


Re: [OMPI users] Building vs packaging

2016-05-20 Thread Dave Love
dani  writes:

> I don't know about .deb packages, but at least in the rpms there is a
> post install scriptlet that re-runs ldconfig to ensure the new libs
> are in the ldconfig cache.

MPI packages following the Fedora guidelines don't do that (and rpmlint
complains bitterly as a consequence).  They rely on LD_LIBRARY_PATH via
environment modules, for better or worse:

  $ mock --shell 'rpm -q openmpi; rpm -q --scripts openmpi' 2>/dev/null
  openmpi-1.8.1-1.el6.x86_64
  $ 

[Using mock for a vanilla environment.]


Re: [OMPI users] Question about mpirun mca_oob_tcp_recv_handler error.

2016-05-16 Thread Dave Love
Ralph Castain  writes:

> This usually indicates that the remote process is using a different OMPI
> version. You might check to ensure that the paths on the remote nodes are
> correct.

That seems quite a common problem with non-obvious failure modes.

Is it not possible to have a mechanism that checks the consistency of
the components and aborts in a clear way?  I've never thought it out,
but it seems that some combination of OOB messages, library versioning
(at least with ELF) and environment variables might do it.
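
In the meantime a crude manual check is easy enough, e.g. (a sketch):

  for h in node1 node2 node3; do
    ssh $h 'ompi_info | head -2; which mpirun'
  done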


Re: [OMPI users] No core dump in some cases

2016-05-16 Thread Dave Love
Gilles Gouaillardet  writes:

> Are you sure ulimit -c unlimited is *really* applied on all hosts
>
>
> can you please run the simple program below and confirm that ?

Nothing specifically wrong with that, but it's worth installing
procenv(1) as a general solution to checking the (generalized)
environment of a job.  It's packaged for Debian/Ubuntu and Fedora/EPEL,
at least.
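
e.g. run it under the launcher to see what each node's job environment
really looks like (a sketch; I think procenv takes per-section options
like --limits, but check its manual):

  mpirun --map-by node -np 2 --tag-output procenv --limits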


Re: [OMPI users] Building vs packaging

2016-05-16 Thread Dave Love
"Rob Malpass"  writes:

> Almost in desperation, I cheated:

Why is that cheating?  Unless you specifically want a different version,
it seems sensible to me, especially as you then have access to packaged
versions of at least some MPI programs.  Likewise with rpm-based
systems, which I'm afraid I know more about.

Also the package system ensures that things don't break by inadvertently
removing their dependencies; the hwloc libraries might be an example.

> sudo  apt-get install openmpi-bin
>
>  
>
> and hey presto.   I can now do (from head node)
>
>  
>
> mpirun -H node2,node3,node4 -n 10 foo
>
>  
>
> and it works fine.   So clearly apt-get install has set something that I'd
> not done (and it's seemingly not LD_LIBRARY_PATH) as ssh node2 'echo
> $LD_LIBRARY_PATH' still returns a blank line.

No.  As I said recently, Debian installs a default MPI (via the
alternatives system) with libraries in the system search path.  Check
the library contents.
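
A quick way to see what the package set up (a sketch; the alternative
name is from memory):

  update-alternatives --list mpirun
  ldconfig -p | grep libmpi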


Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v

2016-05-06 Thread Dave Love
Gus Correa  writes:

> Hi Giacomo
>
> Some programs fail with segmentation fault
> because the stack size is too small.

Yes, the default for Intel Fortran is to allocate large-ish amounts on
the stack, which may matter when the compiled program runs.

However, look at the backtrace.  It's apparently coming from the loader,
so something is pretty screwed up, though I can't guess what.  It would
help to have debugging symbols; always use at least -g and have
GNU/Linux distribution debuginfo packages to hand.

[Probably not relevant in this case, but I try to solve problems with
the Intel compiler and MPI (sorry Jeff et al) by persuading users to
avoid them.  GCC is more reliable in my experience, and the story about
its supposedly poor code generation isn't supported by experiment (if
that counts for anything these days).]

> [But others because of bugs in memory allocation/management, etc.]
>
> Have you tried
>
> ulimit -s unlimited
>
> before you run the program?
>
> Are you using a single machine or a cluster?
> If you're using infiniband you may need also to make the locked memory
> unlimited:
>
> ulimit -l unlimited
>
> I hope this helps,
> Gus Correa
>
> On 05/05/2016 05:15 AM, Giacomo Rossi wrote:
>>   gdb /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
>> GNU gdb (GDB) 7.11
>> Copyright (C) 2016 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later
>> 
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-pc-linux-gnu".
>> Type "show configuration" for configuration details.
>> For bug reporting instructions, please see:
>> .
>> Find the GDB manual and other documentation resources online at:
>> .
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word"...
>> Reading symbols from /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90...(no
>> debugging symbols found)...done.
>> (gdb) r -v
>> Starting program: /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90 -v
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x76858f38 in ?? ()
>> (gdb) bt
>> #0  0x76858f38 in ?? ()
>> #1  0x77de5828 in _dl_relocate_object () from
>> /lib64/ld-linux-x86-64.so.2
>> #2  0x77ddcfa3 in dl_main () from /lib64/ld-linux-x86-64.so.2
>> #3  0x77df029c in _dl_sysdep_start () from
>> /lib64/ld-linux-x86-64.so.2
>> #4  0x774a in _dl_start () from /lib64/ld-linux-x86-64.so.2
>> #5  0x77dd9d98 in _start () from /lib64/ld-linux-x86-64.so.2
>> #6  0x0002 in ?? ()
>> #7  0x7fffaa8a in ?? ()
>> #8  0x7fffaab6 in ?? ()
>> #9  0x in ?? ()
>>
>> Giacomo Rossi Ph.D., Space Engineer
>>
>> Research Fellow at Dept. of Mechanical and Aerospace Engineering,
>> "Sapienza" University of Rome
>> *p: *(+39) 0692927207 | *m**: *(+39) 3408816643 | *e:
>> *giacom...@gmail.com 
>> 
>> Member of Fortran-FOSS-programmers
>> 


Re: [OMPI users] barrier algorithm 5

2016-05-06 Thread Dave Love
Gilles Gouaillardet  writes:

> Dave,
>
>
> i made PR #1644 to abort with a user friendly error message
>
> https://github.com/open-mpi/ompi/pull/1644

Thanks.  Could there be similar cases that might be worth a change?


[OMPI users] SLOAVx alltoallv

2016-05-06 Thread Dave Love
At the risk of banging on too much about collectives:

I came across a writeup of the "SLOAVx" algorithm for alltoallv
.  It was implemented
in OMPI with apparently good results, but I can't find any code.

I wonder if anyone knows the story on that.  Was it not contributed, or
is it actually not worthwhile?  Otherwise, might it be worth investigating?


Re: [OMPI users] barrier algorithm 5

2016-05-04 Thread Dave Love
Gilles Gouaillardet  writes:

> Dave,
>
> yes, this is for two MPI tasks only.
>
> the MPI subroutine could/should return with an error if the communicator is
> made of more than 3 tasks.
> an other option would be to abort at initialization time if no collective
> modules provide a barrier implementation.
> or maybe the tuned module should have not used the two_procs algorithm, but
> what should it do instead ? use a default one ? do not implement barrier ?
> warn/error the end user ?
>
> note the error message might be a bit obscure.
>
> I write "could" because you explicitly forced something that cannot work,
> and I am not convinced OpenMPI should protect end users from themselves,
> even when they make an honest mistake.

I just looped over the available algorithms, not expecting any not to
work.  One question is how I'd know it can't work; I can't find
documentation on the algorithms, just the more-or-less suggestive names
that I might be able to find in the literature, or not.  Is there a good
place to look?
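
The closest I've found is the enumeration ompi_info prints, which at
least names the algorithms and their valid values:

  ompi_info --param coll tuned --level 9 | grep barrier_algorithm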

In the absence of a good reason why not -- I haven't looked at the code
-- but I'd expect it to abort with a message about the algorithm being
limited to two processes at some stage.  Of course, this isn't a common
case, and people probably have more important things to do.


[OMPI users] barrier algorithm 5

2016-05-04 Thread Dave Love
With OMPI 1.10.2 and earlier on Infiniband, IMB generally spins with no
output for the barrier benchmark if you run it with algorithm 5, i.e.

  mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_barrier_algorithm 5 IMB-MPI1 barrier

This is "two proc only".  Does that mean it will only work for two
processes (which seems true experimentally)?  If so, should it report an
error if used with more?


Re: [OMPI users] Ubuntu and LD_LIBRARY_PATH

2016-05-03 Thread Dave Love
John Hearns  writes:

> May I ask though - what is the purpose of your cluster?
> If you are using Ubunutu, have you looked at Qlustar?
> https://www.qlustar.com/
> Might save you a whole lot of heartache!

Well, proprietary cluster management systems have only given me grief
until I've replaced them.  Anyhow, it shouldn't be necessary to
re-install a cluster to make the MPI work.

The main reason I see for not using distribution packages directly is
that they're typically not built with the features you need, e.g. your
parallel filesystem, knem, a different interconnect &c., but in that
case it's normally easy to rebuild from modified package source with
extra build requirements and modified configure options.
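
On Debian/Ubuntu that's roughly the following (a sketch; the extra
configure flag and its path are only examples):

  apt-get source openmpi
  sudo apt-get build-dep openmpi
  cd openmpi-*
  # add e.g. --with-knem=/opt/knem to the configure flags in debian/rules
  dpkg-buildpackage -us -uc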


[OMPI users] collective tuning (was: MPI_Bcast implementations in OpenMPI)

2016-05-03 Thread Dave Love
George Bosilca  writes:

>> On Apr 25, 2016, at 11:33 , Dave Love  wrote:
>> 
>> George Bosilca  writes:

>>> I have recently reshuffled the tuned module to move all the algorithms
>>> in the base and therefore make them available to other collective
>>> modules (the code is available in master and 1.10 and the future
>>> 2.0). This move has the potential for allowing different decision
>>> schemes to coexists, and be dynamically selected at runtime based on
>>> network properties, network topology, or even applications needs. I
>>> continue to have hopes that network vendors will eventually get
>>> interested in tailoring the collective selection to match their
>>> network capabilities, and provide their users with a performance boost
>>> by allowing for network specific algorithm selection.
>> 
>> That sounds useful, assuming the speed is generally dominated by the
>> basic fabric.  What's involved in making the relevant measurements and
>> plugging them in?  I did look at using OTPO(?) to check this sort of
>> thing once.  I couldn't make it work in the time I had, but Periscope
>> might be a good alternative now.
>
> It is a multidimensional space optimization problem.

Sure, but it's not clear to me that I understand it well enough to
optimize in principle.

> The critical
> point is identifying the switching points between different algorithms
> based on their performance (taking in account, at least, physical
> topology, number of processes and amount of data).

Runs of IMB don't necessarily reveal clear switch points (which I could
believe means there's something wrong with them...).

> The paper I sent on
> one of my previous email discusses how we did the decision functions
> on the current implementation. There are certainly better ways, but
> the one we took at least did not involve any extra software, and was
> done using simple scripts.

I'd looked at it, but I couldn't see much about doing the measurements.
I thought there was a paper (from UTK?) on the OMPI web site which was
more about that, but I can't find it.

>> If it's fairly mechanical -- maybe even if not -- it seems like
>> something that should just be done regardless of vendors.  I'm sure
>> plenty of people could measure QDR fat tree, for a start (at least where
>> measurement isn’t frowned upon).
>
> Based on feedback from the user mailing list, several users did such
> optimizations for their specific applications.

That sort of thing is mainly what prompted me to ask.  (And I see plenty
of pretty useless benchmark-type "studies" that make more-or-less
absolute statements about MPIs' relative speed without even saying what
parameters were used.)  One thing I don't know is whether this is likely
to be significantly application specific, as I've seen suggested.

Presumably there's m(va)pich work on this that might be useful if they
use the same algorithms, but I couldn't find a relevant write-up.

> This makes the
> optimization problem much simpler, as some of the parameters have
> discrete values (message size). If we assume a symmetric network, and
> have a small number of message sizes of interest, it is enough to run
> few benchmarks (skampi, to the IMB test on the collective of
> interest), and manually finding the switch point is a relatively
> simple process.

I've looked at alltoallv, which is important for typical chemistry codes
whose users have an insatiable appetite for cycles.  To start with it's
not clear how useful IMB is as it's not exercising the "v".  Then for
low-ish process counts I've seen the relative speed of the two
algorithms all over the place.  However, 2 appears best overall, but
when I profiled an application, I got ~30% speedup by switching to 1.
To a hard-bitten experimentalist, this just suggests too little
understanding to make useful measurements, and that it would be useful
to have a good review of the issues -- presumably for current sorts of
interconnect.  Does one exist?
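
(For concreteness, "switching to 1" means forcing the algorithm with the
usual dynamic-rules switches, roughly:

  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_alltoallv_algorithm 1 ...

against the default selection.)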


Re: [OMPI users] Ubuntu and LD_LIBRARY_PATH

2016-04-26 Thread Dave Love
"Rob Malpass"  writes:

> Hi 
>
>  
>
> Sorry if this isn't 100% relevant to this list but I'm at my wits end.
>
>  
>
> After a lot of hacking, I've finally configured openmpi on my Ubuntu
> cluster.   I had been having awful problems with not being able to find the
> libraries on the remote nodes but apparently the workaround is to use
> ld.conf.so.d/*.conf

That shouldn't be necessary with Debian/Ubuntu packages; there's a
default MPI set through alternatives.  If that isn't working, make an
Ubuntu bug report, but it seems OK in Debian stable.

If you're not using a packaged version (why?), the usual way to set the
environment is with environment modules (the environment-modules
package).


Re: [OMPI users] Porting MPI-3 C-program to Fortran

2016-04-25 Thread Dave Love
Tom Rosmond  writes:

> Thanks for replying, but the difference between what can be done in C
> vs fortran is still my problem.  I apologize for my rudimentary
> understanding of C, but here is a brief summary:

I'm not an expert on this stuff, just cautioning about Fortran semantics
where I could imagine it's important.

I guess you'll need to read what the standard says about the routine,
and use a compiler that supports iso_c_binding, which is the right way
to interface to C.  (The OMPI man pages don't document the modern
Fortran interface, unfortunately.)

If it's not clear how to use it, perhaps one of the Committee can clarify.


Re: [OMPI users] MPI_Bcast implementations in OpenMPI

2016-04-25 Thread Dave Love
George Bosilca  writes:

> Dave,
>
> You are absolutely right, the parameters are now 6-7 years old,
> gathered on interconnects long gone. Moreover, several discussions in
> this mailing list indicated that they do not match current network
> capabilities.
>
> I have recently reshuffled the tuned module to move all the algorithms
> in the base and therefore make them available to other collective
> modules (the code is available in master and 1.10 and the future
> 2.0). This move has the potential for allowing different decision
> schemes to coexists, and be dynamically selected at runtime based on
> network properties, network topology, or even applications needs. I
> continue to have hopes that network vendors will eventually get
> interested in tailoring the collective selection to match their
> network capabilities, and provide their users with a performance boost
> by allowing for network specific algorithm selection.

That sounds useful, assuming the speed is generally dominated by the
basic fabric.  What's involved in making the relevant measurements and
plugging them in?  I did look at using OTPO(?) to check this sort of
thing once.  I couldn't make it work in the time I had, but Periscope
might be a good alternative now.

If it's fairly mechanical -- maybe even if not -- it seems like
something that should just be done regardless of vendors.  I'm sure
plenty of people could measure QDR fat tree, for a start (at least where
measurement isn't frowned upon).


Re: [OMPI users] Porting MPI-3 C-program to Fortran

2016-04-22 Thread Dave Love
Jeff Hammond  writes:

> MPI uses void** arguments to pass pointer by reference so it can be
> updated. In Fortran, you always pass by reference so you don't need
> this.

I don't know if it's relevant in this case, but that's not generally
true (even for Fortran 77, for which I used to know the standard
more-or-less by heart).  It definitely isn't true for gfortran, and I'm
confident not for the Intel compiler, or it would miss optimizations.
You may get away with assuming call-by-reference, but you're likely to
get bitten if you don't obey the argument association rules.


Re: [OMPI users] MPI_Bcast implementations in OpenMPI

2016-04-22 Thread Dave Love
George Bosilca  writes:

> Matthieu,
>
> If you are talking about how Open MPI selects between different broadcast
> algorithms you might want to read [1]. We have implemented a dozen
> different broadcast algorithms and have run a set of tests to measure their
> performance. 

I'd been meaning to ask about this sort of thing as I didn't find anything
written about it.

It seems the measurements on which the collective parameter defaults are
based were originally from interconnects with at least an order of
magnitude difference in performance from typical current ones, and maybe
different topology.

Have parameters been revisited since, or is it clear they'll still be
valid for, say, FDR IB?  I know a case that was changed a while ago, but
the new alltoallv default algorithm hurt performance on typical
chemistry code that might constitute the majority of its use, and it
wasn't clear why the change was made.

I assume it could be useful to know how things were derived to indicate
when it might be worth trying different values as it often seems
worthwhile to do so.


Re: [OMPI users] resolution of MPI_Wtime

2016-04-11 Thread Dave Love
George Bosilca  writes:

> MPI_Wtick is not about the precision but about the resolution of the
> underlying timer (aka. the best you can hope to get).

What's the distinction here?  (clock_getres(2) says "resolution
(precision)".)

My point (like JH's?) is that it doesn't generally return the interval
between ticks, so it seems neither very useful nor compliant with the spec --
whether or not the spec is reasonable or the clock has reasonable
resolution.

For instance, on a (particular?) sandybridge system, the interval for
CLOCK_MONOTONIC is experimentally ~30ns, not 1; clock_getres itself
isn't accurate.  On a particular core2 system, it appears to be an order
of magnitude bigger, but the clock ticks at least once per call in that
case.

> Thus, the measured
> time will certainly be larger, but, and this is almost a certainty, it will
> hardly be smaller.  As a result, I am doubtful that an MPI implementation
> will provide an MPI_Wtime with a practical resolution smaller that whatever
> the corresponding MPI_Wtick returns.

I don't think it's an issue not having a lower bound on resolution, but
isn't that the case with non-Linux high-res timers used by OMPI now?  My
technique when gettimeofday turns up as a timer is to replace it on
Linux with clock_gettime via an LD_PRELOAD, which seems legitimate.
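
i.e. compile a trivial shim that implements gettimeofday in terms of
clock_gettime and preload it (a sketch; gtod_shim.c is just my own
wrapper, nothing distributed):

  gcc -shared -fPIC -o gtod_shim.so gtod_shim.c -lrt
  LD_PRELOAD=$PWD/gtod_shim.so mpirun -np 2 IMB-MPI1 PingPong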

Not understanding this can definitely lead to bogus results on the happy
occasions when users and others are actually prepared to make
measurements, and despite the general practice, measurements without
good error estimates are pretty meaningless.  (No apologies for an
experimental physics background!)  The example which got me looking at
current OMPI was link latency tests which suggested there was something
badly wrong with the fabric.

