Re: [OMPI users] [EXTERNAL] Re: OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2

2018-07-03 Thread Nathan Hjelm via users

Found the issue. PR #5374 fixes it. The fix will make its way into the v3.0.x and 
v3.1.x release series.

-Nathan

On Jul 02, 2018, at 02:36 PM, Nathan Hjelm  wrote:


The result should be the same with v3.1.1. I will investigate on our Coral test 
systems.

-Nathan

On Jul 02, 2018, at 02:23 PM, "Hammond, Simon David via users" 
 wrote:

Howard,

 

This fixed the issue with OpenMPI 3.1.0. Do you want me to try the same with 
3.1.1 as well?

 

S.

 

-- 

Si Hammond

Scalable Computer Architectures

Sandia National Laboratories, NM, USA

 

 

From: users  on behalf of Howard Pritchard 

Reply-To: Open MPI Users 
Date: Monday, July 2, 2018 at 1:34 PM
To: Open MPI Users 
Subject: Re: [OMPI users] [EXTERNAL] Re: OpenMPI 3.1.0 Lock Up on POWER9 w/ 
CUDA9.2

 

Hi Si,

 

Could you add --disable-builtin-atomics to the configure options and see if the hang goes away?

 

Howard

 

 

2018-07-02 8:48 GMT-06:00 Jeff Squyres (jsquyres) via users 
:

Simon --

You don't currently have another Open MPI installation in your PATH / 
LD_LIBRARY_PATH, do you?

I have seen dependent library loads cause "make check" to get confused: instead 
of loading the libraries from the build tree, it actually loads some -- but not all 
-- of the required OMPI/ORTE/OPAL/etc. libraries from an installation tree.  Hilarity 
ensues (including symptoms such as running forever).

Can you double check that you have no Open MPI libraries in your LD_LIBRARY_PATH before 
running "make check" on the build tree?





On Jun 30, 2018, at 3:18 PM, Hammond, Simon David via users 
 wrote:

Nathan,

Same issue with OpenMPI 3.1.1 on POWER9 with GCC 7.2.0 and CUDA9.2.

S.

--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]


On 6/16/18, 10:10 PM, "Nathan Hjelm"  wrote:

    Try the latest nightly tarball for v3.1.x. Should be fixed. 


On Jun 16, 2018, at 5:48 PM, Hammond, Simon David via users 
 wrote:

The output from the test in question is:

Single thread test. Time: 0 s 10182 us 10 nsec/poppush
Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush


S.

--
Si Hammond
Scalable Computer Architectures
Sandia National Laboratories, NM, USA
[Sent from remote connection, excuse typos]


On 6/16/18, 5:45 PM, "Hammond, Simon David"  wrote:

   Hi OpenMPI Team,

   We have recently updated an install of OpenMPI on a POWER9 system (configuration details 
below). We migrated from OpenMPI 2.1 to OpenMPI 3.1. We seem to have a symptom where code 
that ran before is now locking up and making no progress, getting stuck in wait-all 
operations. While I think it's prudent for us to root-cause this a little more, I have 
gone back and rebuilt MPI and re-run the "make check" tests. The opal_fifo test 
appears to hang forever. I am not sure if this is the cause of our issue, but I wanted to 
report that we are seeing this on our system.

   OpenMPI 3.1.0 Configuration:

   ./configure 
--prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.1.0-nomxm/gcc/7.2.0/cuda/9.2.88
 --with-cuda=$CUDA_ROOT --enable-mpi-java --enable-java 
--with-lsf=/opt/lsf/10.1 
--with-lsf-libdir=/opt/lsf/10.1/linux3.10-glibc2.17-ppc64le/lib --with-verbs

   GCC versions are 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA for 
POWER9 (standard download from their website). We enable IBM's JDK 8.0.0.
   RedHat: Red Hat Enterprise Linux Server release 7.5 (Maipo)

   Output:

   make[3]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
   make[4]: Entering directory `/home/sdhammo/openmpi/openmpi-3.1.0/test/class'
   PASS: ompi_rb_tree
   PASS: opal_bitmap
   PASS: opal_hash_table
   PASS: opal_proc_table
   PASS: opal_tree
   PASS: opal_list
   PASS: opal_value_array
   PASS: opal_pointer_array
   PASS: opal_lifo
   

   Output from Top:

   20   0   73280   4224   2560 S 800.0  0.0  17:22.94 lt-opal_fifo

   -- 
   Si Hammond

   Scalable Computer Architectures
   Sandia National Laboratories, NM, USA
   [Sent from remote connection, excuse typos]




--
Jeff Squyres
jsquy...@cisco.com

Re: [OMPI users] Verbose output for MPI

2018-07-04 Thread Nathan Hjelm via users
--mca pmix_base_verbose 100
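
For example (a hypothetical invocation; substitute your own executable and 
process count):

  mpirun --mca pmix_base_verbose 100 -np 2 ./your_app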

> On Jul 4, 2018, at 9:15 AM, Maksym Planeta  
> wrote:
> 
> Hello,
> 
> I am having trouble figuring out how to configure verbose output properly. 
> There is a call to pmix_output_verbose in 
> opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp.c in function try_connect:
> 
>pmix_output_verbose(2, pmix_ptl_base_framework.framework_output,
>"pmix:tcp try connect to %s",
>mca_ptl_tcp_component.super.uri);
> 
> I'm confident that the control flow goes through this function call, because 
> I see a log message from line 692:
> 
> PMIX ERROR: ERROR STRING NOT FOUND in file ptl_tcp.c at line 692
> 
> But my attempts to configure mca parameters properly failed.
> 
> Could you help me with the exact parameters to force the pmix_output_verbose 
> call to be active?
> 
> -- 
> Regards,
> Maksym Planeta


Re: [OMPI users] MPI_Ialltoallv

2018-07-06 Thread Nathan Hjelm via users
No, that's a bug. Please open an issue on GitHub and we will fix it shortly.

Thanks for reporting this issue.

-Nathan
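
For reference, a minimal sketch of the overflow Clyde describes below (hypothetical 
sizes and variable names, not the actual nbc_internal.h code):

#include <complex.h>
#include <stdio.h>

int main(void) {
    /* hypothetical sizes: many rows of complex floats per rank */
    int num_rows = 3000;
    int num_cols = 131072;

    /* forcing the computation into int arithmetic overflows (wraps negative
       on typical platforms), which is what a malloc(int) path would see ... */
    int bad_bytes = num_rows * num_cols * (int) sizeof(float complex);

    /* ... while the same computation carried out in size_t does not */
    size_t good_bytes = (size_t) num_rows * num_cols * sizeof(float complex);

    printf("int: %d bytes, size_t: %zu bytes\n", bad_bytes, good_bytes);
    return 0;
}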

> On Jul 6, 2018, at 8:08 AM, Stanfield, Clyde 
>  wrote:
> 
> We are using MPI_Ialltoallv for an image processing algorithm. When doing 
> this we pass in an MPI_Type_contiguous with an MPI_Datatype of 
> MPI_C_FLOAT_COMPLEX which ends up being the size of multiple rows of the 
> image (based on the number of nodes used for distribution). In addition 
> sendcounts, sdispls, resvcounts, and rdispls all fit within a signed int. 
> Usually this works without any issues, but when we lower our number of nodes 
> we sometimes see failures.
> 
> What we found is that even though we can fit everything into signed ints, 
> line 528 of nbc_internal.h ends up calling a malloc with an int that appears 
> to be the size of the (num_distributed_rows * num_columns  * 
> sizeof(std::complex)) which in very large cases wraps back to 
> negative.  As a result we end up seeing “Error in malloc()” (line 530 of 
> nbc_internal.h) throughout our output.
> 
> We can get around this issue by ensuring the sum of our contiguous type never 
> exceeds 2GB. However, this was unexpected to us as our understanding was that 
> as long as we can fit all the parts into signed ints we should be able to 
> transfer more than 2GB at a time. Is it intended that MPI_Ialltoallv requires 
> your underlying data to be less than 2GB, or is this an error in how malloc is 
> being called (should it be called with a size_t instead of an int)?
> 
> Thanks,
> Clyde Stanfield
> 
> 
> Clyde Stanfield
> Software Engineer
> 734-480-5100 office
> clyde.stanfi...@mdaus.com
>  
> 
> 
> 
> 
> The information contained in this communication is confidential, is intended 
> only for the use of the recipient(s) named above, and may be legally 
> privileged. If the reader of this message is not the intended recipient, you 
> are hereby notified that any dissemination, distribution, or copying of this 
> communication is strictly prohibited. If you have received this communication 
> in error, please re-send this communication to the sender and delete the 
> original message or any copy of it from your computer system. 

Re: [OMPI users] Seg fault in opal_progress

2018-07-11 Thread Nathan Hjelm via users

It might also be worth testing a master snapshot to see if that fixes the issue. 
There are a couple of fixes being backported from master to v3.0.x and v3.1.x 
now.

-Nathan

On Jul 11, 2018, at 03:16 PM, Noam Bernstein  
wrote:

On Jul 11, 2018, at 11:29 AM, Jeff Squyres (jsquyres) via users 
 wrote:
Ok, that would be great -- thanks.

Recompiling Open MPI with --enable-debug will turn on several debugging/sanity 
checks inside Open MPI, and it will also enable debugging symbols.  Hence, if 
you can get a failure with a debug Open MPI build, it might give you a core 
file that can be used to get a more detailed stack trace, poke around and see 
if there's a NULL pointer somewhere, etc.

I haven’t tried to get a core file yet, but it’s not producing any more info 
from the runtime stack trace, despite configuring with --enable-debug:

Image              PC                Routine            Line        Source
vasp.gamma_para.i  02DCE8C1  Unknown               Unknown  Unknown
vasp.gamma_para.i  02DCC9FB  Unknown               Unknown  Unknown
vasp.gamma_para.i  02D409E4  Unknown               Unknown  Unknown
vasp.gamma_para.i  02D407F6  Unknown               Unknown  Unknown
vasp.gamma_para.i  02CDCED9  Unknown               Unknown  Unknown
vasp.gamma_para.i  02CE3DB6  Unknown               Unknown  Unknown
libpthread-2.12.s  003F8E60F7E0  Unknown               Unknown  Unknown
mca_btl_vader.so   2B1AFA5FAC30  Unknown               Unknown  Unknown
mca_btl_vader.so   2B1AFA5FD00D  Unknown               Unknown  Unknown
libopen-pal.so.40  2B1AE884327C  opal_progress         Unknown  Unknown
mca_pml_ob1.so     2B1AFB855DCE  Unknown               Unknown  Unknown
mca_pml_ob1.so     2B1AFB858305  mca_pml_ob1_send      Unknown  Unknown
libmpi.so.40.10.1  2B1AE823A5DA  ompi_coll_base_al     Unknown  Unknown
mca_coll_tuned.so  2B1AFC6F0842  ompi_coll_tuned_a     Unknown  Unknown
libmpi.so.40.10.1  2B1AE81B66F5  PMPI_Allreduce        Unknown  Unknown
libmpi_mpifh.so.4  2B1AE7F2259B  mpi_allreduce_        Unknown  Unknown
vasp.gamma_para.i  0042D1ED  m_sum_d_                 1300  mpi.F
vasp.gamma_para.i  0089947D  nonl_mp_vnlacc_.R        1754  nonl.F
vasp.gamma_para.i  00972C51  hamil_mp_hamiltmu         825  hamil.F
vasp.gamma_para.i  01BD2608  david_mp_eddav_.R         419  davidson.F
vasp.gamma_para.i  01D2179E  elmin_.R                  424  electron.F
vasp.gamma_para.i  02B92452  vamp_IP_electroni        4783  main.F
vasp.gamma_para.i  02B6E173  MAIN__                   2800  main.F
vasp.gamma_para.i  0041325E  Unknown               Unknown  Unknown
libc-2.12.so       003F8E21ED1D  __libc_start_main     Unknown  Unknown
vasp.gamma_para.i  00413169  Unknown               Unknown  Unknown

This is the configure line that was supposedly used to create the library:
  ./configure --prefix=/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080 
--with-tm=/usr/local/torque --enable-mpirun-prefix-by-default --with-verbs=/usr 
--with-verbs-libdir=/usr/lib64 --enable-debug

Is there any way I can confirm that the version of the openmpi library I think 
I’m using really was compiled with debugging?

Noam





Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil


Re: [OMPI users] Seg fault in opal_progress

2018-07-13 Thread Nathan Hjelm via users
Please give master a try. This looks like another signature of running out of 
space for shared memory buffers.

-Nathan

> On Jul 13, 2018, at 6:41 PM, Noam Bernstein  
> wrote:
> 
> Just to summarize for the list.  With Jeff’s prodding I got it generating 
> core files with the debug (and mem-debug) version of openmpi, and below is 
> the kind of stack trace I’m getting from gdb.  It looks slightly different 
> when I use a slightly different implementation that doesn’t use MPI_IN_PLACE, 
> but nearly the same.  The array that’s being summed is not large, 3776 
> doubles.
> 
> 
> #0  0x003160a32495 in raise (sig=6) at 
> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x003160a33bfd in abort () at abort.c:121
> #2  0x02a3903e in for__issue_diagnostic ()
> #3  0x02a3ff66 in for__signal_handler ()
> #4  
> #5  0x2b67a4217029 in mca_btl_vader_check_fboxes () at 
> btl_vader_fbox.h:208
> #6  0x2b67a421962e in mca_btl_vader_component_progress () at 
> btl_vader_component.c:724
> #7  0x2b67934fd311 in opal_progress () at runtime/opal_progress.c:229
> #8  0x2b6792e2f0df in ompi_request_wait_completion (req=0xe863600) at 
> ../ompi/request/request.h:415
> #9  0x2b6792e2f122 in ompi_request_default_wait (req_ptr=0x7ffebdbb8c20, 
> status=0x0) at request/req_wait.c:42
> #10 0x2b6792ed7d5a in ompi_coll_base_allreduce_intra_ring (sbuf=0x1, 
> rbuf=0xeb79ca0, count=3776, dtype=0x2b679317dd40, op=0x2b6793192380, 
> comm=0xe14c9c0, module=0xe14f8b0)
> at base/coll_base_allreduce.c:460
> #11 0x2b67a6ccb3e2 in ompi_coll_tuned_allreduce_intra_dec_fixed 
> (sbuf=0x1, rbuf=0xeb79ca0, count=3776, dtype=0x2b679317dd40, 
> op=0x2b6793192380, comm=0xe14c9c0, module=0xe14f8b0)
> at coll_tuned_decision_fixed.c:74
> #12 0x2b6792e4d9b0 in PMPI_Allreduce (sendbuf=0x1, recvbuf=0xeb79ca0, 
> count=3776, datatype=0x2b679317dd40, op=0x2b6793192380, comm=0xe14c9c0) at 
> pallreduce.c:113
> #13 0x2b6792bb6287 in ompi_allreduce_f (sendbuf=0x1  bounds>,
> recvbuf=0xeb79ca0 
> "\310,&AYI\257\276\031\372\214\223\270-y>\207\066\226\003W\f\240\276\334'}\225\376\336\277>\227§\231",
>  count=0x7ffebdbbc4d4, datatype=0x2b48f5c, op=0x2b48f60,
> comm=0x5a0ae60, ierr=0x7ffebdbb8f60) at pallreduce_f.c:87
> #14 0x0042991b in m_sumb_d (comm=..., vec=..., n=Cannot access memory 
> at address 0x928
> ) at mpi.F:870
> #15 m_sum_d (comm=..., vec=..., n=Cannot access memory at address 0x928
> ) at mpi.F:3184
> #16 0x01b22b83 in david::eddav (hamiltonian=..., p=Cannot access 
> memory at address 0x1
> ) at davidson.F:779
> #17 0x01c6ef0e in elmin (hamiltonian=..., kineden=Cannot access 
> memory at address 0x19
> ) at electron.F:424
> #18 0x02a108b2 in electronic_optimization () at main.F:4783
> #19 0x029ec5d3 in vamp () at main.F:2800
> #20 0x004100de in main ()
> #21 0x003160a1ed1d in __libc_start_main (main=0x4100b0 , argc=1, 
> ubp_av=0x7ffebdbc5e38, init=, fini= out>, rtld_fini=,
> stack_end=0x7ffebdbc5e28) at libc-start.c:226
> #22 0x0040ffe9 in _start ()
> 

Re: [OMPI users] local communicator and crash of the code

2018-08-03 Thread Nathan Hjelm via users
If you are trying to create a communicator containing all node-local processes 
then use MPI_Comm_split_type. 
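
A minimal sketch in C (variable names are illustrative only):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* one communicator per node, containing all ranks that share that node */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    printf("node-local rank %d of %d\n", node_rank, node_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}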

> On Aug 3, 2018, at 12:24 PM, Diego Avesani  wrote:
> 
> Dear all,
> probably I have found the error.
> Let me check. Probably I have not properly set up the colors.
> 
> Thanks a lot,
> I hope that you have not spent too much time on me,
> I will let you know if that was the problem.
> 
> Thanks again
> 
> Diego
> 
> 
>> On 3 August 2018 at 19:57, Diego Avesani  wrote:
>> Dear R, Dear all,
>> 
>> I do not know. 
>> I have isolated the issue. It seems that I have some problem with:
>>   CALL 
>> MPI_COMM_SPLIT(MPI_COMM_WORLD,colorl,MPIworld%rank,MPI_LOCAL_COMM,MPIworld%iErr)
>>   CALL MPI_COMM_RANK(MPI_LOCAL_COMM, MPIlocal%rank,MPIlocal%iErr)
>>   CALL MPI_COMM_SIZE(MPI_LOCAL_COMM, MPIlocal%nCPU,MPIlocal%iErr) 
>> 
>> OpenMPI seems unable to create MPIlocal%rank properly.
>> 
>> What could it be? A bug?
>> 
>> thanks again
>> 
>> Diego
>> 
>> 
>>> On 3 August 2018 at 19:47, Ralph H Castain  wrote:
>>> Those two command lines look exactly the same to me - what am I missing?
>>> 
>>> 
 On Aug 3, 2018, at 10:23 AM, Diego Avesani  wrote:
 
 Dear all,
 
 I am experiencing a strange error.
 
 In my code I use three group communications:
 MPI_COMM_WORLD
 MPI_MASTERS_COMM
 LOCAL_COMM
 
 which have in common some CPUs.
 
 when I run my code as 
  mpirun -np 4 --oversubscribe ./MPIHyperStrem
 
 I have no problem, while when I run it as
  
  mpirun -np 4 --oversubscribe ./MPIHyperStrem
 
 sometimes it crashes and sometimes it does not.
 
 It seems that all is linked to 
 CALL MPI_REDUCE(QTS(tstep,:), QTS(tstep,:), nNode, MPI_DOUBLE_PRECISION, 
 MPI_SUM, 0, MPI_LOCAL_COMM, iErr)
 
 which works on the local communicator.
 
 What do you think? Can you please suggest some debug test?
 Is it a problem related to local communicators?
 
 Thanks
 
 
 
 Diego
 

Re: [OMPI users] Asynchronous progress in 3.1

2018-08-06 Thread Nathan Hjelm via users

It depends on the interconnect you are using. Some transports have async 
progress support but others do not.

-Nathan

On Aug 06, 2018, at 11:29 AM, "Palmer, Bruce J"  wrote:

Hi,

 

Is there anything that can be done to boost asynchronous progress for MPI RMA 
operations in OpenMPI 3.1? I’m trying to use the MPI RMA runtime in Global 
Arrays and it looks like the performance is pretty bad for some of our tests. 
I’ve seen similar results in other MPI implementations (e.g. Intel MPI) and it 
could be fixed by setting some environment variables to boost asynchronous 
progress. Is there something similar for Open MPI or can you use something like 
Casper?

 

Bruce


Re: [OMPI users] know which CPU has the maximum value

2018-08-10 Thread Nathan Hjelm via users
The problem is that minloc and maxloc need to go away. It is better to use a custom op. 
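
A minimal sketch of the custom-op approach in C (the struct, datatype, and 
function names here are made up for illustration):

#include <mpi.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { double val; int rank; } val_rank_t;

/* user-defined reduction: keep whichever entry has the larger val */
static void maxloc_fn(void *inv, void *inoutv, int *len, MPI_Datatype *dt) {
    val_rank_t *in = inv, *inout = inoutv;
    for (int i = 0; i < *len; ++i)
        if (in[i].val > inout[i].val) inout[i] = in[i];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* datatype matching val_rank_t */
    int lens[2] = {1, 1};
    MPI_Aint disps[2] = { offsetof(val_rank_t, val), offsetof(val_rank_t, rank) };
    MPI_Datatype types[2] = {MPI_DOUBLE, MPI_INT}, tmp, vr_type;
    MPI_Type_create_struct(2, lens, disps, types, &tmp);
    MPI_Type_create_resized(tmp, 0, sizeof(val_rank_t), &vr_type);
    MPI_Type_commit(&vr_type);
    MPI_Type_free(&tmp);

    MPI_Op op;
    MPI_Op_create(maxloc_fn, 1 /* commutative */, &op);

    val_rank_t mine = { (double)((rank * 7) % 5), rank }, best;
    MPI_Allreduce(&mine, &best, 1, vr_type, op, MPI_COMM_WORLD);
    if (rank == 0)
        printf("max value %f is on rank %d\n", best.val, best.rank);

    MPI_Op_free(&op);
    MPI_Type_free(&vr_type);
    MPI_Finalize();
    return 0;
}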

> On Aug 10, 2018, at 9:36 AM, George Bosilca  wrote:
> 
> You will need to create a special variable that holds 2 entries, one for the 
> max operation (with whatever type you need) and an int for the rank of the 
> process. The MAXLOC is described on the OMPI man page [1] and you can find an 
> example on how to use it on the MPI Forum [2].
> 
> George.
> 
> 
> [1] https://www.open-mpi.org/doc/v2.0/man3/MPI_Reduce.3.php
> [2] https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node79.html
> 
>> On Fri, Aug 10, 2018 at 11:25 AM Diego Avesani  
>> wrote:
>>  Dear all,
>> I have probably understood.
>> The trick is to use a real vector and to also store the rank.
>> 
>> Have I understood correctly?
>> thanks
>> 
>> Diego
>> 
>> 
>>> On 10 August 2018 at 17:19, Diego Avesani  wrote:
>>> Dear all,
>>> I do not understand how MPI_MINLOC works. It seems to locate the maximum in a 
>>> vector and not the CPU to which the value belongs.
>>> 
>>> @ray: and if two have the same value?
>>> 
>>> thanks 
>>> 
>>> 
>>> Diego
>>> 
>>> 
 On 10 August 2018 at 17:03, Ray Sheppard  wrote:
 As a dumb scientist, I would just bcast the value I get back to the group 
 and ask whoever owns it to kindly reply back with its rank.
  Ray
 
 
> On 8/10/2018 10:49 AM, Reuti wrote:
> Hi,
> 
>> Am 10.08.2018 um 16:39 schrieb Diego Avesani :
>> 
>> Dear all,
>> 
>> I have a problem:
>> In my parallel program each CPU computes a value, let's say eff.
>> 
>> First of all, I would like to know the maximum value. This for me is 
>> quite simple,
>> I apply the following:
>> 
>> CALL MPI_ALLREDUCE(eff, effmaxWorld, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 
>> MPI_MASTER_COMM, MPIworld%iErr)
> Would MPI_MAXLOC be sufficient?
> 
> -- Reuti
> 
> 
>> However, I would also like to know to which CPU that value belongs. Is 
>> that possible?
>> 
>> I have set up a strange procedure, but it works only when all the CPUs 
>> have different values and fails when two of them have the same eff value.
>> 
>> Is there any intrinsic MPI procedure?
>> Alternatively,
>> do you have some idea?
>> 
>> really, really thanks.
>> Diego
>> 
>> 
>> Diego
>> 

Re: [OMPI users] know which CPU has the maximum value

2018-08-10 Thread Nathan Hjelm via users
They do not fit with the rest of the predefined operations (which operate on a 
single basic type) and can easily be implemented as user-defined operations with 
the same performance. Add to that the fixed number of tuple types, the fact that 
some of them are non-contiguous (MPI_SHORT_INT), plus the terrible names. If I 
could kill them in MPI-4 I would. 

> On Aug 10, 2018, at 9:47 AM, Diego Avesani  wrote:
> 
> Dear all,
> I have just implemented MAXLOC; why should it go away?
> It seems to work pretty well.
> 
> thanks
> 
> Diego
> 
> 
>> On 10 August 2018 at 17:39, Nathan Hjelm via users 
>>  wrote:
>> The problem is minloc and maxloc need to go away. better to use a custom op. 
>> 
>>> On Aug 10, 2018, at 9:36 AM, George Bosilca  wrote:
>>> 
>>> You will need to create a special variable that holds 2 entries, one for 
>>> the max operation (with whatever type you need) and an int for the rank of 
>>> the process. The MAXLOC is described on the OMPI man page [1] and you can 
>>> find an example on how to use it on the MPI Forum [2].
>>> 
>>> George.
>>> 
>>> 
>>> [1] https://www.open-mpi.org/doc/v2.0/man3/MPI_Reduce.3.php
>>> [2] https://www.mpi-forum.org/docs/mpi-1.1/mpi-11-html/node79.html
>>> 
>>>> On Fri, Aug 10, 2018 at 11:25 AM Diego Avesani  
>>>> wrote:
>>>>  Dear all,
>>>> I have probably understood.
>>>> The trick is to use a real vector and to memorize also the rank.
>>>> 
>>>> Have I understood correctly?
>>>> thanks
>>>> 
>>>> Diego
>>>> 
>>>> 
>>>>> On 10 August 2018 at 17:19, Diego Avesani  wrote:
>>>>> Deal all,
>>>>> I do not understand how MPI_MINLOC works. it seem locate the maximum in a 
>>>>> vector and not the CPU to which the valur belongs to.
>>>>> 
>>>>> @ray: and if two has the same value?
>>>>> 
>>>>> thanks 
>>>>> 
>>>>> 
>>>>> Diego
>>>>> 
>>>>> 
>>>>>> On 10 August 2018 at 17:03, Ray Sheppard  wrote:
>>>>>> As a dumb scientist, I would just bcast the value I get back to the 
>>>>>> group and ask whoever owns it to kindly reply back with its rank.
>>>>>>  Ray
>>>>>> 
>>>>>> 
>>>>>>> On 8/10/2018 10:49 AM, Reuti wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>>> Am 10.08.2018 um 16:39 schrieb Diego Avesani :
>>>>>>>> 
>>>>>>>> Dear all,
>>>>>>>> 
>>>>>>>> I have a problem:
>>>>>>>> In my parallel program each CPU compute a value, let's say eff.
>>>>>>>> 
>>>>>>>> First of all, I would like to know the maximum value. This for me is 
>>>>>>>> quite simple,
>>>>>>>> I apply the following:
>>>>>>>> 
>>>>>>>> CALL MPI_ALLREDUCE(eff, effmaxWorld, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 
>>>>>>>> MPI_MASTER_COMM, MPIworld%iErr)
>>>>>>> Would MPI_MAXLOC be sufficient?
>>>>>>> 
>>>>>>> -- Reuti
>>>>>>> 
>>>>>>> 
>>>>>>>> However, I would like also to know to which CPU that value belongs. Is 
>>>>>>>> it possible?
>>>>>>>> 
>>>>>>>> I have set-up a strange procedure but it works only when all the CPUs 
>>>>>>>> has different values but fails when two of then has the same eff value.
>>>>>>>> 
>>>>>>>> Is there any intrinsic MPI procedure?
>>>>>>>> in anternative,
>>>>>>>> do you have some idea?
>>>>>>>> 
>>>>>>>> really, really thanks.
>>>>>>>> Diego
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Diego
>>>>>>>> 

Re: [OMPI users] MPI_MAXLOC problems

2018-08-28 Thread Nathan Hjelm via users


Yup. That is the case for all composed datatypes, which is what the tuple types 
are: predefined composed datatypes.

-Nathan

On Aug 28, 2018, at 02:35 PM, "Jeff Squyres (jsquyres) via users" 
 wrote:

I think Gilles is right: remember that datatypes like MPI_2DOUBLE_PRECISION are 
actually 2 values. So if you want to send 1 pair of double precision values 
with MPI_2DOUBLE_PRECISION, then your count is actually 1.
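
A small C analogue of this count semantics, using the C pair type MPI_DOUBLE_INT 
(illustrative only, not Diego's Fortran code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one (value, rank) pair, so count is 1 even though two scalars travel */
    struct { double val; int rank; } in, out;
    in.val  = (double)(rank % 3);   /* stand-in for the value being maximized */
    in.rank = rank;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max %f on rank %d\n", out.val, out.rank);
    MPI_Finalize();
    return 0;
}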


On Aug 22, 2018, at 8:02 AM, Gilles Gouaillardet 
 wrote:

Diego,

Try calling allreduce with count=1

Cheers,

Gilles

On Wednesday, August 22, 2018, Diego Avesani  wrote:
Dear all,

I am going to start the discussion about MPI_MAXLOC again. We had one a couple 
of weeks ago with George, Ray, Nathan, Jeff S, Jeff S., Gus.

This is because I have a problem. I have two groups and two communicators.
The first one takes care of computing the maximum value and finding which processor it 
belongs to:

nPart = 100

IF(MPI_COMM_NULL .NE. MPI_MASTER_COMM)THEN

CALL MPI_ALLREDUCE( EFFMAX, EFFMAXW, 2, MPI_2DOUBLE_PRECISION, MPI_MAXLOC, 
MPI_MASTER_COMM,MPImaster%iErr )
whosend = INT(EFFMAXW(2))
gpeff = EFFMAXW(1)
CALL MPI_BCAST(whosend,1,MPI_INTEGER,whosend,MPI_MASTER_COMM,MPImaster%iErr)

ENDIF

If I perform this, the program sets one variable to zero, specifically nPart.

if I print:

IF(MPI_COMM_NULL .NE. MPI_MASTER_COMM)THEN
WRITE(*,*) MPImaster%rank,nPart
ELSE
WRITE(*,*) MPIlocal%rank,nPart
ENDIF

I get:

1 2
1 2
3 2
3 2
2 2
2 2
1 2
1 2
3 2
3 2
2 2
2 2


1 0
1 0
0 0
0 0

This seems like a typical memory allocation problem.

What do you think?

Thanks for any kind of help.




Diego

--
Jeff Squyres
jsquy...@cisco.com

Re: [OMPI users] pt2pt osc required for single-node runs?

2018-09-06 Thread Nathan Hjelm via users

You can either move to MPI_Win_allocate or try the v4.0.x snapshots. I will 
look at bringing the btl/vader support for osc/rdma back to v3.1.x. osc/pt2pt 
will probably never become truly thread safe.

-Nathan
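
A minimal sketch of the MPI_Win_allocate route (illustrative; the buffer size and 
names are made up):

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* let MPI allocate the window memory instead of wrapping an existing
       buffer with MPI_Win_create */
    const int count = 1024;
    double *buf;
    MPI_Win win;
    MPI_Win_allocate(count * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);

    /* ... RMA epochs and communication on win ... */

    MPI_Win_free(&win);   /* also releases buf */
    MPI_Finalize();
    return 0;
}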

On Sep 06, 2018, at 08:34 AM, Joseph Schuchart  wrote:

All,

I installed Open MPI 3.1.2 on my laptop today (up from 3.0.0, which 
worked fine) and ran into the following error when trying to create a 
window:


```
--
The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this 
release.

Workarounds are to run on a single node, or to use a system with an RDMA
capable network such as Infiniband.
--
[beryl:13894] *** An error occurred in MPI_Win_create
[beryl:13894] *** reported by process [2678849537,0]
[beryl:13894] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[beryl:13894] *** MPI_ERR_WIN: invalid window
[beryl:13894] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[beryl:13894] *** and potentially your MPI job)
```

I remember seeing this announced in the release notes. I wonder, 
however, why the pt2pt component is required for a run on a single node 
(as suggested by the error message). I tried to disable the pt2pt 
component, which gives a similar error but without the message about the 
pt2pt component:


```
$ mpirun -n 4 --mca osc ^pt2pt ./a.out
[beryl:13738] *** An error occurred in MPI_Win_create
[beryl:13738] *** reported by process [2621964289,0]
[beryl:13738] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[beryl:13738] *** MPI_ERR_WIN: invalid window
[beryl:13738] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[beryl:13738] *** and potentially your MPI job)
```

Is this a known issue with v3.1.2? Is there a way to get more 
information about what is going wrong in the second case? Is this the 
right way to disable the pt2pt component?


Cheers,
Joseph

Re: [OMPI users] [open-mpi/ompi] vader compile issue (#5814)

2018-10-02 Thread Nathan Hjelm via users
hmm. Add

#include 

to the test and try it again.

-Nathan
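
For reference, a C11 check along these lines (a guess reconstructed from the 
compiler errors quoted below, not necessarily Nathan's exact test; the header 
names in his message were stripped by the archive):

#include <stdatomic.h>
#include <stdint.h>

_Atomic intptr_t test = 0;

int main(void) {
    intptr_t x = atomic_fetch_add(&test, 1);
    return (int) x;
}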

> On Oct 2, 2018, at 12:41 AM, Siegmar Gross 
>  wrote:
> 
> Hi Jeff, hi Nathan,
> 
> the compilers (Sun C 5.15, Sun C 5.14, Sun C 5.13) don't like the code.
> 
> loki tmp 110 cc -V
> cc: Studio 12.6 Sun C 5.15 Linux_i386 2017/05/30
> loki tmp 111 \cc -std=c11 atomic_test.c
> "atomic_test.c", line 5: warning: no explicit type given
> "atomic_test.c", line 5: syntax error before or at: test
> "atomic_test.c", line 8: undefined symbol: test
> "atomic_test.c", line 8: undefined symbol: x
> cc: acomp failed for atomic_test.c
> loki tmp 112
> 
> 
> loki tmp 111 cc -V
> cc: Studio 12.5 Sun C 5.14 Linux_i386 2016/05/31
> loki tmp 112 \cc -std=c11 atomic_test.c
> "atomic_test.c", line 5: warning: no explicit type given
> "atomic_test.c", line 5: syntax error before or at: test
> "atomic_test.c", line 8: undefined symbol: test
> "atomic_test.c", line 8: undefined symbol: x
> cc: acomp failed for atomic_test.c
> loki tmp 113
> 
> 
> loki tmp 108 cc -V
> cc: Sun C 5.13 Linux_i386 2014/10/20
> loki tmp 109 \cc -std=c11 atomic_test.c
> "atomic_test.c", line 2: cannot find include file: 
> "atomic_test.c", line 5: warning: _Atomic is a keyword in ISO C11
> "atomic_test.c", line 5: undefined symbol: _Atomic
> "atomic_test.c", line 5: syntax error before or at: intptr_t
> "atomic_test.c", line 6: undefined symbol: intptr_t
> "atomic_test.c", line 8: undefined symbol: test
> "atomic_test.c", line 8: undefined symbol: x
> cc: acomp failed for atomic_test.c
> loki tmp 110
> 
> 
> I have attached the file config.log.gist from my master build, although I
> didn't know what the gist is. Let me know if you need something different
> from that file. By the way, I was able to build the upcoming version 4.0.0.
> 
> loki openmpi-v4.0.x-201809290241-a7e275c-Linux.x86_64.64_cc 124 grep Error 
> log.*
> log.make-install.Linux.x86_64.64_cc: /usr/bin/install -c -m 644 
> mpi/man/man3/MPI_Compare_and_swap.3 mpi/man/man3/MPI_Dims_create.3 
> mpi/man/man3/MPI_Dist_graph_create.3 
> mpi/man/man3/MPI_Dist_graph_create_adjacent.3 
> mpi/man/man3/MPI_Dist_graph_neighbors.3 
> mpi/man/man3/MPI_Dist_graph_neighbors_count.3 
> mpi/man/man3/MPI_Errhandler_create.3 mpi/man/man3/MPI_Errhandler_free.3 
> mpi/man/man3/MPI_Errhandler_get.3 mpi/man/man3/MPI_Errhandler_set.3 
> mpi/man/man3/MPI_Error_class.3 mpi/man/man3/MPI_Error_string.3 
> mpi/man/man3/MPI_Exscan.3 mpi/man/man3/MPI_Iexscan.3 
> mpi/man/man3/MPI_Fetch_and_op.3 mpi/man/man3/MPI_File_c2f.3 
> mpi/man/man3/MPI_File_call_errhandler.3 mpi/man/man3/MPI_File_close.3 
> mpi/man/man3/MPI_File_create_errhandler.3 mpi/man/man3/MPI_File_delete.3 
> mpi/man/man3/MPI_File_f2c.3 mpi/man/man3/MPI_File_get_amode.3 
> mpi/man/man3/MPI_File_get_atomicity.3 mpi/man/man3/MPI_File_get_byte_offset.3 
> mpi/man/man3/MPI_File_get_errhandler.3 mpi/man/man3/MPI_File_get_group.3 
> mpi/man/man3/MPI_File_get_info.3 mpi/man/man3/MPI_File_get_position.3 
> mpi/man/man3/MPI_File_get_position_shared.3 mpi/man/man3/MPI_File_get_size.3 
> mpi/man/man3/MPI_File_get_type_extent.3 mpi/man/man3/MPI_File_get_view.3 
> mpi/man/man3/MPI_File_iread.3 mpi/man/man3/MPI_File_iread_at.3 
> mpi/man/man3/MPI_File_iread_all.3 mpi/man/man3/MPI_File_iread_at_all.3 
> mpi/man/man3/MPI_File_iread_shared.3 mpi/man/man3/MPI_File_iwrite.3 
> mpi/man/man3/MPI_File_iwrite_at.3 mpi/man/man3/MPI_File_iwrite_all.3 
> '/usr/local/openmpi-4.0.0_64_cc/share/man/man3'
> log.make.Linux.x86_64.64_cc:  GENERATE mpi/man/man3/MPI_Error_class.3
> log.make.Linux.x86_64.64_cc:  GENERATE mpi/man/man3/MPI_Error_string.3
> loki openmpi-v4.0.x-201809290241-a7e275c-Linux.x86_64.64_cc 125
> 
> 
> 
> Best regards and thank you very much for your help
> 
> Siegmar
> 
> 
>> On 10/01/18 22:07, Jeff Squyres wrote:
>> @siegmargross  Nathan posted a sample 
>> program (via editing his prior comment), so you didn't get the mail about 
>> it. Can you check #5814 (comment) 
>>  and 
>> compile/run the sample program he proposed and see what happens?
>> —
>> You are receiving this because you were mentioned.
>> Reply to this email directly, view it on GitHub 
>> , or 
>> mute the thread 
>> .
> 

Re: [OMPI users] [open-mpi/ompi] vader compile issue (#5814)

2018-10-02 Thread Nathan Hjelm via users
Definitely a compiler bug. I opened a PR to work around it and posted a question 
on the Oracle forums.

-Nathan

On Oct 02, 2018, at 12:48 AM, Siegmar Gross  wrote:

[Siegmar's message quoted here is identical to his earlier post in this thread: 
the Sun C 5.15/5.14/5.13 failures compiling atomic_test.c, the grep Error output 
from the v4.0.x build, and the attached config.log from the master build.]

The attached config.log begins:

This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.

It was created by Open MPI configure master-201809290304-73075b8, which was
generated by GNU Autoconf 2.69.  Invocation command line was

  $ ../openmpi-master-201809290304-73075b8/configure 
--prefix=/usr/local/openmpi-master_64_cc 
--libdir=/usr/local/openmpi-master_64_cc/lib64 
--with-jdk-bindir=/usr/local/jdk-10.0.1/bin 
--with-jdk-headers=/usr/local/jdk-10.0.1/include 
JAVA_HOME=/usr/local/jdk-10.0.1 LDFLAGS=-m64 -mt -Wl,-z -Wl,noexecstack 
-L/usr/local/lib64 CC=cc CXX=CC FC=f95 CFLAGS=-m64 -mt CXXFLAGS=-m64 
FCFLAGS=-m64 CPP=cpp CXXCPP=cpp --disab

Re: [OMPI users] [version 2.1.5] invalid memory reference

2018-10-11 Thread Nathan Hjelm via users

Those features (MPI_LB/MPI_UB/MPI_Type_struct) were removed in MPI-3.0. It is 
fairly straightforward to update the code to be MPI-3.0 compliant.

MPI_Type_struct -> MPI_Type_create_struct

MPI_LB/MPI_UB -> MPI_Type_create_resized

Example:

types[0] = MPI_LB;
disp[0] = my_lb;
lens[0] = 1;
types[1] = MPI_INT;
disp[1] = disp1;
lens[1] = count;
types[2] = MPI_UB;
disp[2] = my_ub;
lens[2] = 1;

MPI_Type_struct (3, lens, disp, types, &new_type);


becomes:

types[0] = MPI_INT;
disp[0] = disp1;
lens[0] = count;

MPI_Type_create_struct (1, lens, disp, types, &tmp_type);
/* note: the third argument is the new extent, not the upper bound */
MPI_Type_create_resized (tmp_type, my_lb, my_ub - my_lb, &new_type);
MPI_Type_free (&tmp_type);
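
(In either variant, the new type still needs MPI_Type_commit (&new_type) before 
it is used for communication; the snippets above show only the construction step.)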


-Nathan

On Oct 11, 2018, at 09:00 AM, Patrick Begou 
 wrote:

Hi Jeff and George

thanks for your answer. I found some time to work again on this problem and I 
have downloaded OpenMPI 4.0.0rc4. It compiles without any problem, but building 
the first dependency of my code (hdf5 1.8.12) with this version 4 fails:


../../src/H5Smpio.c:355:28: error: 'MPI_LB' undeclared (first use in this 
function); did you mean 'MPI_IO'?

 old_types[0] = MPI_LB;
    ^~
    MPI_IO
../../src/H5Smpio.c:355:28: note: each undeclared identifier is reported only 
once for each function it appears in
../../src/H5Smpio.c:357:28: error: 'MPI_UB' undeclared (first use in this 
function); did you mean 'MPI_LB'?

 old_types[2] = MPI_UB;
    ^~
    MPI_LB
../../src/H5Smpio.c:365:24: warning: implicit declaration of function 
'MPI_Type_struct'; did you mean 'MPI_Type_size_x'? [-Wimplicit-function-declaration]

 mpi_code = MPI_Type_struct(3,   /* count */
    ^~~
    MPI_Type_size_x

It is not possible for me to use a more recent hdf5 version as the API has 
changed and will not work with the code, even in compatibility mode.


At this time, I'll try version 3 from the git repo if I have the required tools 
available on my server. All prerequisites compile successfully with 3.1.2.


Patrick

--
===
| Equipe M.O.S.T. | |
| Patrick BEGOU | mailto:patrick.be...@grenoble-inp.fr |
| LEGI | |
| BP 53 X | Tel 04 76 82 51 35 |
| 38041 GRENOBLE CEDEX | Fax 04 76 82 52 71 |
===


Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Nathan Hjelm via users



All of this is completely expected. Due to the requirements of the standard it 
is difficult to make use of network atomics even for MPI_Compare_and_swap 
(MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want 
MPI_Fetch_and_op to be fast set this MCA parameter:


osc_rdma_acc_single_intrinsic=true


Shared lock is slower than an exclusive lock because there is an extra lock 
step as part of the accumulate (it isn't needed if there is an exclusive lock). 
When setting the above parameter you are telling the implementation that you 
will only be using a single count and we can optimize that with the hardware. 
The RMA working group is working on an info key that will essentially do the 
same thing.
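
For reference, the exclusive-lock variant of the loop in Joseph's snippet below 
would look roughly like this (a sketch that assumes the same win, val, res, and 
NUM_REPS as in that snippet):

MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0 /* target */, 0, win);
for (size_t i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(0, win);
}
MPI_Win_unlock(0, win);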


Note the above parameter won't help you with IB if you are using UCX unless you 
set this (master only right now):



btl_uct_transports=dc_mlx5

btl=self,vader,uct

osc=^ucx




Though there may be a way to get osc/ucx to enable the same sort of 
optimization. I don't know.



-Nathan



On Nov 06, 2018, at 09:38 AM, Joseph Schuchart  wrote:


All,

I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:

```
if (rank == 1) {
uint64_t res, val;
for (size_t i = 0; i < NUM_REPS; ++i) {
MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
MPI_Win_flush(target, win);
}
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
have tried to confirm that the operations are done in hardware by
letting rank 0 sleep for a while and ensuring that communication
progresses). Of particular interest for my use-case is fetch_op but I am
including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies
than operations on 32bit
b) Using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate
(compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us


The difference between exclusive and shared lock is about the same as
with IB and the latencies for 32bit vs 64bit are roughly the same
(except for compare_exchange, it seems).

So my question is: is this to be expected? Is the higher latency when
using a shared lock caused by an internal lock being acquired because
the hardware operations are not actually atomic?

I'd be grateful for any insight on this.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Nathan Hjelm via users
Quick scan of the program and it looks ok to me. I will dig deeper and see if I 
can determine the underlying cause.

What Open MPI version are you using?

-Nathan

On Nov 08, 2018, at 11:10 AM, Joseph Schuchart  wrote:

While using the mca parameter in a real application I noticed a strange effect, 
which took me a while to figure out: It appears that on the Aries network the 
accumulate operations are not atomic anymore. I am attaching a test program that 
shows the problem: all but one processes continuously increment a counter while 
rank 0 is continuously subtracting a large value and adding it again, eventually 
checking for the correct number of increments. Without the mca parameter the 
test at the end succeeds as all increments are accounted for:

```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```

When setting the mca parameter the test fails with garbage in the result:

```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1 ./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main: Assertion `sum == 1000*(comm_size-1)' failed.
```

All processes perform only MPI_Fetch_and_op in combination with MPI_SUM so I 
assume that the test in combination with the mca flag is correct. I cannot 
reproduce this issue on our IB cluster.

Is that an issue in Open MPI or is there some problem in the test case that I am 
missing?

Thanks in advance,
Joseph

On 11/6/18 1:15 PM, Joseph Schuchart wrote:
[quoted text identical to the earlier messages in this thread; truncated in the 
archive]

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Nathan Hjelm via users

Ok, then it sounds like a regression. I will try to track it down today or 
tomorrow.


-Nathan

On Nov 08, 2018, at 01:41 PM, Joseph Schuchart  wrote:


Sorry for the delay, I wanted to make sure that I test the same version
on both Aries and IB: git master bbe5da4. I realized that I had
previously tested with 3.1.3 on the IB cluster, which ran fine. If I use
the same version I run into the same problem on both systems (with --mca
btl_openib_allow_ib true --mca osc_rdma_acc_single_intrinsic true). I
have not tried using UCX for this.

Joseph

On 11/8/18 1:20 PM, Nathan Hjelm via users wrote:

Quick scan of the program and it looks ok to me. I will dig deeper and
see if I can determine the underlying cause.


What Open MPI version are you using?


-Nathan


On Nov 08, 2018, at 11:10 AM, Joseph Schuchart  wrote:


While using the mca parameter in a real application I noticed a strange
effect, which took me a while to figure out: It appears that on the
Aries network the accumulate operations are not atomic anymore. I am
attaching a test program that shows the problem: all but one processes
continuously increment a counter while rank 0 is continuously
subtracting a large value and adding it again, eventually checking for
the correct number of increments. Without the mca parameter the test at
the end succeeds as all increments are accounted for:


```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```


When setting the mca parameter the test fails with garbage in the result:


```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1
./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main:
Assertion `sum == 1000*(comm_size-1)' failed.
```


All processes perform only MPI_Fetch_and_op in combination with MPI_SUM
so I assume that the test in combination with the mca flag is correct. I
cannot reproduce this issue on our IB cluster.


Is that an issue in Open MPI or is there some problem in the test case
that I am missing?


Thanks in advance,
Joseph




On 11/6/18 1:15 PM, Joseph Schuchart wrote:
Thanks a lot for the quick reply, setting
osc_rdma_acc_single_intrinsic=true does the trick for both shared and
exclusive locks and brings it down to <2us per operation. I hope that
the info key will make it into the next version of the standard, I
certainly have use for it :)


Cheers,
Joseph


On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:


All of this is completely expected. Due to the requirements of the
standard it is difficult to make use of network atomics even for
MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the
party). If you want MPI_Fetch_and_op to be fast set this MCA parameter:


osc_rdma_acc_single_intrinsic=true


Shared lock is slower than an exclusive lock because there is an extra
lock step as part of the accumulate (it isn't needed if there is an
exclusive lock). When setting the above parameter you are telling the
implementation that you will only be using a single count and we can
optimize that with the hardware. The RMA working group is working on
an info key that will essentially do the same thing.


Note the above parameter won't help you with IB if you are using UCX
unless you set this (master only right now):


btl_uct_transports=dc_mlx5


btl=self,vader,uct


osc=^ucx




Though there may be a way to get osc/ucx to enable the same sort of
optimization. I don't know.




-Nathan




On Nov 06, 2018, at 09:38 AM, Joseph Schuchart 
wrote:


All,


I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:


```
if (rank == 1) {
uint64_t res, val;
for (size_t i = 0; i < NUM_REPS; ++i) {
MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
MPI_Win_flush(target, win);
}
}
MPI_Barrier(MPI_COMM_WORLD);
```


Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
have tried to confirm that the operations are done in hardware by
letting rank 0 sleep for a while and ensuring that communication
progresses). Of particular interest for my use-case is fetch_op but
I am
including other operations here nevertheless:


* Linux Cluster, IB QDR *
average of 10 iterations


Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us


Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us


Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us


Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us


There are two interesting observations:
a) operations on 64bit operands gene

Re: [OMPI users] [Open MPI Announce] Open MPI 4.0.0 Released

2018-11-14 Thread Nathan Hjelm via users
I really need to update that wording. It has been a while and the code seems to 
have stabilized. It's quite safe to use and supports some of the latest kernel 
versions.

-Nathan

> On Nov 13, 2018, at 11:06 PM, Bert Wesarg via users 
>  wrote:
> 
> Dear Takahiro,
> On Wed, Nov 14, 2018 at 5:38 AM Kawashima, Takahiro
>  wrote:
>> 
>> XPMEM moved to GitLab.
>> 
>> https://gitlab.com/hjelmn/xpmem
> 
> the first words from the README aren't very pleasant to read:
> 
> This is an experimental version of XPMEM based on a version provided by
> Cray and uploaded to https://code.google.com/p/xpmem. This version supports
> any kernel 3.12 and newer. *Keep in mind there may be bugs and this version
> may cause kernel panics, code crashes, eat your cat, etc.*
> 
> Installing this on my laptop, where I just want to develop with SHMEM,
> it would be a pity to lose work just because of that.
> 
> Best,
> Bert
> 
>> 
>> Thanks,
>> Takahiro Kawashima,
>> Fujitsu
>> 
>>> Hello Bert,
>>> 
>>> What OS are you running on your notebook?
>>> 
>>> If you are running Linux, and you have root access to your system,  then
>>> you should be able to resolve the Open SHMEM support issue by installing
>>> the XPMEM device driver on your system, and rebuilding UCX so it picks
>>> up XPMEM support.
>>> 
>>> The source code is on GitHub:
>>> 
>>> https://github.com/hjelmn/xpmem
>>> 
>>> Some instructions on how to build the xpmem device driver are at
>>> 
>>> https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM
>>> 
>>> You will need to install the kernel source and symbols rpms on your
>>> system before building the xpmem device driver.
>>> 
>>> Hope this helps,
>>> 
>>> Howard
>>> 
>>> 
>>> Am Di., 13. Nov. 2018 um 15:00 Uhr schrieb Bert Wesarg via users <
>>> users@lists.open-mpi.org>:
>>> 
 Hi,
 
 On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce
  wrote:
> 
> The Open MPI Team, representing a consortium of research, academic, and
> industry partners, is pleased to announce the release of Open MPI version
> 4.0.0.
> 
> v4.0.0 is the start of a new release series for Open MPI.  Starting with
> this release, the OpenIB BTL supports only iWarp and RoCE by default.
> Starting with this release,  UCX is the preferred transport protocol
> for Infiniband interconnects. The embedded PMIx runtime has been updated
> to 3.0.2.  The embedded Romio has been updated to 3.2.1.  This
> release is ABI compatible with the 3.x release streams. There have been
 numerous
> other bug fixes and performance improvements.
> 
> Note that starting with Open MPI v4.0.0, prototypes for several
> MPI-1 symbols that were deleted in the MPI-3.0 specification
> (which was published in 2012) are no longer available by default in
> mpi.h. See the README for further details.
> 
> Version 4.0.0 can be downloaded from the main Open MPI web site:
> 
>  https://www.open-mpi.org/software/ompi/v4.0/
> 
> 
> 4.0.0 -- September, 2018
> 
> 
> - OSHMEM updated to the OpenSHMEM 1.4 API.
> - Do not build OpenSHMEM layer when there are no SPMLs available.
>  Currently, this means the OpenSHMEM layer will only build if
>  a MXM or UCX library is found.
 
 so what is the most convenient way to get SHMEM working on a single
 shared memory node (aka. notebook)? I just realized that I don't have
 a SHMEM since Open MPI 3.0. But building with UCX does not help
 either. I tried with UCX 1.4 but Open MPI SHMEM
 still does not work:
 
 $ oshcc -o shmem_hello_world-4.0.0 openmpi-4.0.0/examples/hello_oshmem_c.c
 $ oshrun -np 2 ./shmem_hello_world-4.0.0
 [1542109710.217344] [tudtug:27715:0] select.c:406  UCX  ERROR
 no remote registered memory access transport to tudtug:27716:
 self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
 tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
 mm/posix - Destination is unreachable, cma/cma - no put short
 [1542109710.217344] [tudtug:27716:0] select.c:406  UCX  ERROR
 no remote registered memory access transport to tudtug:27715:
 self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
 tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
 mm/posix - Destination is unreachable, cma/cma - no put short
 [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
 Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
 [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
 Error: add procs FAILED rc=-2
 [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
 Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
 [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
 Error: add procs FAILED rc=-2
 ---

Re: [OMPI users] Hang in mpi on 32-bit

2018-11-26 Thread Nathan Hjelm via users
Can you try configuring with --disable-builtin-atomics and see if that fixes the 
issue for you?

-Nathan

> On Nov 26, 2018, at 9:11 PM, Orion Poplawski  wrote:
> 
> Hello -
> 
>  We are starting to see some mpi processes "hang" (really cpu spin and never 
> complete) on 32 bit architectures on Fedora during package tests. Some 
> examples:
> 
> hpl 2.2 and openmpi 2.1.5 on i686 and arm:
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=31129461
> 
> hdf5 1.8.20 and openmpi 3.1.3 on i686 with the "t_cache" test.
> 
> https://copr-be.cloud.fedoraproject.org/results/@scitech/openmpi3.1/fedora-28-i386/00830432-hdf5/builder-live.log
> 
> I'm at a loss as to how to debug this further.
> 
> 
> 
> -- 
> Orion Poplawski
> Manager of NWRA Technical Systems  720-772-5637
> NWRA, Boulder/CoRA Office FAX: 303-415-9702
> 3380 Mitchell Lane   or...@nwra.com
> Boulder, CO 80301 https://www.nwra.com/
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Querying/limiting OpenMPI memory allocations

2018-12-20 Thread Nathan Hjelm via users
How many nodes are you using? How many processes per node? What kind of 
processor? Open MPI version? 25 GB is several orders of magnitude more memory 
than should be used except at extreme scale (1M+ processes). Also, how are you 
calculating memory usage?

-Nathan

> On Dec 20, 2018, at 4:49 AM, Adam Sylvester  wrote:
> 
> Is there a way at runtime to query OpenMPI to ask it how much memory it's 
> using for internal buffers?  Is there a way at runtime to set a max amount of 
> memory OpenMPI will use for these buffers?  I have an application where for 
> certain inputs OpenMPI appears to be allocating ~25 GB and I'm not accounting 
> for this in my memory calculations (and thus bricking the machine).
> 
> Thanks.
> -Adam
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Increasing OpenMPI RMA win attach region count.

2019-01-09 Thread Nathan Hjelm via users
If you need to support more attachments you can set the value of that variable 
either through the environment:

OMPI_MCA_osc_rdma_max_attach


or on the mpirun command line:

--mca osc_rdma_max_attach


Keep in mind that each attachment may use an underlying hardware resource that 
may be easy to exhaust (hence the low default limit). It is recommended to keep 
the total number as small as possible.

-Nathan
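
For illustration, the dynamic-window pattern this limit applies to, as a hedged sketch (the limit of 64 on the command line is a made-up value):

```
/* Sketch: a dynamic window with several attached, non-overlapping regions.
 * Hypothetical run command raising the limit (64 is a made-up value):
 *   mpirun --mca osc_rdma_max_attach 64 -np 2 ./dyn_attach
 */
#include <mpi.h>
#include <stdlib.h>

#define NSEG 48   /* more than the default limit of 32 */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    void *seg[NSEG];
    for (int i = 0; i < NSEG; ++i) {
        seg[i] = malloc(4096);
        /* each non-overlapping region must be attached before remote access */
        MPI_Win_attach(win, seg[i], 4096);
    }

    /* ... RMA communication using addresses exchanged out of band ... */

    for (int i = 0; i < NSEG; ++i) {
        MPI_Win_detach(win, seg[i]);
        free(seg[i]);
    }
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```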

> On Jan 8, 2019, at 9:36 PM, Udayanga Wickramasinghe  wrote:
> 
> Hi,
> I am running into an issue in open-mpi where it crashes abruptly during 
> MPI_WIN_ATTACH. 
> [nid00307:25463] *** An error occurred in MPI_Win_attach
> [nid00307:25463] *** reported by process [140736284524545,140728898420736]
> [nid00307:25463] *** on win rdma window 3
> [nid00307:25463] *** MPI_ERR_RMA_ATTACH: Could not attach RMA segment
> [nid00307:25463] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
> abort,
> [nid00307:25463] ***and potentially your MPI job)
> 
> Looking more into this issue, it seems like open-mpi has a restriction on the 
> maximum number of segments attached to 32. (The MPI 3.0 spec doesn't say 
> a lot about this scenario -- "The argument win must be a window 
> that was created with MPI_WIN_CREATE_DYNAMIC. Multiple (but nonoverlapping) 
> memory regions may be attached to the same window")
> 
> To workaround this, I have temporarily modified the variable 
> mca_osc_rdma_component.max_attach. Is there any way to configure this in 
> open-mpi?
> 
> Thanks
> Udayanga
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce calls?

2019-01-18 Thread Nathan Hjelm via users

Since neither bcast nor reduce acts as a barrier it is possible to run out of 
resources if either of these calls (or both) are used in a tight loop. The sync 
coll component exists for this scenario. You can enable it by  adding the 
following to mpirun (or setting these variables through the environment or a 
file):

--mca coll_sync_priority 100 --mca coll_sync_barrier_after 10


This will effectively throttle the collective calls for you. You can also 
change the reduce to an allreduce.


-Nathan
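
For context, a made-up sketch of the kind of tight collective loop this throttles (the commented barrier shows the equivalent manual workaround):

```
/* Sketch of a tight broadcast/reduce loop that can exhaust resources.
 * Assumed run command using the sync collective component:
 *   mpirun --mca coll_sync_priority 100 --mca coll_sync_barrier_after 10 -np 4 ./loop
 */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 90; ++i) {
        double angle = (double)i;
        MPI_Bcast(&angle, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        double local = angle * rank, sum = 0.0;
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        /* manual alternative to coll/sync: synchronize every N iterations,
         * e.g.:  if ((i % 10) == 9) MPI_Barrier(MPI_COMM_WORLD);
         */
    }

    MPI_Finalize();
    return 0;
}
```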

> On Jan 18, 2019, at 6:31 PM, Jeff Wentworth via users 
>  wrote:
> 
> Greetings everyone,
> 
> I have a scientific code using Open MPI (v3.1.3) that seems to work fine when 
> MPI_Bcast() and MPI_Reduce() calls are well spaced out in time.  Yet if the 
> time between these calls is short, eventually one of the nodes hangs at some 
> random point, never returning from the broadcast or reduce call.  Is there 
> some minimum time between calls that needs to be obeyed in order for Open MPI 
> to process these reliably?
> 
> The reason this has come up is because I am trying to run in a multi-node 
> environment some established acceptance tests in order to verify that the 
> Open MPI configured version of the code yields the same baseline result as 
> the original single node version of the code.  These acceptance tests must 
> pass in order for the code to be considered validated and deliverable to the 
> customer.  One of these acceptance tests that hangs does involve 90 
> broadcasts and 90 reduces in a short period of time (less than .01 cpu sec), 
> as in:
> 
> Broadcast #89 in
>  Broadcast #89 out 8 bytes
>  Calculate angle #89
>  Reduce #89 in
>  Reduce #89 out 208 bytes
> Write result #89 to file on service node
> Broadcast #90 in
>  Broadcast #90 out 8 bytes
>  Calculate angle #89
>  Reduce #90 in
>  Reduce #90 out 208 bytes
> Write result #90 to file on service node
> 
> If I slow down the above acceptance test, for example by running it under 
> valgrind, then it runs to completion and yields the correct result.  So it 
> seems to suggest that something internal to Open MPI is getting swamped.  I 
> understand that these acceptance tests might be pushing the limit, given that 
> they involve so many short calculations combined with frequent, yet tiny, 
> transfers of data among nodes.  
> 
> Would it be worthwhile for me to enforce with some minimum wait time between 
> the MPI calls, say 0.01 or 0.001 sec via nanosleep()?  The only time it would 
> matter would be when acceptance tests are run, as the situation doesn't arise 
> when beefier runs are performed. 
> 
> Thanks.
> 
> jw2002
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Segfault with OpenMPI 4 and dynamic window

2019-02-16 Thread Nathan Hjelm via users
Probably not. I think this is now fixed. Might be worth trying master to 
verify. 

> On Feb 16, 2019, at 7:01 AM, Bart Janssens  wrote:
> 
> Hi Gilles,
> 
> Thanks, that works (I had to put quotes around the ^rdma). Should I file a 
> github issue?
> 
> Cheers,
> 
> Bart
>> On 16 Feb 2019, 14:05 +0100, Gilles Gouaillardet 
>> , wrote:
>> Bart,
>> 
>> It looks like a bug that involves the osc/rdma component.
>> 
>> Meanwhile, you can
>> mpirun --mca osc ^rdma ...
>> 
>> Cheers,
>> 
>> Gilles
>> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Best way to send on mpi c, architecture dependent data type

2019-03-14 Thread Nathan Hjelm via users
Why not just use C99 stdint? That gives you fixed-size types.

-Nathan
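
A short illustrative sketch of that approach (the MPI fixed-width datatypes such as MPI_INT64_T are standard since MPI 2.2):

```
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* int64_t is the same width on x86-64 and ARMv8, and MPI_INT64_T
     * matches it exactly, so no per-architecture typedefs are needed. */
    int64_t value = 42;
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT64_T, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT64_T, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %lld\n", (long long)value);
    }

    MPI_Finalize();
    return 0;
}
```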

> On Mar 14, 2019, at 9:38 AM, George Reeke  wrote:
> 
> On Wed, 2019-03-13 at 22:10 +, Sergio None wrote:
>> Hello.
>> 
>> 
>> I'm using OpenMPI 3.1.3 on x64 CPU  and two ARMv8( Raspberry pi 3).
>> 
>> 
>> But i'm having some issues with data types that are architecture
>> dependent, like 'long'.
>> 
> -trimmed-
> 
> 
>> 
>> So my question is: there any way to pass data that don't depend of
>> architecture? 
> 
> What I do is make a header file I call 'sysdef.h' with #ifdefs for
> all the systems I use that define types like (for signed 32-bit integer)
> #ifdef Intel64
> typedef int si32
> #endif
> #ifdef ARMv8
> typedef long si32
> #endif
> ...etc...
> [And I have a whole bunch of other useful definitions like MAX_SI32,
> LONG_SIZE and stuff like that--above is not an actual code excerpt]
> 
> and then in the makefile put '-DIntel64' or '-DARMv8' or whatever
> I called it in the sysdef.  Then the code should use the typedef names.
> In the MPI_Send, MPI_Recv calls I usually call the type MPI_BYTE and
> give the actual lengths in bytes which I compute once at the time
> of the malloc and store in a global common block.
> George Reeke
> 
> 
> 
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] error "unacceptable operand for unary &" for openmpi-master-201903260242-dfbc144 on Linux with Sun C

2019-03-26 Thread Nathan Hjelm via users
This really looks like a compiler bug. There is no & @ osc_pt2pt.h line 579. 
There is one at line 577 but there is no “unacceptable operand” on that line. 
If I have time this week I will try to find a workaround but it might be worth 
filing a bug with Oracle and see what they say.

-Nathan

> On Mar 26, 2019, at 3:55 AM, Siegmar Gross 
>  wrote:
> 
> Hi,
> 
> I've tried to install openmpi-master-201903260242-dfbc144 on my "SUSE Linux
> Enterprise Server 12.3 with Sun C 5.15 (Oracle Developer Studio 12.6).
> Unfortunately, I still get the following error that I reported some time
> ago: https://github.com/open-mpi/ompi/issues/6180.
> I'm able to build openmpi-v4.0.x-201903220241-97aa434 with the compiler.
> 
> 
> ...
>  CC   osc_pt2pt_request.lo
> "../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt.h",
>  line 579: unacceptable operand for unary &
> "../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt.h",
>  line 579: unacceptable operand for unary &
> "../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt.h",
>  line 579: unacceptable operand for unary &
> "../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt.h",
>  line 579: unacceptable operand for unary &
> "../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt_comm.c",
>  line 80: cannot recover from previous errors
> cc: acomp failed for 
> ../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt_comm.c
> Makefile:1864: recipe for target 'osc_pt2pt_comm.lo' failed
> make[2]: *** [osc_pt2pt_comm.lo] Error 1
> make[2]: *** Waiting for unfinished jobs
> cc: acomp failed for 
> ../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt_module.c
> Makefile:1864: recipe for target 'osc_pt2pt_module.lo' failed
> make[2]: *** [osc_pt2pt_module.lo] Error 1
> "../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt.h",
>  line 579: unacceptable operand for unary &
> cc: acomp failed for 
> ../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt_component.c
> Makefile:1864: recipe for target 'osc_pt2pt_component.lo' failed
> make[2]: *** [osc_pt2pt_component.lo] Error 1
> cc: acomp failed for 
> ../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt_request.c
> "../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt.h",
>  line 579: unacceptable operand for unary &
> Makefile:1864: recipe for target 'osc_pt2pt_request.lo' failed
> make[2]: *** [osc_pt2pt_request.lo] Error 1
> cc: acomp failed for 
> ../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt_data_move.c
> Makefile:1864: recipe for target 'osc_pt2pt_data_move.lo' failed
> make[2]: *** [osc_pt2pt_data_move.lo] Error 1
> cc: acomp failed for 
> ../../../../../openmpi-master-201903260242-dfbc144/ompi/mca/osc/pt2pt/osc_pt2pt_frag.c
> Makefile:1864: recipe for target 'osc_pt2pt_frag.lo' failed
> make[2]: *** [osc_pt2pt_frag.lo] Error 1
> make[2]: Leaving directory 
> '/export2/src/openmpi-master/openmpi-master-201903260242-dfbc144-Linux.x86_64.64_cc/ompi/mca/osc/pt2pt'
> Makefile:3470: recipe for target 'all-recursive' failed
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory 
> '/export2/src/openmpi-master/openmpi-master-201903260242-dfbc144-Linux.x86_64.64_cc/ompi'
> Makefile:1855: recipe for target 'all-recursive' failed
> make: *** [all-recursive] Error 1
> 
> 
> I would be grateful, if somebody can fix the problem. Do you need anything
> else? Thank you very much for any help in advance.
> 
> 
> Kind regards
> 
> Siegmar
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Using strace with Open MPI on Cray

2019-03-30 Thread Nathan Hjelm via users
Add --mca btl ^tcp to your mpirun command line. It shouldn't be used on a Cray.

> On Mar 30, 2019, at 2:00 PM, Christoph Niethammer  wrote:
> 
> Short update:
> 
> The polled file descriptor is related to a socket, which I identified to be 
> the local tcp btl connection ...
> On a Lustre file system the problem does not show up.
> 
> Best
> Christoph
> 
> - Mensaje original -
> De: "niethammer" 
> Para: "Open MPI Users" 
> Enviados: Sábado, 30 de Marzo 2019 10:25:49
> Asunto: [OMPI users] Using strace with Open MPI on Cray
> 
> Hello,
> 
> I was trying to investigate some processes with strace under Open MPI.
> However I have some issues when MPI I/O functionality is included writing 
> data to a NFS file system.
> 
> mpirun -np 2 strace -f ./hello-world mpi-io
> 
> does not return and strace is stuck reporting infinite "poll" calls.
> However, the program works fine without strace.
> 
> I tried with Open MPI 3.x and 4.0.1 switching between ompi and romio on 
> different operating systems (CentOS 7.6, SLES 12).
> 
> I'd appreciate any hints which help me to understand what is going on.
> 
> Best
> Christoph
> 
> --
> 
> Christoph Niethammer
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstrasse 19
> 70569 Stuttgart
> 
> Tel: ++49(0)711-685-87203
> email: nietham...@hlrs.de
> http://www.hlrs.de/people/niethammer
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] mpi_comm_dup + mpi_comm_group Issue

2019-04-02 Thread Nathan Hjelm via users
That is perfectly valid. The MPI processes that make up the group are all part 
of comm world. I would file a bug with Intel MPI.

-Nathan
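
Reduced to a minimal C sketch (an illustration of the Fortran case described below, not the original attachment), the pattern is:

```
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm dup;
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);

    /* group containing the first half of the processes of the duplicate */
    MPI_Group world_group, half_group;
    MPI_Comm_group(dup, &world_group);
    int half = size / 2 > 0 ? size / 2 : 1;
    int range[1][3] = { { 0, half - 1, 1 } };
    MPI_Group_range_incl(world_group, 1, range, &half_group);

    /* valid: the group's members are also members of MPI_COMM_WORLD */
    MPI_Comm half_comm = MPI_COMM_NULL;
    if (rank < half) {
        MPI_Comm_create_group(MPI_COMM_WORLD, half_group, 0, &half_comm);
    }

    if (half_comm != MPI_COMM_NULL) MPI_Comm_free(&half_comm);
    MPI_Group_free(&half_group);
    MPI_Group_free(&world_group);
    MPI_Comm_free(&dup);
    MPI_Finalize();
    return 0;
}
```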

> On Apr 2, 2019, at 7:11 AM, Stella Paronuzzi  
> wrote:
> 
>  Good afternoon, I am attaching a simple fortran code that:
> calls the MPI_INIT
> duplicates the global communicator, via MPI_COMM_DUP
> creates a group with half of processes of the global communicator, via 
> MPI_COMM_GROUP
> finally from this group creates a new communicator via MPI_COMM_CREATE_GROUP
> This last call is done using as a first argument MPI_COMM_WORLD instead of 
> the duplicated communicator used for creating the group.
> With the intel MPI (ifort version 19.0.2.187, but also older) this generates 
> a seg fault, while using openMPI it doesn't.
> 
> Uncommenting line 18 and commenting line 19 works fine with both.
> Now, looking at the documentation, it seems like this code shouldn't work 
> after all, because the 
> 
> "MPI_COMM_DUP creates a new communicator over the same group as comm but with 
> a new context"
> 
> and so it should't be possible to use the MPI_COMM_WORLD as a first argument 
> of the create MPI_COMM_CREATE_GROUP.
> 
> What do you think? In the beginning I was thinking about a bug in the intel 
> MPI, but now I'm simply a bit confused ... 
> 
> Thank you in advance for the help
> 
> Stella
> 
> 
> 
> 
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Issues compiling HPL with OMPIv4.0.0

2019-04-03 Thread Nathan Hjelm via users
Gilles is correct. If mpicc is showing errors like those in your original email 
then it is not invoking a C compiler. C does not have any concept of try or 
catch. No modern C compiler will complain about a variable named “try” as it is 
not a reserved keyword in the C language.

Example:

foo.c:

int try = 0;

gcc --std=c11 -c foo.c

No error


g++ -c foo.c  
foo.c:3:5: error: expected unqualified-id
int try = 0;
^
1 error generated.

-Nathan

> On Apr 3, 2019, at 6:09 PM, Gilles Gouaillardet  wrote:
> 
> Do not get fooled by the symlinks to opal_wrapper !
> 
> opal_wrapper checks how it is invoked (e.g. check argv[0] in main()) and the 
> behavior is different
> 
> if it is invoked as mpicc, mpiCC, mpifort and other
> 
> 
> If the error persists with mpicc, you can manually extract the mpicc command 
> line, and manually run it with the -showme parameter,
> 
> it will show you the full command line (and who knows, mpicc might invoke a 
> C++ compiler after all, and that would be a config issue)
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> On 4/4/2019 7:48 AM, afernan...@odyhpc.com wrote:
>> 
>> Sam and Jeff,
>> 
>> Thank you for your answers. My first attempts actually used mpicc rather 
>> than mpiCC, switching to mpiCC was simply to check out if the problem 
>> persisted. I noticed that both mpicc and mpiCC are linked to the same file 
>> (opal_wrapper) and didn't bother switching it back. I'm not sure if the 
>> wrapper figures out what compiler you call because I was getting the same 
>> error message. Jeff is right pointing out that 'try' is reserved but the 
>> original file seems to be really old (think 1970). Apparently, the new 
>> compiler (shipped with OMPIv4) is more sensitive and beeps when the older 
>> didn't.
>> 
>> Thanks again,
>> 
>> AFernandez
>> 
>> Indeed, you cannot use "try" as a variable name in C++ because it is a 
>> keyword (https://en.cppreference.com/w/cpp/keyword).
>> 
>> As already suggested, use a C compiler, or you can replace "try" with "xtry" 
>> or any other non-reserved word.
>> 
>> Jeff
>> 
>> On Wed, Apr 3, 2019 at 1:41 PM Gutierrez, Samuel K. via users 
>> mailto:users@lists.open-mpi.org>> wrote:
>> 
>>Hi,
>> 
>>It looks like you are using the C++ wrapper compiler (mpiCC)
>>instead of the C wrapper compiler (mpicc). Perhaps using mpicc
>>instead of mpiCC will resolve your issue.
>> 
>>Best,
>> 
>>Sam
>> 
>> 
>> 
>>On Apr 3, 2019, at 12:38 PM, afernan...@odyhpc.com
>> wrote:
>> 
>>Hello,
>> 
>>I'm trying to compile HPL(v2.3) with OpenBLAS and OMPI. The
>>compilation succeeds when using the old OMPI (v1.10.8) but
>>fails with OMPI v4.0.0 (I'm still not using v4.0.1). The error
>>is for an old subroutine that determines machine-specific
>>arithmetic constants:
>> 
>>mpiCC -o HPL_dlamch.o -c
>>-I/home/centos/benchmarks/hpl-2.2/include
>>-I/home/centos/benchmarks/hpl-2.2/include/impetus03
>>-I/opt/openmpi/include  ../HPL_dlamch.c
>> 
>>../HPL_dlamch.c: In function ‘void HPL_dlamc5(int, int, int,
>>int, int*, double*)’:
>> 
>>../HPL_dlamch.c:749:67: error: expected unqualified-id before
>>‘try’
>> 
>>intexbits=1, expsum, i, lexp=1, nbits,
>>try,
>> 
>>^
>> 
>>../HPL_dlamch.c:761:8: error: expected ‘{’ before ‘=’ token
>> 
>>try = (int)( (unsigned int)(lexp) << 1 );
>> 
>>^
>> 
>>../HPL_dlamch.c:761:8: error: expected ‘catch’ before ‘=’ token
>> 
>>../HPL_dlamch.c:761:8: error: expected ‘(’ before ‘=’ token
>> 
>>../HPL_dlamch.c:761:8: error: expected type-specifier before
>>‘=’ token
>> 
>>../HPL_dlamch.c:761:8: error: expected ‘)’ before ‘=’ token
>> 
>>../HPL_dlamch.c:761:8: error: expected ‘{’ before ‘=’ token
>> 
>>../HPL_dlamch.c:761:8: error: expected primary-expression
>>before ‘=’ token
>> 
>>../HPL_dlamch.c:762:8: error: expected primary-expression
>>before ‘try’
>> 
>>if( try <= ( -EMIN ) ) { lexp = try; exbits++; goto l_10; }
>> 
>>^
>> 
>>../HPL_dlamch.c:762:8: error: expected ‘)’ before ‘try’
>> 
>>../HPL_dlamch.c:762:36: error: expected primary-expression
>>before ‘try’
>> 
>>if( try <= ( -EMIN ) ) { lexp = try; exbits++; goto l_10; }
>> 
>>^
>> 
>>../HPL_dlamch.c:762:36: error: expected ‘;’ before ‘try’
>> 
>>../HPL_dlamch.c:764:26: error: ‘uexp’ was not declared in this
>>scope
>> 
>>if( lexp == -EMIN ) { uexp = lexp; } else { uexp = try;
>>exbits++; }
>> 
>>^
>> 
>>../HPL_dlamch.c:764:48: error: ‘uexp’ was not declared in this
>>scope
>> 
>>if( lexp == -EMIN ) { uexp = lexp; } else { uexp = try;
>>exbits++; }
>> 
>>^
>> 
>>../HPL_dlamch.c:764:55: e

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Nathan Hjelm via users
I will try to take a look at it today.

-Nathan

> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users 
>  wrote:
> 
> Nathan,
> 
> Over the last couple of weeks I made some more interesting observations 
> regarding the latencies of accumulate operations on both Aries and InfiniBand 
> systems:
> 
> 1) There seems to be a significant difference between 64bit and 32bit 
> operations: on Aries, the average latency for compare-exchange on 64bit 
> values takes about 1.8us while on 32bit values it's at 3.9us, a factor of 
> >2x. On the IB cluster, all of fetch-and-op, compare-exchange, and accumulate 
> show a similar difference between 32 and 64bit. There are no differences 
> between 32bit and 64bit puts and gets on these systems.
> 
> 2) On both systems, the latency for a single-value atomic load using 
> MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 
> 64bit values, roughly matching the latency of 32bit compare-exchange 
> operations.
> 
> All measurements were done using Open MPI 3.1.2 with 
> OMPI_MCA_osc_rdma_acc_single_intrinsic=true. Is that behavior expected as 
> well?
> 
> Thanks,
> Joseph
> 
> 
> On 11/6/18 6:13 PM, Nathan Hjelm via users wrote:
>> All of this is completely expected. Due to the requirements of the standard 
>> it is difficult to make use of network atomics even for MPI_Compare_and_swap 
>> (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want 
>> MPI_Fetch_and_op to be fast set this MCA parameter:
>> osc_rdma_acc_single_intrinsic=true
>> Shared lock is slower than an exclusive lock because there is an extra lock 
>> step as part of the accumulate (it isn't needed if there is an exclusive 
>> lock). When setting the above parameter you are telling the implementation 
>> that you will only be using a single count and we can optimize that with the 
>> hardware. The RMA working group is working on an info key that will 
>> essentially do the same thing.
>> Note the above parameter won't help you with IB if you are using UCX unless 
>> you set this (master only right now):
>> btl_uct_transports=dc_mlx5
>> btl=self,vader,uct
>> osc=^ucx
>> Though there may be a way to get osc/ucx to enable the same sort of 
>> optimization. I don't know.
>> -Nathan
>> On Nov 06, 2018, at 09:38 AM, Joseph Schuchart  wrote:
>>> All,
>>> 
>>> I am currently experimenting with MPI atomic operations and wanted to
>>> share some interesting results I am observing. The numbers below are
>>> measurements from both an IB-based cluster and our Cray XC40. The
>>> benchmarks look like the following snippet:
>>> 
>>> ```
>>> if (rank == 1) {
>>> uint64_t res, val;
>>> for (size_t i = 0; i < NUM_REPS; ++i) {
>>> MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
>>> MPI_Win_flush(target, win);
>>> }
>>> }
>>> MPI_Barrier(MPI_COMM_WORLD);
>>> ```
>>> 
>>> Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
>>> have tried to confirm that the operations are done in hardware by
>>> letting rank 0 sleep for a while and ensuring that communication
>>> progresses). Of particular interest for my use-case is fetch_op but I am
>>> including other operations here nevertheless:
>>> 
>>> * Linux Cluster, IB QDR *
>>> average of 10 iterations
>>> 
>>> Exclusive lock, MPI_UINT32_T:
>>> fetch_op: 4.323384us
>>> compare_exchange: 2.035905us
>>> accumulate: 4.326358us
>>> get_accumulate: 4.334831us
>>> 
>>> Exclusive lock, MPI_UINT64_T:
>>> fetch_op: 2.438080us
>>> compare_exchange: 2.398836us
>>> accumulate: 2.435378us
>>> get_accumulate: 2.448347us
>>> 
>>> Shared lock, MPI_UINT32_T:
>>> fetch_op: 6.819977us
>>> compare_exchange: 4.551417us
>>> accumulate: 6.807766us
>>> get_accumulate: 6.817602us
>>> 
>>> Shared lock, MPI_UINT64_T:
>>> fetch_op: 4.954860us
>>> compare_exchange: 2.399373us
>>> accumulate: 4.965702us
>>> get_accumulate: 4.977876us
>>> 
>>> There are two interesting observations:
>>> a) operations on 64bit operands generally seem to have lower latencies
>>> than operations on 32bit
>>> b) Using an exclusive lock leads to lower latencies
>>> 
>>> Overall, there is a factor of almost 3 between SharedLock+uint32_t and
>>> ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate.

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Nathan Hjelm via users











> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users 
>  wrote:
> 
> Nathan,
> 
> Over the last couple of weeks I made some more interesting observations 
> regarding the latencies of accumulate operations on both Aries and InfiniBand 
> systems:
> 
> 1) There seems to be a significant difference between 64bit and 32bit 
> operations: on Aries, the average latency for compare-exchange on 64bit 
> values takes about 1.8us while on 32bit values it's at 3.9us, a factor of 
> >2x. On the IB cluster, all of fetch-and-op, compare-exchange, and accumulate 
> show a similar difference between 32 and 64bit. There are no differences 
> between 32bit and 64bit puts and gets on these systems.


1) On Aries 32-bit and 64-bit CAS operations should have similar performance. 
This looks like a bug and I will try to track it down now.

2) On Infiniband when using verbs we only have access to 64-bit atomic memory 
operations (limitation of the now-dead btl/openib component). I think there may 
be support in UCX for 32-bit AMOs but the support is not implemented in Open 
MPI (at least not in btl/uct). I can take a look at btl/uct and see what I find.

> 2) On both systems, the latency for a single-value atomic load using 
> MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 
> 64bit values, roughly matching the latency of 32bit compare-exchange 
> operations.

This is expected given the current implementation. When doing MPI_NO_OP it 
falls back to the lock + get. I suppose I can change it to use MPI_SUM with an 
operand of 0. Will investigate.


-Nathan
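
For reference, the two ways of doing an atomic single-element read that are being compared look roughly like this sketch (helper names are made up; an open passive-target epoch on the window is assumed, e.g. via MPI_Win_lock_all):

```
/* Sketch: atomic read of a remote 64-bit value, two variants. */
#include <mpi.h>
#include <stdint.h>

uint64_t read_noop(MPI_Win win, int target)
{
    uint64_t dummy = 0, result;
    /* MPI_NO_OP: per the note above, currently lock + get in osc/rdma */
    MPI_Fetch_and_op(&dummy, &result, MPI_UINT64_T, target, 0, MPI_NO_OP, win);
    MPI_Win_flush(target, win);
    return result;
}

uint64_t read_sum_zero(MPI_Win win, int target)
{
    uint64_t zero = 0, result;
    /* MPI_SUM with an operand of 0: can map to a hardware fetch-and-add */
    MPI_Fetch_and_op(&zero, &result, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
    return result;
}
```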
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Nathan Hjelm via users

THAT is a good idea. When using Omnipath we see an issue with stale files in 
/dev/shm if the application exits abnormally. I don't know if UCX uses that 
space as well.


-Nathan

On June 20, 2019 at 11:05 AM, Joseph Schuchart via users 
 wrote:


Noam,

Another idea: check for stale files in /dev/shm/ (or a subdirectory that
looks like it belongs to UCX/OpenMPI) and SysV shared memory using `ipcs
-m`.

Joseph

On 6/20/19 3:31 PM, Noam Bernstein via users wrote:





On Jun 20, 2019, at 4:44 AM, Charles A Taylor mailto:chas...@ufl.edu>> wrote:


This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought
the fix was landed in 4.0.0 but you might
want to check the code to be sure there wasn’t a regression in 4.1.x.
 Most of our codes are still running
3.1.2 so I haven’t built anything beyond 4.0.0 which definitely
included the fix.


Unfortunately, 4.0.0 behaves the same.


One thing that I’m wondering if anyone familiar with the internals can
explain is how you get a memory leak that isn’t freed when the program
ends?  Doesn’t that suggest that it’s something lower level, like maybe
a kernel issue?


Noam





Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil




___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] undefined reference error related to ucx

2019-06-26 Thread Nathan Hjelm via users
Unless you are using OSHMEM I do not recommend using UCX on a Cray. You will 
likely get better performance with the built-in uGNI support.

-Nathan

> On Jun 25, 2019, at 1:51 AM, Passant A. Hafez via users 
>  wrote:
> 
> Thanks Gilles!
> 
> The thing is I'm having this error
> ud_iface.c:271  UCX Assertion `qp_init_attr.cap.max_inline_data >= 
> UCT_UD_MIN_INLINE' failed
> and core files.
> 
> I looked that up and it was suggested here 
> https://github.com/openucx/ucx/issues/3336 that the UCX 1.6 might solve this 
> issue, so I tried the pre-release version to just check if it will.
> 
> 
> 
> 
> All the best,
> --
> Passant 
> 
> 
> From: users  on behalf of Gilles 
> Gouaillardet via users 
> Sent: Tuesday, June 25, 2019 11:27 AM
> To: Open MPI Users
> Cc: Gilles Gouaillardet
> Subject: Re: [OMPI users] undefined reference error related to ucx
> 
> Passant,
> 
> UCX 1.6.0 is not yet officially released, and it seems Open MPI
> (4.0.1) does not support it yet, and some porting is needed.
> 
> Cheers,
> 
> Gilles
> 
> On Tue, Jun 25, 2019 at 5:13 PM Passant A. Hafez via users
>  wrote:
>> 
>> Hello,
>> 
>> 
>> I'm trying to build ompi 4.0.1 with external ucx 1.6.0 but I'm getting
>> 
>> 
>> ../../../opal/.libs/libopen-pal.so: undefined reference to 
>> `uct_ep_create_connected'
>> collect2: error: ld returned 1 exit status
>> 
>> configure line for ompi
>> ./configure --prefix=/opt/ompi401_ucx16 --with-slurm --with-hwloc=internal 
>> --with-pmix=internal --enable-shared --enable-static --with-x 
>> --with-ucx=/opt/ucx-1.6.0
>> 
>> configure line for ucx
>> ./configure --prefix=/opt/ucx-1.6.0
>> 
>> 
>> What could be the reason?
>> 
>> 
>> 
>> 
>> 
>> 
>> All the best,
>> --
>> Passant
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] How is the rank determined (Open MPI and Podman)

2019-07-21 Thread Nathan Hjelm via users
Patches are always welcome. What would be great is a nice big warning that CMA 
support is disabled because the processes are on different namespaces. Ideally 
all MPI processes should be on the same namespace to ensure the best 
performance. 

-Nathan

> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users 
>  wrote:
> 
> For completeness I am mentioning my results also here.
> 
> To be able to mount file systems in the container it can only work if
> user namespaces are used and even if the user IDs are all the same (in
> each container and on the host), to be able to ptrace the kernel also
> checks if the processes are in the same user namespace (in addition to
> being owned by the same user). This check - same user namespace - fails
> and so process_vm_readv() and process_vm_writev() will also fail.
> 
> So Open MPI's checks are currently not enough to detect if 'cma' can be
> used. Checking for the same user namespace would also be necessary.
> 
> Is this a use case important enough to accept a patch for it?
> 
>Adrian
> 
>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
>> Gilles,
>> 
>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps
>> indeed.
>> 
>> The default seems to be 'cma' and that seems to use process_vm_readv()
>> and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but
>> telling Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE'
>> does not seem to be enough. Not sure yet if this related to the fact
>> that Podman is running rootless. I will continue to investigate, but now
>> I know where to look. Thanks!
>> 
>>Adrian
>> 
>>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users 
>>> wrote:
>>> Adrian,
>>> 
>>> Can you try
>>> mpirun --mca btl_vader_copy_mechanism none ...
>>> 
>>> Please double check the MCA parameter name, I am AFK
>>> 
>>> IIRC, the default copy mechanism used by vader directly accesses the remote 
>>> process address space, and this requires some permission (ptrace?) that 
>>> might be dropped by podman.
>>> 
>>> Note Open MPI might not detect both MPI tasks run on the same node because 
>>> of podman.
>>> If you use UCX, then btl/vader is not used at all (pml/ucx is used instead)
>>> 
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> Sent from my iPod
>>> 
 On Jul 12, 2019, at 18:33, Adrian Reber via users 
  wrote:
 
 So upstream Podman was really fast and merged a PR which makes my
 wrapper unnecessary:
 
 Add support for --env-host : https://github.com/containers/libpod/pull/3557
 
 As commented in the PR I can now start mpirun with Podman without a
 wrapper:
 
 $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun 
 podman run --env-host --security-opt label=disable -v 
 /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test 
 /home/mpi/ring
 Rank 0 has cleared MPI_Init
 Rank 1 has cleared MPI_Init
 Rank 0 has completed ring
 Rank 0 has completed MPI_Barrier
 Rank 1 has completed ring
 Rank 1 has completed MPI_Barrier
 
 This is example was using TCP and on an InfiniBand based system I have
 to map the InfiniBand devices into the container.
 
 $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base 
 /tmp/podman-mpirun podman run --env-host -v 
 /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable 
 --userns=keep-id --device /dev/infiniband/uverbs0 --device 
 /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test 
 /home/mpi/ring
 Rank 0 has cleared MPI_Init
 Rank 1 has cleared MPI_Init
 Rank 0 has completed ring
 Rank 0 has completed MPI_Barrier
 Rank 1 has completed ring
 Rank 1 has completed MPI_Barrier
 
 This is all running without root and only using Podman's rootless
 support.
 
 Running multiple processes on one system, however, still gives me an
 error. If I disable vader I guess that Open MPI is using TCP for
 localhost communication and that works. But with vader it fails.
 
 The first error message I get is a segfault:
 
 [test1:1] *** Process received signal ***
 [test1:1] Signal: Segmentation fault (11)
 [test1:1] Signal code: Address not mapped (1)
 [test1:1] Failing at address: 0x7fb7b1552010
 [test1:1] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
 [test1:1] [ 1] 
 /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
 [test1:1] [ 2] 
 /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
 [test1:1] [ 3] 
 /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
 [test1:1] [ 4] 
 /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
 [test1:1] [ 5

Re: [OMPI users] How is the rank determined (Open MPI and Podman)

2019-07-22 Thread Nathan Hjelm via users
Just add it to the existing modex.

-Nathan
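
The check itself is cheap; a hypothetical sketch using plain MPI instead of the OPAL modex, comparing the inode of /proc/self/ns/user across ranks:

```
/* Sketch: detect whether all ranks share one user namespace by comparing
 * the inode of /proc/self/ns/user (plain MPI used here instead of the
 * OPAL modex, purely for illustration). */
#include <mpi.h>
#include <sys/stat.h>
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    struct stat st;
    uint64_t my_ns = 0;
    if (stat("/proc/self/ns/user", &st) == 0) {
        my_ns = (uint64_t)st.st_ino;   /* namespace identity = inode number */
    }

    uint64_t min_ns, max_ns;
    MPI_Allreduce(&my_ns, &min_ns, 1, MPI_UINT64_T, MPI_MIN, MPI_COMM_WORLD);
    MPI_Allreduce(&my_ns, &max_ns, 1, MPI_UINT64_T, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0 && min_ns != max_ns) {
        fprintf(stderr, "warning: ranks run in different user namespaces; "
                        "CMA single-copy will not work\n");
    }

    MPI_Finalize();
    return 0;
}
```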

> On Jul 22, 2019, at 12:20 PM, Adrian Reber via users 
>  wrote:
> 
> I have most of the code ready, but I still have troubles doing
> OPAL_MODEX_RECV. I am using the following lines, based on the code from
> orte/test/mpi/pmix.c:
> 
> OPAL_MODEX_SEND_VALUE(rc, OPAL_PMIX_LOCAL, "user_ns_id", &value, OPAL_INT);
> 
> This sets rc to 0. For receiving:
> 
> OPAL_MODEX_RECV_VALUE(rc, "user_ns_id", &wildcard_rank, &ptr, OPAL_INT);
> 
> and rc is always set to -13. Is this how it is supposed to work, or do I
> have to do it differently?
> 
>Adrian
> 
>> On Mon, Jul 22, 2019 at 02:03:20PM +, Ralph Castain via users wrote:
>> If that works, then it might be possible to include the namespace ID in the 
>> job-info provided by PMIx at startup - would have to investigate, so please 
>> confirm that the modex option works first.
>> 
>>> On Jul 22, 2019, at 1:22 AM, Gilles Gouaillardet via users 
>>>  wrote:
>>> 
>>> Adrian,
>>> 
>>> 
>>> An option is to involve the modex.
>>> 
>>> each task would OPAL_MODEX_SEND() its own namespace ID, and then 
>>> OPAL_MODEX_RECV()
>>> 
>>> the one from its peers and decide whether CMA support can be enabled.
>>> 
>>> 
>>> Cheers,
>>> 
>>> 
>>> Gilles
>>> 
 On 7/22/2019 4:53 PM, Adrian Reber via users wrote:
 I had a look at it and not sure if it really makes sense.
 
 In btl_vader_{put,get}.c it would be easy to check for the user
 namespace ID of the other process, but the function would then just
 return OPAL_ERROR a bit earlier instead of as a result of
 process_vm_{read,write}v(). Nothing would really change.
 
 A better place for the check would be mca_btl_vader_check_single_copy()
 but I do not know if at this point the PID of the other processes is
 already known. Not sure if I can check for the user namespace ID of the
 other processes.
 
 Any recommendations how to do this?
 
Adrian
 
> On Sun, Jul 21, 2019 at 03:08:01PM -0400, Nathan Hjelm wrote:
> Patches are always welcome. What would be great is a nice big warning 
> that CMA support is disabled because the processes are on different 
> namespaces. Ideally all MPI processes should be on the same namespace to 
> ensure the best performance.
> 
> -Nathan
> 
>> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users 
>>  wrote:
>> 
>> For completeness I am mentioning my results also here.
>> 
>> To be able to mount file systems in the container it can only work if
>> user namespaces are used and even if the user IDs are all the same (in
>> each container and on the host), to be able to ptrace the kernel also
>> checks if the processes are in the same user namespace (in addition to
>> being owned by the same user). This check - same user namespace - fails
>> and so process_vm_readv() and process_vm_writev() will also fail.
>> 
>> So Open MPI's checks are currently not enough to detect if 'cma' can be
>> used. Checking for the same user namespace would also be necessary.
>> 
>> Is this a use case important enough to accept a patch for it?
>> 
>>   Adrian
>> 
>>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
>>> Gilles,
>>> 
>>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps
>>> indeed.
>>> 
>>> The default seems to be 'cma' and that seems to use process_vm_readv()
>>> and process_vm_writev(). That seems to require CAP_SYS_PTRACE, but
>>> telling Podman to give the process CAP_SYS_PTRACE with 
>>> '--cap-add=SYS_PTRACE'
>>> does not seem to be enough. Not sure yet if this related to the fact
>>> that Podman is running rootless. I will continue to investigate, but now
>>> I know where to look. Thanks!
>>> 
>>>   Adrian
>>> 
 On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via 
 users wrote:
 Adrian,
 
 Can you try
 mpirun --mca btl_vader_copy_mechanism none ...
 
 Please double check the MCA parameter name, I am AFK
 
 IIRC, the default copy mechanism used by vader directly accesses the 
 remote process address space, and this requires some permission 
 (ptrace?) that might be dropped by podman.
 
 Note Open MPI might not detect both MPI tasks run on the same node 
 because of podman.
 If you use UCX, then btl/vader is not used at all (pml/ucx is used 
 instead)
 
 
 Cheers,
 
 Gilles
 
 Sent from my iPod
 
> On Jul 12, 2019, at 18:33, Adrian Reber via users 
>  wrote:
> 
> So upstream Podman was really fast and merged a PR which makes my
> wrapper unnecessary:
> 
> Add support for --env-host : 
> ht

Re: [OMPI users] OpenMPI slowdown in latency bound application

2019-08-28 Thread Nathan Hjelm via users
Is this overall runtime or solve time? The former is essentially meaningless as 
it includes all the startup time (launch, connections, etc). Especially since 
we are talking about seconds here.

-Nathan
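
One way to separate the two, as a generic sketch (not from the application in question): time only the solve region with MPI_Wtime after a barrier and report the maximum across ranks.

```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* startup cost (launch, connection setup) is excluded from here on */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    /* ... solver iterations (the latency-bound part) go here ... */

    double t_solve = MPI_Wtime() - t0, t_max;
    MPI_Reduce(&t_solve, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("solve time: %.3f s\n", t_max);

    MPI_Finalize();
    return 0;
}
```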

> On Aug 28, 2019, at 9:10 AM, Cooper Burns via users 
>  wrote:
> 
> Peter,
> 
> It looks like:
> Node0:
> rank0, rank1, rank2, etc..
> Node1:
> rank12, rank13, etc
> etc
> 
> So the mapping looks good to me.
> 
> Thanks,
> Cooper
> Cooper Burns
> Senior Research Engineer
> 
> 
> (608) 230-1551
> convergecfd.com
> 
> 
> 
>> On Wed, Aug 28, 2019 at 10:50 AM Peter Kjellström  wrote:
>> On Wed, 28 Aug 2019 09:45:15 -0500
>> Cooper Burns  wrote:
>> 
>> > Peter,
>> > 
>> > Thanks for your input!
>> > I tried some things:
>> > 
>> > *1) The app was placed/pinned differently by the two MPIs. Often this
>> > would probably not cause such a big difference.*
>> > I agree this is unlikely the cause, however I tried various
>> > configurations of map-by, bind-to, etc and none of them had any
>> > measurable impact at all, which points to this not being the cause
>> > (as you suspected)
>> 
>> OK, there's still one thing to rule out, which rank was placed on which
>> node.
>> 
>> For OpenMPI you can pass "-report-bindings" and verify that the first N
>> ranks are placed on the first node (for N cores or ranks per node).
>> 
>> node0: r0 r4 r8 ...
>> node1: r1 ...
>> node2: r2 ...
>> node3: r3 ...
>> 
>> vs
>> 
>> node0: r0 r1 r2 r3 ...
>> 
>> > *2) Bad luck wrt collective performance. Different MPIs have
>> > different weak spots across the parameter space of
>> > numranks,transfersize,mpi-collective.* This is possible... But the
>> > magnitude of the runtime difference seems too large to me... Are
>> > there any options we can give to OMPI to cause it to use different
>> > collective algorithms so that we can test this theory?
>> 
>> It can certainly cause the observed difference. I've seen very large
>> differences...
>> 
>> To get collective tunables from OpenMPI do something like:
>> 
>>  ompi_info --param coll all --level 5
>> 
>> But it will really help to know or suspect what collectives the
>> application depend on.
>> 
>> For example, if you suspected alltoall to be a factor you could sweep
>> all valid alltoall algorithms by setting:
>> 
>>  -mca coll coll_tuned_alltoall_algorithm X
>> 
>> Where X is 0..6 in my case (ompi_info returned: 0 ignore, 1 basic
>> linear, 2 bruck, 3 recursive doubling, 4 ring, 5 neighbor exchange, 6:
>> two proc only.)
>> 
>> /Peter
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Nathan Hjelm via users
The error is from btl/vader. CMA is not functioning as expected. It might work 
if you set btl_vader_single_copy_mechanism=none

Performance will suffer though. It would be worth understanding why 
process_vm_readv is failing.

Can you send a simple reproducer?

-Nathan
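
A minimal reproducer along these lines might look like the following sketch (hypothetical; single node, two ranks):

```
/* Minimal single-node MPI_Get sketch, e.g.:  mpirun -np 2 ./get_test */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *base;
    MPI_Win win;
    MPI_Win_allocate(4 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    for (int i = 0; i < 4; ++i) base[i] = rank * 10 + i;

    MPI_Win_fence(0, win);

    int peer = (rank + 1) % 2, buf[4];
    MPI_Get(buf, 4, MPI_INT, peer, 0, 4, MPI_INT, win);   /* the call that fails above */

    MPI_Win_fence(0, win);
    printf("rank %d got %d %d %d %d from rank %d\n",
           rank, buf[0], buf[1], buf[2], buf[3], peer);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```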

> On Feb 24, 2020, at 2:59 PM, Gabriel, Edgar via users 
>  wrote:
> 
> 
> I am not an expert for the one-sided code in Open MPI, I wanted to comment 
> briefly on the potential MPI -IO related item. As far as I can see, the error 
> message
>  
> “Read -1, expected 48, errno = 1” 
> 
> does not stem from MPI I/O, at least not from the ompio library. What file 
> system did you use for these tests?
>  
> Thanks
> Edgar
>  
> From: users  On Behalf Of Matt Thompson via 
> users
> Sent: Monday, February 24, 2020 1:20 PM
> To: users@lists.open-mpi.org
> Cc: Matt Thompson 
> Subject: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, 
> Fails in Open MPI
>  
> All,
>  
> My guess is this is a "I built Open MPI incorrectly" sort of issue, but I'm 
> not sure how to fix it. Namely, I'm currently trying to get an MPI project's 
> CI working on CircleCI using Open MPI to run some unit tests (on a single 
> node, so need some oversubscribe). I can build everything just fine, but when 
> I try to run, things just...blow up:
>  
> [root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18 
> -oversubscribe /root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 -nso 
> 6 -ngo 1 -ngi 1 -v T,U -s mpi
>  start app rank:   0
>  start app rank:   1
>  start app rank:   2
>  start app rank:   3
>  start app rank:   4
>  start app rank:   5
> [3796b115c961:03629] Read -1, expected 48, errno = 1
> [3796b115c961:03629] *** An error occurred in MPI_Get
> [3796b115c961:03629] *** reported by process [2144600065,12]
> [3796b115c961:03629] *** on win rdma window 5
> [3796b115c961:03629] *** MPI_ERR_OTHER: known error not in list
> [3796b115c961:03629] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
> abort,
> [3796b115c961:03629] ***and potentially your MPI job)
>  
> I'm currently more concerned about the MPI_Get error, though I'm not sure 
> what that "Read -1, expected 48, errno = 1" bit is about (MPI-IO error?). Now 
> this code is fairly fancy MPI code, so I decided to try a simpler one. 
> Searched the internet and found an example program here:
>  
> https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication
>  
> and when I build and run with Intel MPI it works:
>  
> (1027)(master) $ mpirun -V
> Intel(R) MPI Library for Linux* OS, Version 2018 Update 4 Build 20180823 (id: 
> 18555)
> Copyright 2003-2018 Intel Corporation.
> (1028)(master) $ mpiicc rma_test.c
> (1029)(master) $ mpirun -np 2 ./a.out
> srun.slurm: cluster configuration lacks support for cpu binding
> Rank 0 running on borgj001
> Rank 1 running on borgj001
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> Rank 0 gets data from the shared memory: 10 11 12 13
> Rank 1 gets data from the shared memory: 00 01 02 03
> Rank 0 has new data in the shared memory:Rank 1 has new data in the shared 
> memory: 10 11 12 13
>  00 01 02 03
>  
> So, I have some confidence it was written correctly. Now on the same system I 
> try with Open MPI (building with gcc, not Intel C):
>  
> (1032)(master) $ mpirun -V
> mpirun (Open MPI) 4.0.1
> 
> Report bugs to http://www.open-mpi.org/community/help/
> (1033)(master) $ mpicc rma_test.c
> (1034)(master) $ mpirun -np 2 ./a.out
> Rank 0 running on borgj001
> Rank 1 running on borgj001
> Rank 0 sets data in the shared memory: 00 01 02 03
> Rank 1 sets data in the shared memory: 10 11 12 13
> [borgj001:22668] *** An error occurred in MPI_Get
> [borgj001:22668] *** reported by process [2514223105,1]
> [borgj001:22668] *** on win rdma window 3
> [borgj001:22668] *** MPI_ERR_RMA_RANGE: invalid RMA address range
> [borgj001:22668] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
> abort,
> [borgj001:22668] ***and potentially your MPI job)
> [borgj001:22642] 1 more process has sent help message help-mpi-errors.txt / 
> mpi_errors_are_fatal
> [borgj001:22642] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
>  
> This is a similar failure to above. Any ideas what I might be doing wrong 
> here? I don't doubt I'm missing something, but I'm not sure what. Open MPI 
> was built pretty boringly:
>  
> Configure command line: '--with-slurm' '--enable-shared' 
> '--disable-wrapper-rpath' '--disable-wrapper-runpath' 
> '--enable-mca-no-build=btl-usnic' '--prefix=...'
>  
> And I'm not sure if we need those disable-wrapper bits anymore, but long ago 
> we needed them, and so they've lived on in "how to build" READMEs until 
> something breaks. This btl-usnic is a bit unknown to me (this was built by 
> sysadmins on a cluster), but this is pretty close to how I build on my 
> desk

Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

2021-03-04 Thread Nathan Hjelm via users
I would run the v4.x series and install xpmem if you can 
(http://github.com/hjelmn/xpmem ). You will 
need to build with --with-xpmem=/path/to/xpmem to use xpmem; otherwise vader will 
default to using CMA. This will provide the best possible performance.

-Nathan

> On Mar 4, 2021, at 5:55 AM, Raut, S Biplab via users 
>  wrote:
> 
> [AMD Official Use Only - Internal Distribution Only]
>  
> It is a single node execution, so it should be using shared memory (vader).
>  
> With Regards,
> S. Biplab Raut
>  
> From: Heinz, Michael William  > 
> Sent: Thursday, March 4, 2021 5:17 PM
> To: Open MPI Users  >
> Cc: Raut, S Biplab mailto:biplab.r...@amd.com>>
> Subject: Re: [OMPI users] Stable and performant openMPI version for 
> Ubuntu20.04 ?
>  
> [CAUTION: External Email] 
> What interconnect are you using at run time? That is, are you using Ethernet 
> or InfiniBand or Omnipath?
> 
> Sent from my iPad
>  
> 
> On Mar 4, 2021, at 5:05 AM, Raut, S Biplab via users 
> mailto:users@lists.open-mpi.org>> wrote:
> 
>  
> [AMD Official Use Only - Internal Distribution Only]
>  
> After downloading a particular openMPI version, let’s say v3.1.1 from 
> https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.1.tar.gz 
> 
>  , I follow the below steps.
> ./configure --prefix="$INSTALL_DIR" --enable-mpi-fortran --enable-mpi-cxx 
> --enable-shared=yes --enable-static=yes --enable-mpi1-compatibility
>   make -j
>   make install
>   export PATH=$INSTALL_DIR/bin:$PATH
>   export LD_LIBRARY_PATH=$INSTALL_DIR/lib:$LD_LIBRARY_PATH
> Additionally, I also install libnuma-dev on the machine.
>  
> For all the machines having Ubuntu 18.04 and 19.04, it works correctly and 
> results in expected performance/GFLOPS.
> But, when OS is changed to Ubuntu 20.04, then I start getting the issues as 
> mentioned in my original/previous mail below.
>  
> With Regards,
> S. Biplab Raut
>  
> From: users  > On Behalf Of John Hearns via users
> Sent: Thursday, March 4, 2021 1:53 PM
> To: Open MPI Users  >
> Cc: John Hearns mailto:hear...@gmail.com>>
> Subject: Re: [OMPI users] Stable and performant openMPI version for 
> Ubuntu20.04 ?
>  
> [CAUTION: External Email] 
> How are you installing the OpenMPI versions? Are you using packages which are 
> distributed by the OS?
>  
> It might be worth looking at using EasyBuild or Spack
> https://docs.easybuild.io/en/latest/Introduction.html 
> 
> https://spack.readthedocs.io/en/latest/ 
> 
>  
>  
> On Thu, 4 Mar 2021 at 07:35, Raut, S Biplab via users 
> mailto:users@lists.open-mpi.org>> wrote:
> [AMD Official Use Only - Internal Distribution Only]
>  
> Dear Experts,
> Until recently, I was using openMPI3.1.1 to run 
> single node 128 ranks MPI application on Ubuntu18.04 and Ubuntu19.04.
> But, now the OS on these machines are upgraded to Ubuntu20.04, and I have 
> been observing program hangs with openMPI3.1.1 version.
> So, I tried with openMPI4.0.5 version – The program ran properly without any 
> issues but there is a performance regression in my application.
>  
> Can I know the stable openMPI version recommended for Ubuntu20.04 that has no 
> known regression compared to v3.1.1.
>  
> With Regards,
> S. Biplab Raut



Re: [OMPI users] Newbie With Issues

2021-03-30 Thread Nathan Hjelm via users

I find it bizarre that icc is looking for a C++ library. That aside if I 
remember correctly intel's compilers do not provide a C++ stdlib implementation 
but instead rely on the one from gcc. You need to verify that libstdc++ is 
installed on the system. On Ubuntu/debian this can be installed with apt-get 
install libstdc++. Not sure about other distros.

-Nathan

On March 30, 2021 at 11:09 AM, "bend linux4ms.net via users" 
 wrote:

I think I have found one of the issues. I took the check C program from OpenMPI 
and tried to compile it, and got the following:

[root@jean-r8-sch24 benchmarks]# icc dummy.c 
ld: cannot find -lstdc++
[root@jean-r8-sch24 benchmarks]# cat dummy.c 
int

main ()
{

;
return 0;
}
[root@jean-r8-sch24 benchmarks]# 


Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212
"Never attribute to malice, that which can be adequately explained by stupidity"
- Hanlon's Razor





From: users  on behalf of bend linux4ms.net via 
users 
Sent: Tuesday, March 30, 2021 12:00 PM
To: Open MPI Users
Cc: bend linux4ms.net
Subject: Re: [OMPI users] Newbie With Issues

Thanks Mr Heinz for responding.

That may be the case with clang, but after sourcing Intel's setvars.sh and then 
issuing the following compile, I get this message:

[root@jean-r8-sch24 openmpi-4.1.0]# icc
icc: command line error: no files specified; for help type "icc -help"
[root@jean-r8-sch24 openmpi-4.1.0]# icc -v
icc version 2021.1 (gcc version 8.3.1 compatibility)
[root@jean-r8-sch24 openmpi-4.1.0]#

This would lead me to believe that icc is still available to use.

This is a government contract and they want the latest and greatest.

Ben Duncan - Business Network Solutions, Inc. 336 Elton Road Jackson MS, 39212
"Never attribute to malice, that which can be adequately explained by stupidity"
- Hanlon's Razor





From: Heinz, Michael William 
Sent: Tuesday, March 30, 2021 11:52 AM
To: Open MPI Users
Cc: bend linux4ms.net
Subject: RE: Newbie With Issues

It looks like you're trying to build Open MPI with the Intel C compiler. TBH, I 
don't think icc is included with the latest release of oneAPI; I believe they've 
switched to including clang instead. I had a similar issue to yours, but 
I resolved it by installing a 2020 version of the Intel HPC software. 
Unfortunately, those versions require purchasing a license.

-Original Message-
From: users  On Behalf Of bend linux4ms.net 
via users
Sent: Tuesday, March 30, 2021 12:42 PM
To: Open MPI Open MPI 
Cc: bend linux4ms.net 
Subject: [OMPI users] Newbie With Issues

Hello group, my name is Ben Duncan. I have been tasked with installing OpenMPI 
and the Intel compiler on an HPC system. I am new to the whole HPC and MPI 
environment, so please be patient with me.

I have successfully gotten the Intel compiler (oneAPI version from 
l_HPCKit_p_2021.1.0.2684_offline.sh) installed without any errors.

I am trying to install and configure OpenMPI 4.1.0; however, running the 
configure script for OpenMPI gives me the following error:


== Configuring Open MPI


*** Startup tests
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for gcc... icc
checking whether the C compiler works... no
configure: error: in `/p/app/openmpi-4.1.0':
configure: error: C compiler cannot create executables
See `config.log' for more details

With the error in config.log being:

configure:6499: $? = 0
configure:6488: icc -qversion >&5
icc: command line warning #10006: ignoring unknown option '-qversion'
icc: command line error: no files specified; for help type "icc -help"
configure:6499: $? = 1
configure:6519: checking whether the C compiler works
configure:6541: icc -O2 conftest.c >&5
ld: cannot find -lstdc++
configure:6545: $? = 1
configure:6583: result: no
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "Open MPI"
| #define PACKAGE_TARNAME "openmpi"
| #define PACKAGE_VERSION "4.1.0"
| #define PACKAGE_STRING "Open MPI 4.1.0"
| #define PACKAGE_BUGREPORT "http://www.open-mpi.org/community/help/"
| #define PACKAGE_URL ""
| #define OPAL_ARCH "x86_64-unknown-linux-gnu"
| /* end confdefs.h. */
|
| int
| main ()
| {
|
| ;
| return 0;
| }
configure:6588: error: in `/p/app/openmpi-4.1.0':
configure:6590: error: C compiler cannot create executables See `co

Re: [OMPI users] MPI_Get is slow with structs containing padding

2023-03-30 Thread Nathan Hjelm via users
Yes. This is absolutely normal. When you give MPI non-contiguous data it has to 
break it down into one operation per contiguous region. If you have a non-RDMA 
network this can lead to very poor performance. With RDMA networks it will still 
be much slower than a contiguous get, but with lower overhead per network operation.

-Nathan
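To make the gap Nathan describes concrete, here is a minimal, self-contained sketch 
(not the attachment from the original post; the pair_t name and the program itself 
are illustrative) that builds the {double, int} datatype and prints its size versus 
its extent. The difference is the padding the RMA layer is not allowed to touch, 
which is why each element ends up as its own contiguous region:

#include <mpi.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical struct matching the {double, int} case from the thread:
 * 12 bytes of data plus 4 bytes of trailing padding on most ABIs. */
typedef struct { double d; int i; } pair_t;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int          blocklens[2] = { 1, 1 };
    MPI_Aint     displs[2]    = { offsetof(pair_t, d), offsetof(pair_t, i) };
    MPI_Datatype types[2]     = { MPI_DOUBLE, MPI_INT };
    MPI_Datatype tmp, pair_type;

    MPI_Type_create_struct(2, blocklens, displs, types, &tmp);
    /* Resize the extent to sizeof(pair_t) so consecutive array elements
     * line up; the typemap still only covers the double and the int. */
    MPI_Type_create_resized(tmp, 0, sizeof(pair_t), &pair_type);
    MPI_Type_commit(&pair_type);

    int size;
    MPI_Aint lb, extent;
    MPI_Type_size(pair_type, &size);
    MPI_Type_get_extent(pair_type, &lb, &extent);
    /* Typically prints "size = 12, extent = 16": a 4-byte gap per element,
     * so a get of N elements decomposes into N non-contiguous pieces. */
    printf("size = %d, extent = %ld\n", size, (long)extent);

    MPI_Type_free(&pair_type);
    MPI_Type_free(&tmp);
    MPI_Finalize();
    return 0;
}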

> On Mar 30, 2023, at 10:43 AM, Antoine Motte via users 
>  wrote:
> 
> 
> Hello everyone,
> 
> I recently had to code an MPI application where I send std::vector contents 
> in a distributed environment. In order to try different approaches I coded 
> both 1-sided and 2-sided point-to-point communication schemes, the first one 
> uses MPI_Window and MPI_Get, the second one uses MPI_SendRecv.
> 
> I had a hard time figuring out why my implementation with MPI_Get was between 
> 10 and 100 times slower, and I finally found out that MPI_Get is abnormally 
> slow when one tries to send custom datatypes including padding.
> 
> Here is a short example attached, where I send a struct {double, int} (12 
> bytes of data + 4 bytes of padding) vs a struct {double, int, int} (16 bytes 
> of data, 0 bytes of padding) with both MPI_SendRecv and MPI_Get. I got these 
> results :
> 
> mpirun -np 4 ./compareGetWithSendRecv 
> {double, int} SendRecv : 0.0303547 s
> {double, int} Get : 1.9196 s
> {double, int, int} SendRecv : 0.0164659 s
> {double, int, int} Get : 0.0147757 s
> 
> I ran it with both Open MPI 4.1.2 and Intel MPI 2021.6 and got the same 
> results.
> 
> Is this result normal? Do I have any solution other than adding garbage at 
> the end of the struct or at the end of the MPI_Datatype to avoid padding?
> 
> Regards,
> 
> Antoine Motte
> 
> 


Re: [OMPI users] MPI_Get is slow with structs containing padding

2023-03-30 Thread Nathan Hjelm via users

That is exactly the issue. Part of the reason I have argued against MPI_SHORT_INT 
usage in RMA is that even though it is padded due to type alignment we are still not 
allowed to operate on the bits between the short and the int. We can correct that one 
in the standard by adding the same language as C (padding bits are undefined), but 
when a user gives us their own datatype we have no options.

Yes, the best usage for the user is to keep the transfer completely contiguous. 
osc/rdma will break it down otherwise, and with tcp that will be really horrible since 
each request becomes essentially a BTL active message.

-Nathan

On Mar 30, 2023, at 1:19 PM, Joseph Schuchart via users  wrote:

Hi Antoine,

That's an interesting result. I believe the problem with datatypes with gaps is that 
MPI is not allowed to touch the gaps. My guess is that for the RMA version of the 
benchmark the implementation either has to revert back to an active message packing 
the data at the target and sending it back, or (which seems more likely in your case) 
transfer each object separately and skip the gaps. Without more information on your 
setup (using UCX?) and the benchmark itself (how many elements? what does the target 
do?) it's hard to be more precise.

A possible fix would be to drop the MPI datatype for the RMA use and transfer the 
vector as a whole, using MPI_BYTE. I think there is also a way to modify the upper 
bound of the MPI type to remove the gap, using MPI_TYPE_CREATE_RESIZED. I expect that 
that will allow MPI to touch the gap and transfer the vector as a whole. I'm not sure 
about the details there, maybe someone can shed some light.

HTH
Joseph

On 3/30/23 18:34, Antoine Motte via users wrote:

Hello everyone,

I recently had to code an MPI application where I send std::vector contents in a 
distributed environment. In order to try different approaches I coded both 1-sided 
and 2-sided point-to-point communication schemes, the first one uses MPI_Window and 
MPI_Get, the second one uses MPI_SendRecv.

I had a hard time figuring out why my implementation with MPI_Get was between 10 and 
100 times slower, and I finally found out that MPI_Get is abnormally slow when one 
tries to send custom datatypes including padding.

Here is a short example attached, where I send a struct {double, int} (12 bytes of 
data + 4 bytes of padding) vs a struct {double, int, int} (16 bytes of data, 0 bytes 
of padding) with both MPI_SendRecv and MPI_Get. I got these results :

mpirun -np 4 ./compareGetWithSendRecv
{double, int} SendRecv : 0.0303547 s
{double, int} Get : 1.9196 s
{double, int, int} SendRecv : 0.0164659 s
{double, int, int} Get : 0.0147757 s

I ran it with both Open MPI 4.1.2 and Intel MPI 2021.6 and got the same results.

Is this result normal? Do I have any solution other than adding garbage at the end of 
the struct or at the end of the MPI_Datatype to avoid padding?

Regards,

Antoine Motte
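For reference, a minimal sketch of Joseph's first suggestion (transferring the vector 
as raw bytes so the padding travels with the data and the get stays contiguous). The 
function name, window setup, and displacement are hypothetical, and it assumes origin 
and target agree on the struct layout, which is the usual case for RMA windows on a 
homogeneous cluster:

#include <mpi.h>

/* Same hypothetical {double, int} struct as in the thread. */
typedef struct { double d; int i; } pair_t;

/* Fetch n pair_t elements from the start of target_rank's window as one
 * contiguous MPI_BYTE transfer instead of n padded {double, int} elements. */
static void fetch_pairs_as_bytes(MPI_Win win, int target_rank,
                                 pair_t *local, int n)
{
    int nbytes = n * (int)sizeof(pair_t);  /* counts are int in MPI-3, so
                                              watch for overflow on huge runs */

    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
    MPI_Get(local, nbytes, MPI_BYTE,
            target_rank, 0 /* displacement into the window */,
            nbytes, MPI_BYTE, win);
    MPI_Win_unlock(target_rank, win);
}

Note that MPI_Type_create_resized only changes a type's extent, not its typemap, so 
the gap itself still cannot be touched that way; the raw-byte approach above simply 
sidesteps the question.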

Re: [OMPI users] Binding to thread 0

2023-09-11 Thread Nathan Hjelm via users

Isn't this a case for --map-by core --bind-to hwthread? Because you want to map each 
process by core but bind to the first hwthread. From the looks of it your process is 
both binding and mapping by hwthread now.

-Nathan

On Sep 11, 2023, at 10:20 AM, Luis Cebamanos via users  wrote:

@Gilles @Jeff Sorry, I think I replied too quickly. This is what I see if using 
--bind-to hwthread. This is not what I was after. I only want to use thread 0 of a 
core, i.e. cores 0-7, so "cores 192-199" should not have any activity. If I do 
--bind-to core, the activity jumps from "core 0" to "core 192", and I want to avoid 
that. Any other suggestion?

Regards
L

On 08/09/2023 17:53, Jeff Squyres (jsquyres) wrote:

In addition to what Gilles mentioned, I'm curious: is there a reason you have 
hardware threads enabled? You could disable them in the BIOS, and then each of your 
MPI processes can use the full core, not just a single hardware thread.

From: users  on behalf of Luis Cebamanos via users 
Sent: Friday, September 8, 2023 7:10 AM
To: Ralph Castain via users 
Cc: Luis Cebamanos 
Subject: [OMPI users] Binding to thread 0

Hello,

Up to now, I have been using numerous ways of binding with wrappers (numactl, 
taskset) whenever I wanted to play with core placing. Another way I have been using 
is via -rankfile, however I notice that some ranks jump from thread 0 to thread 1 on 
SMT chips. I can control this with numactl for instance, but it would be great to see 
similar behaviour when using -rankfile. Is there a way to pack all ranks to one of 
the threads of each core (preferably thread 0) so I can nicely see all ranks with 
htop on either left or right of the screen?

The command I am using is pretty simple:

mpirun -np $MPIRANKS --rankfile ./myrankfile

and ./myrankfile looks like

rank 33=argon slot=33
rank 34=argon slot=34
rank 35=argon slot=35
rank 36=argon slot=36

Thanks!
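As a concrete illustration of Nathan's suggestion, the original command would 
presumably become something like the following (the ./my_app binary is hypothetical, 
and the rankfile is dropped in favor of the mapping policy):

mpirun -np $MPIRANKS --map-by core --bind-to hwthread ./my_app

This maps one rank per core but binds each rank to only the first hardware thread of 
its core, which is the behaviour Luis was after.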