Thanks for confirming that it works for you as well. I have a PR open on v3.1.x 
that brings osc/rdma up to date with master. I will also be bringing over code 
that greatly improves multi-threaded RMA performance on Aries systems (at least 
in benchmarks; see github.com/hpc/rma-mt). That will not make it into v3.1.x 
but will be in v4.0.0.

-Nathan

> On May 9, 2018, at 1:26 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> 
> Nathan,
> 
> Thank you, I can confirm that it works as expected with master on our system. 
> I will stick with this version until 3.1.1 is out.
> 
> Joseph
> 
> On 05/08/2018 05:34 PM, Nathan Hjelm wrote:
>> Looks like it doesn't fail with master, so at some point I fixed this bug. 
>> The current plan is to bring all the master changes into v3.1.1. This 
>> includes a number of bug fixes.
>> -Nathan
>> On May 08, 2018, at 08:25 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>> Nathan,
>>> 
>>> Thanks for looking into that. My test program is attached.
>>> 
>>> Best
>>> Joseph
>>> 
>>> On 05/08/2018 02:56 PM, Nathan Hjelm wrote:
>>>> I will take a look today. Can you send me your test program?
>>>> 
>>>> -Nathan
>>>> 
>>>>> On May 8, 2018, at 2:49 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>>> 
>>>>> All,
>>>>> 
>>>>> I have been experimenting with using Open MPI 3.1.0 on our Cray XC40 
>>>>> (Haswell-based nodes, Aries interconnect) for multi-threaded MPI RMA. 
>>>>> Unfortunately, a simple (single-threaded) test case consisting of two 
>>>>> processes performing an MPI_Rget+MPI_Wait hangs when running on two 
>>>>> nodes. It succeeds if both processes run on a single node.
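>>>>>
>>>>> A minimal sketch of such a test (not the exact program I attached; the 
>>>>> window setup and neighbor exchange here are illustrative):
>>>>>
>>>>> ```
>>>>> #include <mpi.h>
>>>>> #include <assert.h>
>>>>>
>>>>> int main(int argc, char **argv) {
>>>>>     MPI_Init(&argc, &argv);
>>>>>     int rank, size;
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>
>>>>>     /* Allocate a one-integer window on every process. */
>>>>>     int *baseptr;
>>>>>     MPI_Win win;
>>>>>     MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
>>>>>                      MPI_COMM_WORLD, &baseptr, &win);
>>>>>     *baseptr = rank;
>>>>>     MPI_Win_lock_all(0, win);
>>>>>
>>>>>     /* Fetch the neighbor's value; this Rget+Wait pair is what hangs
>>>>>      * when the two processes are on different nodes. */
>>>>>     int target = (rank + 1) % size;
>>>>>     int val = -1;
>>>>>     MPI_Request req;
>>>>>     MPI_Rget(&val, 1, MPI_INT, target, 0, 1, MPI_INT, win, &req);
>>>>>     MPI_Wait(&req, MPI_STATUS_IGNORE);
>>>>>     assert(val == target);
>>>>>
>>>>>     MPI_Win_unlock_all(win);
>>>>>     MPI_Win_free(&win);
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>> ```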
>>>>> 
>>>>> For completeness, I am attaching the config.log. The build environment 
>>>>> was set up to build Open MPI for the login nodes (I wasn't sure how to 
>>>>> properly cross-compile the libraries):
>>>>> 
>>>>> ```
>>>>> # this seems necessary to avoid a linker error during build
>>>>> export CRAYPE_LINK_TYPE=dynamic
>>>>> module swap PrgEnv-cray PrgEnv-intel
>>>>> module swap craype-haswell craype-sandybridge
>>>>> module unload craype-hugepages16M
>>>>> module unload cray-mpich
>>>>> ```
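>>>>>
>>>>> For reference, a hypothetical configure invocation under these modules 
>>>>> might look like the following (the Cray compiler wrappers and the 
>>>>> install prefix are assumptions, not copied from my build; the exact 
>>>>> settings are in the attached config.log):
>>>>>
>>>>> ```
>>>>> ./configure CC=cc CXX=CC FC=ftn --prefix=$HOME/opt/openmpi-3.1.0
>>>>> make -j install
>>>>> ```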
>>>>> 
>>>>> I am using mpirun to launch the test code. Below is the BTL debug log 
>>>>> (tcp is disabled for clarity; enabling it makes no difference):
>>>>> 
>>>>> ```
>>>>> mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 ./mpi_test_loop
>>>>> [nid03060:36184] mca: base: components_register: registering framework btl components
>>>>> [nid03060:36184] mca: base: components_register: found loaded component self
>>>>> [nid03060:36184] mca: base: components_register: component self register function successful
>>>>> [nid03060:36184] mca: base: components_register: found loaded component sm
>>>>> [nid03061:36208] mca: base: components_register: registering framework btl components
>>>>> [nid03061:36208] mca: base: components_register: found loaded component self
>>>>> [nid03060:36184] mca: base: components_register: found loaded component ugni
>>>>> [nid03061:36208] mca: base: components_register: component self register function successful
>>>>> [nid03061:36208] mca: base: components_register: found loaded component sm
>>>>> [nid03061:36208] mca: base: components_register: found loaded component ugni
>>>>> [nid03060:36184] mca: base: components_register: component ugni register function successful
>>>>> [nid03060:36184] mca: base: components_register: found loaded component vader
>>>>> [nid03061:36208] mca: base: components_register: component ugni register function successful
>>>>> [nid03061:36208] mca: base: components_register: found loaded component vader
>>>>> [nid03060:36184] mca: base: components_register: component vader register function successful
>>>>> [nid03060:36184] mca: base: components_open: opening btl components
>>>>> [nid03060:36184] mca: base: components_open: found loaded component self
>>>>> [nid03060:36184] mca: base: components_open: component self open function successful
>>>>> [nid03060:36184] mca: base: components_open: found loaded component ugni
>>>>> [nid03060:36184] mca: base: components_open: component ugni open function successful
>>>>> [nid03060:36184] mca: base: components_open: found loaded component vader
>>>>> [nid03060:36184] mca: base: components_open: component vader open function successful
>>>>> [nid03060:36184] select: initializing btl component self
>>>>> [nid03060:36184] select: init of component self returned success
>>>>> [nid03060:36184] select: initializing btl component ugni
>>>>> [nid03061:36208] mca: base: components_register: component vader register function successful
>>>>> [nid03061:36208] mca: base: components_open: opening btl components
>>>>> [nid03061:36208] mca: base: components_open: found loaded component self
>>>>> [nid03061:36208] mca: base: components_open: component self open function successful
>>>>> [nid03061:36208] mca: base: components_open: found loaded component ugni
>>>>> [nid03061:36208] mca: base: components_open: component ugni open function successful
>>>>> [nid03061:36208] mca: base: components_open: found loaded component vader
>>>>> [nid03061:36208] mca: base: components_open: component vader open function successful
>>>>> [nid03061:36208] select: initializing btl component self
>>>>> [nid03061:36208] select: init of component self returned success
>>>>> [nid03061:36208] select: initializing btl component ugni
>>>>> [nid03061:36208] select: init of component ugni returned success
>>>>> [nid03061:36208] select: initializing btl component vader
>>>>> [nid03061:36208] select: init of component vader returned failure
>>>>> [nid03061:36208] mca: base: close: component vader closed
>>>>> [nid03061:36208] mca: base: close: unloading component vader
>>>>> [nid03060:36184] select: init of component ugni returned success
>>>>> [nid03060:36184] select: initializing btl component vader
>>>>> [nid03060:36184] select: init of component vader returned failure
>>>>> [nid03060:36184] mca: base: close: component vader closed
>>>>> [nid03060:36184] mca: base: close: unloading component vader
>>>>> [nid03061:36208] mca: bml: Using self btl for send to [[54630,1],1] on node nid03061
>>>>> [nid03060:36184] mca: bml: Using self btl for send to [[54630,1],0] on node nid03060
>>>>> [nid03061:36208] mca: bml: Using ugni btl for send to [[54630,1],0] on node (null)
>>>>> [nid03060:36184] mca: bml: Using ugni btl for send to [[54630,1],1] on node (null)
>>>>> ```
>>>>> 
>>>>> It looks like the ugni BTL initializes correctly but then fails to 
>>>>> resolve the node to communicate with (note the "(null)" node names in 
>>>>> the last two lines). Is there a way to get more information? There 
>>>>> doesn't seem to be an MCA parameter to increase the verbosity of the 
>>>>> ugni BTL specifically.
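>>>>>
>>>>> The closest I found was listing the component's registered parameters 
>>>>> (assuming ompi_info from the same installation):
>>>>>
>>>>> ```
>>>>> ompi_info --param btl ugni --level 9
>>>>> ```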
>>>>> 
>>>>> Any help would be appreciated!
>>>>> 
>>>>> Cheers
>>>>> Joseph
>>>>> <config.log.tgz>
