Thanks for confirming that it works for you as well. I have a PR open on v3.1.x that brings osc/rdma up to date with master. I will also be bringing in some code that greatly improves multi-threaded RMA performance on Aries systems (at least in benchmarks; see github.com/hpc/rma-mt). That will not make it into v3.1.x but will be in v4.0.0.
-Nathan

> On May 9, 2018, at 1:26 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>
> Nathan,
>
> Thank you, I can confirm that it works as expected with master on our
> system. I will stick to this version then until 3.1.1 is out.
>
> Joseph
>
> On 05/08/2018 05:34 PM, Nathan Hjelm wrote:
>> Looks like it doesn't fail with master, so at some point I fixed this
>> bug. The current plan is to bring all the master changes into v3.1.1.
>> This includes a number of bug fixes.
>>
>> -Nathan
>>
>> On May 08, 2018, at 08:25 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>> Nathan,
>>>
>>> Thanks for looking into that. My test program is attached.
>>>
>>> Best
>>> Joseph
>>>
>>> On 05/08/2018 02:56 PM, Nathan Hjelm wrote:
>>>> I will take a look today. Can you send me your test program?
>>>>
>>>> -Nathan
>>>>
>>>>> On May 8, 2018, at 2:49 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>>>
>>>>> All,
>>>>>
>>>>> I have been experimenting with Open MPI 3.1.0 on our Cray XC40
>>>>> (Haswell-based nodes, Aries interconnect) for multi-threaded MPI RMA.
>>>>> Unfortunately, a simple (single-threaded) test case consisting of two
>>>>> processes performing an MPI_Rget+MPI_Wait hangs when running on two
>>>>> nodes. It succeeds if both processes run on a single node.
>>>>>
>>>>> For completeness, I am attaching the config.log. The build environment
>>>>> was set up to build Open MPI for the login nodes (I wasn't sure how to
>>>>> properly cross-compile the libraries):
>>>>>
>>>>> ```
>>>>> # this seems necessary to avoid a linker error during the build
>>>>> export CRAYPE_LINK_TYPE=dynamic
>>>>> module swap PrgEnv-cray PrgEnv-intel
>>>>> module sw craype-haswell craype-sandybridge
>>>>> module unload craype-hugepages16M
>>>>> module unload cray-mpich
>>>>> ```
>>>>>
>>>>> I am using mpirun to launch the test code. Below is the BTL debug log
>>>>> (with tcp disabled for clarity; turning it on makes no difference):
>>>>>
>>>>> ```
>>>>> mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 ./mpi_test_loop
>>>>> [nid03060:36184] mca: base: components_register: registering framework btl components
>>>>> [nid03060:36184] mca: base: components_register: found loaded component self
>>>>> [nid03060:36184] mca: base: components_register: component self register function successful
>>>>> [nid03060:36184] mca: base: components_register: found loaded component sm
>>>>> [nid03061:36208] mca: base: components_register: registering framework btl components
>>>>> [nid03061:36208] mca: base: components_register: found loaded component self
>>>>> [nid03060:36184] mca: base: components_register: found loaded component ugni
>>>>> [nid03061:36208] mca: base: components_register: component self register function successful
>>>>> [nid03061:36208] mca: base: components_register: found loaded component sm
>>>>> [nid03061:36208] mca: base: components_register: found loaded component ugni
>>>>> [nid03060:36184] mca: base: components_register: component ugni register function successful
>>>>> [nid03060:36184] mca: base: components_register: found loaded component vader
>>>>> [nid03061:36208] mca: base: components_register: component ugni register function successful
>>>>> [nid03061:36208] mca: base: components_register: found loaded component vader
>>>>> [nid03060:36184] mca: base: components_register: component vader register function successful
>>>>> [nid03060:36184] mca: base: components_open: opening btl components
>>>>> [nid03060:36184] mca: base: components_open: found loaded component self
>>>>> [nid03060:36184] mca: base: components_open: component self open function successful
>>>>> [nid03060:36184] mca: base: components_open: found loaded component ugni
>>>>> [nid03060:36184] mca: base: components_open: component ugni open function successful
>>>>> [nid03060:36184] mca: base: components_open: found loaded component vader
>>>>> [nid03060:36184] mca: base: components_open: component vader open function successful
>>>>> [nid03060:36184] select: initializing btl component self
>>>>> [nid03060:36184] select: init of component self returned success
>>>>> [nid03060:36184] select: initializing btl component ugni
>>>>> [nid03061:36208] mca: base: components_register: component vader register function successful
>>>>> [nid03061:36208] mca: base: components_open: opening btl components
>>>>> [nid03061:36208] mca: base: components_open: found loaded component self
>>>>> [nid03061:36208] mca: base: components_open: component self open function successful
>>>>> [nid03061:36208] mca: base: components_open: found loaded component ugni
>>>>> [nid03061:36208] mca: base: components_open: component ugni open function successful
>>>>> [nid03061:36208] mca: base: components_open: found loaded component vader
>>>>> [nid03061:36208] mca: base: components_open: component vader open function successful
>>>>> [nid03061:36208] select: initializing btl component self
>>>>> [nid03061:36208] select: init of component self returned success
>>>>> [nid03061:36208] select: initializing btl component ugni
>>>>> [nid03061:36208] select: init of component ugni returned success
>>>>> [nid03061:36208] select: initializing btl component vader
>>>>> [nid03061:36208] select: init of component vader returned failure
>>>>> [nid03061:36208] mca: base: close: component vader closed
>>>>> [nid03061:36208] mca: base: close: unloading component vader
>>>>> [nid03060:36184] select: init of component ugni returned success
>>>>> [nid03060:36184] select: initializing btl component vader
>>>>> [nid03060:36184] select: init of component vader returned failure
>>>>> [nid03060:36184] mca: base: close: component vader closed
>>>>> [nid03060:36184] mca: base: close: unloading component vader
>>>>> [nid03061:36208] mca: bml: Using self btl for send to [[54630,1],1] on node nid03061
>>>>> [nid03060:36184] mca: bml: Using self btl for send to [[54630,1],0] on node nid03060
>>>>> [nid03061:36208] mca: bml: Using ugni btl for send to [[54630,1],0] on node (null)
>>>>> [nid03060:36184] mca: bml: Using ugni btl for send to [[54630,1],1] on node (null)
>>>>> ```
>>>>>
>>>>> It looks like the ugni btl is being initialized correctly but then
>>>>> fails to find the node to communicate with. Is there a way to get more
>>>>> information? There doesn't seem to be an MCA parameter to increase the
>>>>> verbosity of the ugni btl specifically.
>>>>>
>>>>> Any help would be appreciated!
>>>>>
>>>>> Cheers
>>>>> Joseph
>>>>> <config.log.tgz>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users@lists.open-mpi.org
>>>>> https://lists.open-mpi.org/mailman/listinfo/users
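Joseph's actual test program is attached to the original message and not reproduced in the thread. As a point of reference, a minimal sketch consistent with his description (two ranks, a passive-target window, a single MPI_Rget completed by MPI_Wait) might look like the following; this is an assumed reconstruction, not the real attachment, and the use of MPI_Win_allocate and MPI_Win_lock_all is one plausible choice among several:

```c
/* Hypothetical reproducer: two ranks each expose one int in an RMA
 * window; each rank fetches the other's value with MPI_Rget and
 * completes the request with MPI_Wait. In the reported setup this
 * pattern hangs when the ranks are on different nodes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0)
            fprintf(stderr, "run with exactly 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Window memory allocated by MPI itself. */
    int *baseptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &baseptr, &win);
    *baseptr = rank;

    /* Passive-target epoch covering both ranks. */
    MPI_Win_lock_all(0, win);

    int result = -1;
    MPI_Request req;
    /* Request-based get from the peer rank, completed locally
     * with MPI_Wait -- the combination reported to hang. */
    MPI_Rget(&result, 1, MPI_INT, (rank + 1) % 2, 0, 1, MPI_INT,
             win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d read %d\n", rank, result);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Run with e.g. `mpirun -n 2 -N 1 ./a.out` to place the two ranks on separate nodes, matching the failing case above.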