Hi Brian,

Thanks for the info. I'm not sure I quite get the response, though. Is the
race condition in the way the Open MPI Portals4 MTL is using Portals, or is
it a problem in the Portals implementation itself?

Howard


2018-02-08 9:20 GMT-07:00 D. Brian Larkins <brianlark...@gmail.com>:

> Howard,
>
> Looks like ob1 is working fine. When I looked into the problems with ob1,
> it looked like the progress thread was polling the Portals event queue
> before it had been initialized.
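>
> To make the ordering concrete, here is a minimal standalone sketch (made-up
> names, not the actual Open MPI or Portals4 code) of the guard I would expect:
> the progress thread waits for an "event queue initialized" flag before it
> starts polling, instead of racing with the setup on the main thread.
>
> /* build with: gcc -pthread eq_guard.c -o eq_guard */
> #include <pthread.h>
> #include <stdbool.h>
> #include <stdio.h>
> #include <unistd.h>
>
> static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
> static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
> static bool eq_initialized   = false;  /* stands in for "EQ allocation done" */
>
> static void *progress_thread(void *arg)
> {
>     (void)arg;
>     /* Block until the main thread has published the event queue handle;
>        polling before this point is the race described above. */
>     pthread_mutex_lock(&lock);
>     while (!eq_initialized)
>         pthread_cond_wait(&ready, &lock);
>     pthread_mutex_unlock(&lock);
>
>     puts("progress thread: now safe to enter the event queue polling loop");
>     return NULL;
> }
>
> int main(void)
> {
>     pthread_t tid;
>     pthread_create(&tid, NULL, progress_thread, NULL);
>
>     sleep(1);  /* stand-in for network interface and event queue setup */
>
>     pthread_mutex_lock(&lock);
>     eq_initialized = true;           /* publish "the EQ exists" */
>     pthread_cond_signal(&ready);
>     pthread_mutex_unlock(&lock);
>
>     pthread_join(tid, NULL);
>     return 0;
> }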
>
> b.
>
> $ mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib osu_latency
> WARNING: Ummunotify not found: Not using ummunotify can result in
> incorrect results download and install ummunotify from:
>  http://support.systemfabricworks.com/downloads/ummunotify/ummunotify-v2.tar.bz2
> WARNING: Ummunotify not found: Not using ummunotify can result in
> incorrect results download and install ummunotify from:
>  http://support.systemfabricworks.com/downloads/ummunotify/ummunotify-v2.tar.bz2
> # OSU MPI Latency Test
> # Size            Latency (us)
> 0                         1.87
> 1                         1.93
> 2                         1.90
> 4                         1.94
> 8                         1.94
> 16                        1.96
> 32                        1.97
> 64                        1.99
> 128                       2.43
> 256                       2.50
> 512                       2.71
> 1024                      3.01
> 2048                      3.45
> 4096                      4.56
> 8192                      6.39
> 16384                     8.79
> 32768                    11.50
> 65536                    16.59
> 131072                   27.10
> 262144                   46.97
> 524288                   87.55
> 1048576                 168.89
> 2097152                 331.40
> 4194304                 654.08
>
>
> On Feb 7, 2018, at 9:04 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Brian,
>
> As a sanity check, can you see if the ob1 pml works okay, i.e.
>
>  mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency
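>
> If it is unclear which pml actually gets selected, bumping the pml framework
> verbosity should print the selection (going from memory on the exact level,
> so treat this as a suggestion rather than the definitive flag):
>
>  mpirun -n 2 --mca pml ob1 --mca pml_base_verbose 10 --mca btl self,vader,openib ./osu_latency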
>
> Howard
>
>
> 2018-02-07 11:03 GMT-07:00 brian larkins <brianlark...@gmail.com>:
>
>> Hello,
>>
>> I’m doing some work with Portals4 and am trying to run some MPI programs
>> using Portals4 as the transport layer. I’m running into problems and am
>> hoping that someone can help me figure out how to get things working.
>> I’m using Open MPI 3.0.0 with the following configuration:
>>
>> ./configure CFLAGS=-pipe --prefix=path/to/install --enable-picky
>> --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4
>> --disable-oshmem --disable-vt --disable-java --disable-mpi-io
>> --disable-io-romio --disable-libompitrace --disable-btl-portals4-flow-control
>> --disable-mtl-portals4-flow-control
>>
>> I have also tried the head of the git repo and 2.1.2, with the same
>> results. A simpler configure line (with just --prefix and --with-portals4=)
>> also gets the same results.
>>
>> Portals4 is built from GitHub master and configured thus:
>>
>> ./configure --prefix=path/to/portals4 --with-ev=path/to/libev
>> --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered
>>
>> If I specify the cm pml on the command-line, I can get examples/hello_c
>> to run correctly. Trying to get some latency numbers using the OSU
>> benchmarks is where my trouble begins:
>>
>> $ mpirun -n 2 --mca mtl portals4 --mca pml cm env PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency
>> NOTE: Ummunotify and IB registered mem cache disabled, set
>> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
>> NOTE: Ummunotify and IB registered mem cache disabled, set
>> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
>> # OSU MPI Latency Test
>> # Size            Latency (us)
>> 0                        25.96
>> [node41:19740] *** An error occurred in MPI_Barrier
>> [node41:19740] *** reported by process [139815819542529,4294967297]
>> [node41:19740] *** on communicator MPI_COMM_WORLD
>> [node41:19740] *** MPI_ERR_OTHER: known error not in list
>> [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>> will now abort,
>> [node41:19740] ***    and potentially your MPI job)
>>
>> Not specifying cm segfaults even earlier (the default is ob1), and that
>> looks like a progress thread initialization problem.
>> Using PTL_IGNORE_UMMUNOTIFY=1 gets this far:
>>
>> $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency
>> # OSU MPI Latency Test
>> # Size            Latency (us)
>> 0                        24.14
>> 1                        26.24
>> [node41:19993] *** Process received signal ***
>> [node41:19993] Signal: Segmentation fault (11)
>> [node41:19993] Signal code: Address not mapped (1)
>> [node41:19993] Failing at address: 0x141
>> [node41:19993] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fa6ac73b710]
>> [node41:19993] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(+0xcd65)[0x7fa69b770d65]
>> [node41:19993] [ 2] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(PtlPut+0x143)[0x7fa69b773fb3]
>> [node41:19993] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xa961)[0x7fa698cf5961]
>> [node41:19993] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xb0e5)[0x7fa698cf60e5]
>> [node41:19993] [ 5] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(ompi_mtl_portals4_send+0x90)[0x7fa698cf61d1]
>> [node41:19993] [ 6] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_pml_cm.so(+0x5430)[0x7fa69a794430]
>> [node41:19993] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(PMPI_Send+0x2b4)[0x7fa6ac9ff018]
>> [node41:19993] [ 8] ./osu_latency[0x40106f]
>> [node41:19993] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa6ac3b6d5d]
>> [node41:19993] [10] ./osu_latency[0x400c59]
>>
>> This cluster is running RHEL 6.5 without the ummunotify module, but I get
>> the same results on a local (small) cluster running Ubuntu 16.04 with
>> ummunotify loaded.
>>
>> Any help would be much appreciated.
>> thanks,
>>
>> brian.
>>
>>
>
>
> --
> D. Brian Larkins
> Assistant Professor of Computer Science
> Rhodes College
>
>