Hi Brian,

Thanks for the info. I'm not sure I quite follow the response, though: is the race condition in the way the Open MPI Portals4 MTL is using Portals, or is it a problem in the Portals implementation itself?

Howard
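The ordering bug described below (a progress thread polling the Portals event queue before it has been initialized) has roughly the shape sketched here. This is a minimal sketch with hypothetical names, not the actual Open MPI ob1/Portals4 code; poll_event_queue() merely stands in for the real Portals EQ poll.

/* Sketch only: hypothetical names, not Open MPI source.  Shows a progress
 * thread that can start polling before the event queue exists, and the
 * "ready" flag that would close that window. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static void *event_queue = NULL;            /* set up during init            */
static atomic_bool eq_ready = false;        /* published only after setup    */
static atomic_bool shutting_down = false;

static void poll_event_queue(void *eq)
{
    (void)eq;                               /* placeholder for the EQ poll   */
}

static void *progress_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&shutting_down)) {
        /* Without this check the loop can run before initialization has
         * finished and poll a NULL/garbage handle: the suspected race.   */
        if (atomic_load(&eq_ready))
            poll_event_queue(event_queue);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;

    /* Starting the progress thread before the event queue exists is the
     * hazardous ordering; publishing eq_ready only after setup avoids it. */
    pthread_create(&tid, NULL, progress_thread, NULL);

    event_queue = &event_queue;             /* stand-in for EQ allocation    */
    atomic_store(&eq_ready, true);

    sleep(1);                               /* let the thread run briefly    */
    atomic_store(&shutting_down, true);
    pthread_join(tid, NULL);
    puts("done");
    return 0;
}

If the poll can run before the queue handle is published, the thread touches an uninitialized handle, which would be consistent with the early segfault seen when the run defaults to ob1.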
2018-02-08 9:20 GMT-07:00 D. Brian Larkins <brianlark...@gmail.com>:

> Howard,
>
> Looks like ob1 is working fine. When I looked into the problems with ob1,
> it looked like the progress thread was polling the Portals event queue
> before it had been initialized.
>
> b.
>
> $ mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib osu_latency
> WARNING: Ummunotify not found: Not using ummunotify can result in
> incorrect results
> download and install ummunotify from:
> http://support.systemfabricworks.com/downloads/ummunotify/ummunotify-v2.tar.bz2
> WARNING: Ummunotify not found: Not using ummunotify can result in
> incorrect results
> download and install ummunotify from:
> http://support.systemfabricworks.com/downloads/ummunotify/ummunotify-v2.tar.bz2
> # OSU MPI Latency Test
> # Size          Latency (us)
> 0                       1.87
> 1                       1.93
> 2                       1.90
> 4                       1.94
> 8                       1.94
> 16                      1.96
> 32                      1.97
> 64                      1.99
> 128                     2.43
> 256                     2.50
> 512                     2.71
> 1024                    3.01
> 2048                    3.45
> 4096                    4.56
> 8192                    6.39
> 16384                   8.79
> 32768                  11.50
> 65536                  16.59
> 131072                 27.10
> 262144                 46.97
> 524288                 87.55
> 1048576               168.89
> 2097152               331.40
> 4194304               654.08
>
>
> On Feb 7, 2018, at 9:04 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Brian,
>
> As a sanity check, can you see if the ob1 pml works okay, i.e.
>
> mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency
>
> Howard
>
>
> 2018-02-07 11:03 GMT-07:00 brian larkins <brianlark...@gmail.com>:
>
>> Hello,
>>
>> I'm doing some work with Portals4 and am trying to run some MPI programs
>> using Portals 4 as the transport layer. I'm running into problems and am
>> hoping that someone can help me figure out how to get things working.
>> I'm using Open MPI 3.0.0 with the following configuration:
>>
>> ./configure CFLAGS=-pipe --prefix=path/to/install --enable-picky
>> --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4
>> --disable-oshmem --disable-vt --disable-java --disable-mpi-io
>> --disable-io-romio --disable-libompitrace
>> --disable-btl-portals4-flow-control --disable-mtl-portals4-flow-control
>>
>> I have also tried the head of the git repo and 2.1.2 with the same
>> results. A simpler configure line (with just --prefix and --with-portals4=)
>> also gets the same results.
>>
>> Portals4 is built from GitHub master and configured thus:
>>
>> ./configure --prefix=path/to/portals4 --with-ev=path/to/libev
>> --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered
>>
>> If I specify the cm pml on the command line, I can get examples/hello_c
>> to run correctly. Trying to get some latency numbers using the OSU
>> benchmarks is where my trouble begins:
>>
>> $ mpirun -n 2 --mca mtl portals4 --mca pml cm env
>> PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency
>> NOTE: Ummunotify and IB registered mem cache disabled, set
>> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
>> NOTE: Ummunotify and IB registered mem cache disabled, set
>> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
>> # OSU MPI Latency Test
>> # Size          Latency (us)
>> 0                      25.96
>> [node41:19740] *** An error occurred in MPI_Barrier
>> [node41:19740] *** reported by process [139815819542529,4294967297]
>> [node41:19740] *** on communicator MPI_COMM_WORLD
>> [node41:19740] *** MPI_ERR_OTHER: known error not in list
>> [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>> will now abort,
>> [node41:19740] ***    and potentially your MPI job)
>>
>> Not specifying cm gets an earlier segfault (it defaults to ob1), which
>> looks to be a progress-thread initialization problem.
>>
>> Using PTL_IGNORE_UMMUNOTIFY=1 gets here:
>>
>> $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency
>> # OSU MPI Latency Test
>> # Size          Latency (us)
>> 0                      24.14
>> 1                      26.24
>> [node41:19993] *** Process received signal ***
>> [node41:19993] Signal: Segmentation fault (11)
>> [node41:19993] Signal code: Address not mapped (1)
>> [node41:19993] Failing at address: 0x141
>> [node41:19993] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fa6ac73b710]
>> [node41:19993] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(+0xcd65)[0x7fa69b770d65]
>> [node41:19993] [ 2] /ascldap/users/dblarki/opt/portals4.master/lib/libportals.so.4(PtlPut+0x143)[0x7fa69b773fb3]
>> [node41:19993] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xa961)[0x7fa698cf5961]
>> [node41:19993] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(+0xb0e5)[0x7fa698cf60e5]
>> [node41:19993] [ 5] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_portals4.so(ompi_mtl_portals4_send+0x90)[0x7fa698cf61d1]
>> [node41:19993] [ 6] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_pml_cm.so(+0x5430)[0x7fa69a794430]
>> [node41:19993] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(PMPI_Send+0x2b4)[0x7fa6ac9ff018]
>> [node41:19993] [ 8] ./osu_latency[0x40106f]
>> [node41:19993] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa6ac3b6d5d]
>> [node41:19993] [10] ./osu_latency[0x400c59]
>>
>> This cluster is running RHEL 6.5 without the ummunotify modules, but I get
>> the same results on a local (small) cluster running Ubuntu 16.04 with
>> ummunotify loaded.
>>
>> Any help would be much appreciated.
>> thanks,
>>
>> brian.
>>
>
> --
> D. Brian Larkins
> Assistant Professor of Computer Science
> Rhodes College
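For reference on the call at frame 2 of that backtrace, here is a hedged sketch of how a PtlPut is typically issued and its return value checked. It is not the Open MPI mtl_portals4 code: send_fragment is a made-up name, and the MD handle and addressing arguments are assumed to come from earlier Portals setup (PtlInit, PtlNIInit, PtlMDBind, and so on).

/* Sketch of a PtlPut call under the assumptions above; not Open MPI code. */
#include <portals4.h>
#include <stdio.h>

int send_fragment(ptl_handle_md_t md_handle, ptl_size_t length,
                  ptl_process_t target, ptl_pt_index_t pt_index,
                  ptl_match_bits_t match_bits)
{
    int ret = PtlPut(md_handle,      /* memory descriptor covering the send buffer */
                     0,              /* offset into that MD                        */
                     length,         /* bytes to send                              */
                     PTL_ACK_REQ,    /* ask for an acknowledgement event           */
                     target,         /* destination process                        */
                     pt_index,       /* portal table entry at the target           */
                     match_bits,     /* matching bits for the target's list entry  */
                     0,              /* offset at the target                       */
                     NULL,           /* user_ptr echoed back in events             */
                     0);             /* out-of-band header data                    */
    if (ret != PTL_OK) {
        fprintf(stderr, "PtlPut returned %d\n", ret);
        return -1;
    }
    return 0;
}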
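One way to see more detail than the MPI_ERRORS_ARE_FATAL abort above is to switch MPI_COMM_WORLD to MPI_ERRORS_RETURN so the failing call returns an error code instead of killing the job. A minimal sketch, unrelated to the osu_latency source itself:

/* Sketch: return MPI errors instead of aborting, and print the error text. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Barrier failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}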