Gilles, thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps indeed.

The default seems to be 'cma', which uses process_vm_readv() and process_vm_writev(). Those calls seem to require CAP_SYS_PTRACE, but telling Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE' does not seem to be enough. Not sure yet whether this is related to the fact that Podman is running rootless.
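To check where exactly this breaks I put together a tiny standalone reproducer for the CMA path (just an untested sketch, the file name and usage are my own invention and I have not run it inside the container yet). It does a single process_vm_readv() against a given PID and address, which is roughly the primitive vader's 'cma' single-copy mechanism uses; if the kernel's ptrace access check is what blocks it, it should fail with EPERM (errno 1), matching the 'Read -1, expected 80000, errno = 1' messages further down in this thread.

/* cma_test.c - minimal check whether a cross-process read via CMA works.
 *
 * Usage: ./cma_test <pid> <remote-addr-hex>
 * (the address must point to readable memory in the target process,
 *  e.g. taken from /proc/<pid>/maps)
 */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>   /* process_vm_readv() */

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <remote-addr-hex>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t) strtol(argv[1], NULL, 10);
    void *remote_addr = (void *) strtoul(argv[2], NULL, 16);

    char buf[64];
    struct iovec local  = { .iov_base = buf,         .iov_len = sizeof(buf) };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof(buf) };

    /* One cross-process read, the same primitive vader's 'cma' mechanism
     * relies on. The kernel performs a ptrace access-mode check here. */
    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
    if (n < 0) {
        /* EPERM (errno 1) would match the "Read -1, ..., errno = 1"
         * output seen from vader when the access check fails. */
        fprintf(stderr, "process_vm_readv: %s (errno %d)\n",
                strerror(errno), errno);
        return 1;
    }

    printf("read %zd bytes from pid %ld\n", n, (long) pid);
    return 0;
}

Compiled with 'gcc -Wall -o cma_test cma_test.c' and fed a PID plus an address from /proc/<pid>/maps, running it once on the host and once between two containers should show whether the containerized case is the one that gets EPERM.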
I will continue to investigate, but now I know where to look. Thanks!

		Adrian

On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote:
> Adrian,
>
> Can you try
> mpirun --mca btl_vader_copy_mechanism none ...
>
> Please double check the MCA parameter name, I am AFK
>
> IIRC, the default copy mechanism used by vader directly accesses the remote
> process address space, and this requires some permission (ptrace?) that might
> be dropped by podman.
>
> Note Open MPI might not detect both MPI tasks run on the same node because of
> podman.
> If you use UCX, then btl/vader is not used at all (pml/ucx is used instead)
>
> Cheers,
>
> Gilles
>
> Sent from my iPod
>
> > On Jul 12, 2019, at 18:33, Adrian Reber via users <users@lists.open-mpi.org> wrote:
> >
> > So upstream Podman was really fast and merged a PR which makes my
> > wrapper unnecessary:
> >
> > Add support for --env-host : https://github.com/containers/libpod/pull/3557
> >
> > As commented in the PR I can now start mpirun with Podman without a
> > wrapper:
> >
> > $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun \
> >     podman run --env-host --security-opt label=disable \
> >     -v /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host \
> >     mpi-test /home/mpi/ring
> > Rank 0 has cleared MPI_Init
> > Rank 1 has cleared MPI_Init
> > Rank 0 has completed ring
> > Rank 0 has completed MPI_Barrier
> > Rank 1 has completed ring
> > Rank 1 has completed MPI_Barrier
> >
> > This example was using TCP; on an InfiniBand based system I have to map
> > the InfiniBand devices into the container:
> >
> > $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun \
> >     podman run --env-host -v /tmp/podman-mpirun:/tmp/podman-mpirun \
> >     --security-opt label=disable --userns=keep-id \
> >     --device /dev/infiniband/uverbs0 --device /dev/infiniband/umad0 \
> >     --device /dev/infiniband/rdma_cm --net=host mpi-test /home/mpi/ring
> > Rank 0 has cleared MPI_Init
> > Rank 1 has cleared MPI_Init
> > Rank 0 has completed ring
> > Rank 0 has completed MPI_Barrier
> > Rank 1 has completed ring
> > Rank 1 has completed MPI_Barrier
> >
> > This is all running without root and only using Podman's rootless
> > support.
> >
> > Running multiple processes on one system, however, still gives me an
> > error. If I disable vader I guess that Open MPI is using TCP for
> > localhost communication and that works. But with vader it fails.
> >
> > The first error message I get is a segfault:
> >
> > [test1:00001] *** Process received signal ***
> > [test1:00001] Signal: Segmentation fault (11)
> > [test1:00001] Signal code: Address not mapped (1)
> > [test1:00001] Failing at address: 0x7fb7b1552010
> > [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
> > [test1:00001] [ 1] /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
> > [test1:00001] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
> > [test1:00001] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
> > [test1:00001] [ 4] /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
> > [test1:00001] [ 5] /home/mpi/ring[0x400b76]
> > [test1:00001] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
> > [test1:00001] [ 7] /home/mpi/ring[0x4008be]
> > [test1:00001] *** End of error message ***
> >
> > Guessing that vader uses shared memory, this is expected to fail with
> > all the namespace isolations in place. Maybe not with a segfault, but
> > each container has its own shared memory. So the next step was to use
> > the host's IPC and PID namespaces and to mount /dev/shm:
> >
> > '-v /dev/shm:/dev/shm --ipc=host --pid=host'
> >
> > Which does not segfault, but still does not look correct:
> >
> > Rank 0 has cleared MPI_Init
> > Rank 1 has cleared MPI_Init
> > Rank 2 has cleared MPI_Init
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > Rank 0 has completed ring
> > Rank 2 has completed ring
> > Rank 0 has completed MPI_Barrier
> > Rank 1 has completed ring
> > Rank 2 has completed MPI_Barrier
> > Rank 1 has completed MPI_Barrier
> >
> > This is using the Open MPI ring.c example with SIZE increased from 20 to
> > 20000.
> >
> > Any recommendations on what vader needs to communicate correctly?
> >
> > Adrian
> >
> >> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
> >> Gilles,
> >>
> >> thanks for pointing out the environment variables. I quickly created a
> >> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
> >> (grep "\(PMIX\|OMPI\)"). Now it works:
> >>
> >> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id \
> >>     --net=host mpi-test /home/mpi/hello
> >>
> >> Hello, world (2 procs total)
> >> --> Process # 0 of 2 is alive. ->test1
> >> --> Process # 1 of 2 is alive. ->test2
> >>
> >> I need to tell Podman to mount /tmp from the host into the container. As
> >> I am running rootless, I also need to tell Podman to use the same user ID
> >> in the container as outside (so that the Open MPI files in /tmp can be
> >> shared), and I am also running without a network namespace.
> >>
> >> So this is now with the full isolation Podman provides, except for the
> >> network namespace. Thanks for your help!
> >>
> >> Adrian
> >>
> >>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users wrote:
> >>> Adrian,
> >>>
> >>> the MPI application relies on some environment variables (they typically
> >>> start with OMPI_ and PMIX_).
> >>>
> >>> The MPI application internally uses a PMIx client that must be able to
> >>> contact a PMIx server located on the same host (the server is included
> >>> in mpirun and in the orted daemon(s) spawned on the remote hosts).
> >>>
> >>> If podman provides some isolation between the app inside the container
> >>> (e.g. /home/mpi/hello) and the outside world (e.g. mpirun/orted), that
> >>> won't be an easy ride.
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
> >>>> I did a quick test to see if I can use Podman in combination with Open
> >>>> MPI:
> >>>>
> >>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run \
> >>>>     quay.io/adrianreber/mpi-test /home/mpi/hello
> >>>>
> >>>> Hello, world (1 procs total)
> >>>> --> Process # 0 of 1 is alive. ->789b8fb622ef
> >>>>
> >>>> Hello, world (1 procs total)
> >>>> --> Process # 0 of 1 is alive. ->749eb4e1c01a
> >>>>
> >>>> The test program (hello) is taken from
> >>>> https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
> >>>>
> >>>> The problem with this is that each process thinks it is process 0 of 1
> >>>> instead of
> >>>>
> >>>> Hello, world (2 procs total)
> >>>> --> Process # 1 of 2 is alive. ->test1
> >>>> --> Process # 0 of 2 is alive. ->test2
> >>>>
> >>>> My question is: how is the rank determined? What resources do I need to
> >>>> have in my container to correctly determine the rank?
> >>>>
> >>>> This is Podman 1.4.2 and Open MPI 4.0.1.
> >>>>
> >>>> Adrian
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users