Gilles,

thanks again. Adding '--mca btl_vader_single_copy_mechanism none' does
indeed help.

The default single-copy mechanism seems to be 'cma', which uses
process_vm_readv() and process_vm_writev(). Those calls seem to require
CAP_SYS_PTRACE, but telling Podman to give the process CAP_SYS_PTRACE with
'--cap-add=SYS_PTRACE' does not seem to be enough. I am not sure yet whether
this is related to the fact that Podman is running rootless. I will continue
to investigate, but now I know where to look. Thanks!
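
For anyone who wants to poke at this outside of Open MPI: the following small
stand-alone probe is just a sketch (not code taken from Open MPI), but it
attempts the same kind of process_vm_readv() call that the 'cma' mechanism
relies on. Start one copy without arguments, note the PID and address it
prints, then run a second copy with those two values (for example from a
second container) to see whether the cross-process read is permitted or
fails with EPERM:

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    static char secret[] = "cma-probe";

    if (argc < 3) {
        /* Target mode: print where the data lives, then wait to be read. */
        printf("run the second copy as: %s %d %p\n",
               argv[0], (int)getpid(), (void *)secret);
        fflush(stdout);
        sleep(60);
        return 0;
    }

    /* Reader mode: try a cross-process read, as vader's cma mechanism would. */
    pid_t target = (pid_t)atoi(argv[1]);
    void *addr = (void *)strtoul(argv[2], NULL, 0);
    char out[sizeof(secret)] = { 0 };
    struct iovec local  = { .iov_base = out,  .iov_len = sizeof(out) };
    struct iovec remote = { .iov_base = addr, .iov_len = sizeof(out) };

    ssize_t n = process_vm_readv(target, &local, 1, &remote, 1, 0);
    if (n < 0)
        printf("process_vm_readv() failed: %s\n", strerror(errno));
    else
        printf("read %zd bytes: \"%.*s\"\n", n, (int)n, out);
    return 0;
}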

                Adrian

On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users wrote:
> Adrian,
> 
> Can you try
> mpirun --mca btl_vader_copy_mechanism none ...
> 
> Please double check the MCA parameter name, I am AFK
> 
> IIRC, the default copy mechanism used by vader directly accesses the remote 
> process address space, and this requires some permission (ptrace?) that might 
> be dropped by podman.
> 
> Note that Open MPI might not detect that both MPI tasks run on the same
> node because of podman.
> If you use UCX, then btl/vader is not used at all (pml/ucx is used instead)
> 
> 
> Cheers,
> 
> Gilles
> 
> Sent from my iPod
> 
> > On Jul 12, 2019, at 18:33, Adrian Reber via users 
> > <users@lists.open-mpi.org> wrote:
> > 
> > So upstream Podman was really fast and merged a PR which makes my
> > wrapper unnecessary:
> > 
> > Add support for --env-host : https://github.com/containers/libpod/pull/3557
> > 
> > As commented in the PR I can now start mpirun with Podman without a
> > wrapper:
> > 
> > $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun 
> > podman run --env-host --security-opt label=disable -v 
> > /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test 
> > /home/mpi/ring
> > Rank 0 has cleared MPI_Init
> > Rank 1 has cleared MPI_Init
> > Rank 0 has completed ring
> > Rank 0 has completed MPI_Barrier
> > Rank 1 has completed ring
> > Rank 1 has completed MPI_Barrier
> > 
> > This example was using TCP; on an InfiniBand-based system I have to
> > map the InfiniBand devices into the container.
> > 
> > $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base 
> > /tmp/podman-mpirun podman run --env-host -v 
> > /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable 
> > --userns=keep-id --device /dev/infiniband/uverbs0 --device 
> > /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test 
> > /home/mpi/ring
> > Rank 0 has cleared MPI_Init
> > Rank 1 has cleared MPI_Init
> > Rank 0 has completed ring
> > Rank 0 has completed MPI_Barrier
> > Rank 1 has completed ring
> > Rank 1 has completed MPI_Barrier
> > 
> > This is all running without root and only using Podman's rootless
> > support.
> > 
> > Running multiple processes on one system, however, still gives me an
> > error. If I disable vader, Open MPI presumably falls back to TCP for
> > localhost communication, and that works. With vader it fails.
> > 
> > The first error message I get is a segfault:
> > 
> > [test1:00001] *** Process received signal ***
> > [test1:00001] Signal: Segmentation fault (11)
> > [test1:00001] Signal code: Address not mapped (1)
> > [test1:00001] Failing at address: 0x7fb7b1552010
> > [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
> > [test1:00001] [ 1] 
> > /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
> > [test1:00001] [ 2] 
> > /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
> > [test1:00001] [ 3] 
> > /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
> > [test1:00001] [ 4] 
> > /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
> > [test1:00001] [ 5] /home/mpi/ring[0x400b76]
> > [test1:00001] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
> > [test1:00001] [ 7] /home/mpi/ring[0x4008be]
> > [test1:00001] *** End of error message ***
> > 
> > Guessing that vader uses shared memory, this is expected to fail with all
> > the namespace isolations in place (maybe not with a segfault, but each
> > container has its own shared memory). So the next step was to use the
> > host's IPC and PID namespaces and to mount /dev/shm:
> > 
> > '-v /dev/shm:/dev/shm --ipc=host --pid=host'
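> >
> > (As a side note: to check whether two containers really end up with the
> > same /dev/shm, a small test like the one below can help. This is only a
> > sketch and not how vader sets up its memory internally. Run it with the
> > argument 'create' in one container, then without arguments in the other;
> > if the second copy cannot open or read the segment, the containers still
> > have separate /dev/shm mounts.)
> >
> > /* May need -lrt when compiled against older glibc versions. */
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <sys/mman.h>
> > #include <sys/stat.h>
> > #include <unistd.h>
> >
> > #define SHM_NAME "/shm-visibility-test" /* i.e. /dev/shm/shm-visibility-test */
> >
> > int main(int argc, char **argv)
> > {
> >     int create = (argc > 1 && strcmp(argv[1], "create") == 0);
> >     int fd = shm_open(SHM_NAME, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
> >     if (fd < 0) { perror("shm_open"); return 1; }
> >     if (create && ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }
> >
> >     char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> >     if (p == MAP_FAILED) { perror("mmap"); return 1; }
> >
> >     if (create) {
> >         snprintf(p, 4096, "written by pid %d", (int)getpid());
> >         printf("wrote: %s\n", p);
> >         sleep(60);               /* keep the segment alive for the reader */
> >         shm_unlink(SHM_NAME);
> >     } else {
> >         printf("read:  %s\n", p);
> >     }
> >     return 0;
> > }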
> > 
> > Running with those options does not segfault, but the result still does
> > not look correct:
> > 
> > Rank 0 has cleared MPI_Init
> > Rank 1 has cleared MPI_Init
> > Rank 2 has cleared MPI_Init
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > [test1:17722] Read -1, expected 80000, errno = 1
> > Rank 0 has completed ring
> > Rank 2 has completed ring
> > Rank 0 has completed MPI_Barrier
> > Rank 1 has completed ring
> > Rank 2 has completed MPI_Barrier
> > Rank 1 has completed MPI_Barrier
> > 
> > This is using the Open MPI ring.c example with SIZE increased from 20 to 
> > 20000.
> > 
> > Any recommendations on what vader needs in order to communicate correctly?
> > 
> >        Adrian
> > 
> >> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
> >> Gilles,
> >> 
> >> thanks for pointing out the environment variables. I quickly created a
> >> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
> >> (grep "\(PMIX\|OMPI\)"). Now it works:
> >> 
> >> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id 
> >> --net=host mpi-test /home/mpi/hello
> >> 
> >> Hello, world (2 procs total)
> >>    --> Process #   0 of   2 is alive. ->test1
> >>    --> Process #   1 of   2 is alive. ->test2
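> >>
> >> The wrapper itself is only a few lines of shell around that grep. As a
> >> rough C sketch of the same idea (not the actual script I used): exec
> >> "podman run" and forward every OMPI_* / PMIX_* variable with -e before
> >> the caller's own podman arguments.
> >>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >> #include <string.h>
> >> #include <unistd.h>
> >>
> >> extern char **environ;
> >>
> >> int main(int argc, char **argv)
> >> {
> >>     int envc = 0;
> >>     for (char **e = environ; *e; e++)
> >>         envc++;
> >>
> >>     /* "podman run" + "-e NAME=VALUE" per variable + user args + NULL */
> >>     char **args = calloc(2 + 2 * envc + argc + 1, sizeof(char *));
> >>     int n = 0;
> >>     args[n++] = "podman";
> >>     args[n++] = "run";
> >>     for (char **e = environ; *e; e++)
> >>         if (strncmp(*e, "OMPI_", 5) == 0 || strncmp(*e, "PMIX_", 5) == 0) {
> >>             args[n++] = "-e";
> >>             args[n++] = *e;        /* already in NAME=VALUE form */
> >>         }
> >>     for (int i = 1; i < argc; i++) /* e.g. --userns=keep-id ... image cmd */
> >>         args[n++] = argv[i];
> >>     args[n] = NULL;
> >>
> >>     execvp("podman", args);
> >>     perror("execvp podman");       /* reached only if the exec fails */
> >>     return 1;
> >> }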
> >> 
> >> I need to tell Podman to mount /tmp from the host into the container. As
> >> I am running rootless, I also need to tell Podman to use the same user ID
> >> in the container as outside (so that the Open MPI files in /tmp can be
> >> shared), and I am also running without a network namespace.
> >> 
> >> So this now runs with the full isolation Podman provides, except for the
> >> network namespace. Thanks for your help!
> >> 
> >>        Adrian
> >> 
> >>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users 
> >>> wrote:
> >>> Adrian,
> >>> 
> >>> 
> >>> the MPI application relies on some environment variables (they typically
> >>> start with OMPI_ and PMIX_).
> >>> 
> >>> The MPI application internally uses a PMIx client that must be able to
> >>> contact a PMIx server located on the same host (that server is included
> >>> in mpirun and in the orted daemon(s) spawned on the remote hosts).
> >>> 
> >>> 
> >>> If podman provides some isolation between the app inside the container
> >>> (e.g. /home/mpi/hello) and the outside world (e.g. mpirun/orted), that
> >>> won't be an easy ride.
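> >>> 
> >>> A quick way to check what actually makes it into the container is to run
> >>> a tiny program there in place of the MPI binary and have it dump those
> >>> variables. Just a sketch (the exact set of variables depends on the Open
> >>> MPI and PMIx versions):
> >>> 
> >>> #include <stdio.h>
> >>> #include <string.h>
> >>> 
> >>> extern char **environ;
> >>> 
> >>> int main(void)
> >>> {
> >>>     /* Print every OMPI_* / PMIX_* variable this process can see; if the
> >>>      * list is (nearly) empty, the launch environment was not forwarded
> >>>      * and each rank will likely come up as "0 of 1" (singleton mode). */
> >>>     for (char **e = environ; *e; e++)
> >>>         if (strncmp(*e, "OMPI_", 5) == 0 || strncmp(*e, "PMIX_", 5) == 0)
> >>>             puts(*e);
> >>>     return 0;
> >>> }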
> >>> 
> >>> 
> >>> Cheers,
> >>> 
> >>> 
> >>> Gilles
> >>> 
> >>> 
> >>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
> >>>> I did a quick test to see if I can use Podman in combination with Open
> >>>> MPI:
> >>>> 
> >>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run 
> >>>> quay.io/adrianreber/mpi-test /home/mpi/hello
> >>>> 
> >>>>  Hello, world (1 procs total)
> >>>>     --> Process #   0 of   1 is alive. ->789b8fb622ef
> >>>> 
> >>>>  Hello, world (1 procs total)
> >>>>     --> Process #   0 of   1 is alive. ->749eb4e1c01a
> >>>> 
> >>>> The test program (hello) is taken from 
> >>>> https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
> >>>> 
> >>>> 
> >>>> The problem with this is that each process thinks it is process 0 of 1
> >>>> instead of
> >>>> 
> >>>>  Hello, world (2 procs total)
> >>>>     --> Process #   1 of   2 is alive.  ->test1
> >>>>     --> Process #   0 of   2 is alive.  ->test2
> >>>> 
> >>>> My question is: how is the rank determined? What resources do I need to
> >>>> have in my container to correctly determine the rank?
> >>>> 
> >>>> This is Podman 1.4.2 and Open MPI 4.0.1.
> >>>> 
> >>>>        Adrian
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
