Patches are always welcome. What would be great is a nice big warning that CMA
support is disabled because the processes are in different namespaces. Ideally,
all MPI processes should be in the same namespace to ensure the best
performance.

-Nathan

> On Jul 21, 2019, at 2:53 PM, Adrian Reber via users 
> <users@lists.open-mpi.org> wrote:
> 
> For completeness I am mentioning my results also here.
> 
> To be able to mount file systems in the container, user namespaces have to
> be used. But even if the user IDs are all the same (in each container and
> on the host), the kernel's ptrace permission check also requires the
> processes to be in the same user namespace (in addition to being owned by
> the same user). This check - same user namespace - fails, and so
> process_vm_readv() and process_vm_writev() fail as well.
> 
> So Open MPI's checks are currently not enough to detect if 'cma' can be
> used. Checking for the same user namespace would also be necessary.
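> 
> A quick way to see this from the host is to compare the user namespace
> links of two ranks (the pids below are just placeholders):
> 
> $ readlink /proc/<pid-of-rank-0>/ns/user /proc/<pid-of-rank-1>/ns/user
> 
> If the two user:[...] ids differ, process_vm_readv() fails with EPERM even
> though the uid is identical.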
> 
> Is this a use case important enough to accept a patch for it?
> 
>        Adrian
> 
>> On Fri, Jul 12, 2019 at 03:42:15PM +0200, Adrian Reber via users wrote:
>> Gilles,
>> 
>> thanks again. Adding '--mca btl_vader_single_copy_mechanism none' helps
>> indeed.
>> 
>> The default seems to be 'cma', which uses process_vm_readv() and
>> process_vm_writev(). That seems to require CAP_SYS_PTRACE, but telling
>> Podman to give the process CAP_SYS_PTRACE with '--cap-add=SYS_PTRACE'
>> does not seem to be enough. Not sure yet if this is related to the fact
>> that Podman is running rootless. I will continue to investigate, but now
>> I know where to look. Thanks!
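>> 
>> To double check that the capability actually makes it into the rootless
>> container, something like this should work (assuming grep is available in
>> the image; the CapEff mask can be decoded with capsh --decode=<mask> on
>> the host):
>> 
>> $ podman run --cap-add=SYS_PTRACE mpi-test grep CapEff /proc/self/status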
>> 
>>        Adrian
>> 
>>> On Fri, Jul 12, 2019 at 06:48:59PM +0900, Gilles Gouaillardet via users 
>>> wrote:
>>> Adrian,
>>> 
>>> Can you try
>>> mpirun --mca btl_vader_copy_mechanism none ...
>>> 
>>> Please double check the MCA parameter name; I am AFK.
>>> 
>>> IIRC, the default copy mechanism used by vader directly accesses the remote 
>>> process address space, and this requires some permission (ptrace?) that 
>>> might be dropped by podman.
>>> 
>>> Note that Open MPI might not detect that both MPI tasks run on the same
>>> node because of podman.
>>> If you use UCX, then btl/vader is not used at all (pml/ucx is used instead)
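>>> (Something like 'mpirun --mca pml ucx ...' should force that, provided
>>> Open MPI was built with UCX support.)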
>>> 
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> Sent from my iPod
>>> 
>>>> On Jul 12, 2019, at 18:33, Adrian Reber via users 
>>>> <users@lists.open-mpi.org> wrote:
>>>> 
>>>> So upstream Podman was really fast and merged a PR which makes my
>>>> wrapper unnecessary:
>>>> 
>>>> Add support for --env-host : https://github.com/containers/libpod/pull/3557
>>>> 
>>>> As commented in the PR I can now start mpirun with Podman without a
>>>> wrapper:
>>>> 
>>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun 
>>>> podman run --env-host --security-opt label=disable -v 
>>>> /tmp/podman-mpirun:/tmp/podman-mpirun --userns=keep-id --net=host mpi-test 
>>>> /home/mpi/ring
>>>> Rank 0 has cleared MPI_Init
>>>> Rank 1 has cleared MPI_Init
>>>> Rank 0 has completed ring
>>>> Rank 0 has completed MPI_Barrier
>>>> Rank 1 has completed ring
>>>> Rank 1 has completed MPI_Barrier
>>>> 
>>>> This example was using TCP; on an InfiniBand-based system I have to map
>>>> the InfiniBand devices into the container.
>>>> 
>>>> $ mpirun --mca btl ^openib --hostfile ~/hosts --mca orte_tmpdir_base 
>>>> /tmp/podman-mpirun podman run --env-host -v 
>>>> /tmp/podman-mpirun:/tmp/podman-mpirun --security-opt label=disable 
>>>> --userns=keep-id --device /dev/infiniband/uverbs0 --device 
>>>> /dev/infiniband/umad0 --device /dev/infiniband/rdma_cm --net=host mpi-test 
>>>> /home/mpi/ring
>>>> Rank 0 has cleared MPI_Init
>>>> Rank 1 has cleared MPI_Init
>>>> Rank 0 has completed ring
>>>> Rank 0 has completed MPI_Barrier
>>>> Rank 1 has completed ring
>>>> Rank 1 has completed MPI_Barrier
>>>> 
>>>> This is all running without root and only using Podman's rootless
>>>> support.
>>>> 
>>>> Running multiple processes on one system, however, still gives me an
>>>> error. If I disable vader, I guess Open MPI uses TCP for localhost
>>>> communication, and that works. But with vader it fails.
>>>> 
>>>> The first error message I get is a segfault:
>>>> 
>>>> [test1:00001] *** Process received signal ***
>>>> [test1:00001] Signal: Segmentation fault (11)
>>>> [test1:00001] Signal code: Address not mapped (1)
>>>> [test1:00001] Failing at address: 0x7fb7b1552010
>>>> [test1:00001] [ 0] /lib64/libpthread.so.0(+0x12d80)[0x7f6299456d80]
>>>> [test1:00001] [ 1] 
>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_vader.so(mca_btl_vader_send+0x3db)[0x7f628b33ab0b]
>>>> [test1:00001] [ 2] 
>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rdma+0x1fb)[0x7f62901d24bb]
>>>> [test1:00001] [ 3] 
>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xfd6)[0x7f62901be086]
>>>> [test1:00001] [ 4] 
>>>> /usr/lib64/openmpi/lib/libmpi.so.40(PMPI_Send+0x1bd)[0x7f62996f862d]
>>>> [test1:00001] [ 5] /home/mpi/ring[0x400b76]
>>>> [test1:00001] [ 6] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f62990a3813]
>>>> [test1:00001] [ 7] /home/mpi/ring[0x4008be]
>>>> [test1:00001] *** End of error message ***
>>>> 
>>>> Guessing that vader uses shared memory, this is expected to fail with
>>>> all the namespace isolation in place - maybe not with a segfault, but
>>>> each container has its own shared memory. So the next step was to use
>>>> the host's IPC and PID namespaces and mount /dev/shm:
>>>> 
>>>> '-v /dev/shm:/dev/shm --ipc=host --pid=host'
>>>> 
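>>>> The full invocation then looks roughly like this (reusing the image and
>>>> the mounts from the examples above):
>>>> 
>>>> $ mpirun --hostfile ~/hosts --mca orte_tmpdir_base /tmp/podman-mpirun 
>>>> podman run --env-host --security-opt label=disable -v 
>>>> /tmp/podman-mpirun:/tmp/podman-mpirun -v /dev/shm:/dev/shm --ipc=host 
>>>> --pid=host --userns=keep-id --net=host mpi-test /home/mpi/ring
>>>> 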
>>>> That does not segfault, but the output still does not look correct:
>>>> 
>>>> Rank 0 has cleared MPI_Init
>>>> Rank 1 has cleared MPI_Init
>>>> Rank 2 has cleared MPI_Init
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> [test1:17722] Read -1, expected 80000, errno = 1
>>>> Rank 0 has completed ring
>>>> Rank 2 has completed ring
>>>> Rank 0 has completed MPI_Barrier
>>>> Rank 1 has completed ring
>>>> Rank 2 has completed MPI_Barrier
>>>> Rank 1 has completed MPI_Barrier
>>>> 
>>>> This is using the Open MPI ring.c example with SIZE increased from 20 to 
>>>> 20000.
>>>> 
>>>> Any recommendations on what vader needs in order to communicate correctly?
>>>> 
>>>>       Adrian
>>>> 
>>>>> On Thu, Jul 11, 2019 at 12:07:35PM +0200, Adrian Reber via users wrote:
>>>>> Gilles,
>>>>> 
>>>>> thanks for pointing out the environment variables. I quickly created a
>>>>> wrapper which tells Podman to re-export all OMPI_ and PMIX_ variables
>>>>> (grep "\(PMIX\|OMPI\)"). Now it works:
>>>>> 
>>>>> $ mpirun --hostfile ~/hosts ./wrapper -v /tmp:/tmp --userns=keep-id 
>>>>> --net=host mpi-test /home/mpi/hello
>>>>> 
>>>>> Hello, world (2 procs total)
>>>>>   --> Process #   0 of   2 is alive. ->test1
>>>>>   --> Process #   1 of   2 is alive. ->test2
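>>>>> 
>>>>> The wrapper itself is just a few lines, roughly like this (a quick
>>>>> hack; environment values containing spaces would need more careful
>>>>> quoting):
>>>>> 
>>>>> #!/bin/bash
>>>>> # Re-export all OMPI_ and PMIX_ variables into the container and pass
>>>>> # the remaining arguments on to 'podman run'.
>>>>> ENVARGS=""
>>>>> for var in $(env | grep "\(PMIX\|OMPI\)" | cut -d= -f1); do
>>>>>     ENVARGS="$ENVARGS -e $var=${!var}"
>>>>> done
>>>>> exec podman run $ENVARGS "$@"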
>>>>> 
>>>>> I need to tell Podman to mount /tmp from the host into the container. As
>>>>> I am running rootless, I also need to tell Podman to use the same user ID
>>>>> in the container as outside (so that the Open MPI files in /tmp can be
>>>>> shared), and I am also running without a network namespace.
>>>>> 
>>>>> So this is now running with the full isolation Podman provides, except
>>>>> for the network namespace. Thanks for your help!
>>>>> 
>>>>>       Adrian
>>>>> 
>>>>>> On Thu, Jul 11, 2019 at 04:47:21PM +0900, Gilles Gouaillardet via users 
>>>>>> wrote:
>>>>>> Adrian,
>>>>>> 
>>>>>> 
>>>>>> the MPI application relies on some environment variables (they typically
>>>>>> start with OMPI_ and PMIX_).
>>>>>> 
>>>>>> The MPI application internally uses a PMIx client that must be able to
>>>>>> contact a PMIx server located on the same host (the PMIx server is
>>>>>> embedded in mpirun and in the orted daemon(s) spawned on the remote
>>>>>> hosts).
>>>>>> 
>>>>>> 
>>>>>> If podman provides some isolation between the app inside the container 
>>>>>> (e.g.
>>>>>> /home/mpi/hello)
>>>>>> 
>>>>>> and the outside world (e.g. mpirun/orted), that won't be an easy ride.
>>>>>> 
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> 
>>>>>>> On 7/11/2019 4:35 PM, Adrian Reber via users wrote:
>>>>>>> I did a quick test to see if I can use Podman in combination with Open
>>>>>>> MPI:
>>>>>>> 
>>>>>>> [test@test1 ~]$ mpirun --hostfile ~/hosts podman run 
>>>>>>> quay.io/adrianreber/mpi-test /home/mpi/hello
>>>>>>> 
>>>>>>> Hello, world (1 procs total)
>>>>>>>    --> Process #   0 of   1 is alive. ->789b8fb622ef
>>>>>>> 
>>>>>>> Hello, world (1 procs total)
>>>>>>>    --> Process #   0 of   1 is alive. ->749eb4e1c01a
>>>>>>> 
>>>>>>> The test program (hello) is taken from 
>>>>>>> https://raw.githubusercontent.com/openhpc/ohpc/obs/OpenHPC_1.3.8_Factory/tests/mpi/hello.c
>>>>>>> 
>>>>>>> 
>>>>>>> The problem with this is that each process thinks it is process 0 of 1
>>>>>>> instead of
>>>>>>> 
>>>>>>> Hello, world (2 procs total)
>>>>>>>    --> Process #   1 of   2 is alive.  ->test1
>>>>>>>    --> Process #   0 of   2 is alive.  ->test2
>>>>>>> 
>>>>>>> My question is: how is the rank determined? What resources do I need
>>>>>>> to have in my container to correctly determine the rank?
>>>>>>> 
>>>>>>> This is Podman 1.4.2 and Open MPI 4.0.1.
>>>>>>> 
>>>>>>>       Adrian