Jeff, all,

Unfortunately, I (as a user) have no control over the page size on our cluster. My interest is more general in nature: I am concerned that users who run our code on top of Open MPI will run into this issue on their machines.

I took a look at the code for the various window creation methods and now have a better picture of the allocation process in Open MPI. I realized that memory in windows allocated through MPI_Win_allocate or created through MPI_Win_create is registered with the IB device using ibv_reg_mr, which takes significant time for large allocations (I assume this is where hugepages would help?). In contrast, memory attached through MPI_Win_attach does not seem to be registered, which explains the lower allocation latency I am observing there (although I seem to remember having seen higher communication latencies for such memory as well).
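
To illustrate what I measured, here is a stripped-down timing sketch (my own minimal test, not the benchmark attached to my original mail; the 1 GiB size is arbitrary and error handling is omitted):

```
/* Compare window-creation cost of MPI_Win_allocate (memory registered by the
 * implementation) with MPI_Win_create_dynamic plus MPI_Win_attach. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t size = (size_t)1 << 30;   /* 1 GiB per process, adjust as needed */
    void *base;
    MPI_Win win;

    double t0 = MPI_Wtime();
    MPI_Win_allocate((MPI_Aint)size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
    double t_alloc = MPI_Wtime() - t0;
    MPI_Win_free(&win);

    void *buf = malloc(size);
    t0 = MPI_Wtime();
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_attach(win, buf, (MPI_Aint)size);
    double t_attach = MPI_Wtime() - t0;
    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    free(buf);

    if (rank == 0)
        printf("win_allocate: %.3f s, win_create_dynamic+attach: %.3f s\n",
               t_alloc, t_attach);
    MPI_Finalize();
    return 0;
}
```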

Regarding the size limitation of /tmp: I found an opal/mca/shmem/posix component that uses shm_open to create a POSIX shared-memory object instead of a file on disk, which is then mmap'ed. Unfortunately, if I raise the priority of this component above that of the default mmap component, I end up with a SIGBUS during MPI_Init; no other errors are reported by MPI. Should I open a ticket on GitHub for this?
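
For context, my understanding of the basic mechanism of that component, as a simplified sketch (the segment name below is made up, and this is not the actual component code):

```
/* Create a POSIX shared-memory object, size it, and map it. The backing
 * lives in tmpfs (/dev/shm on Linux) rather than in a file under /tmp,
 * so it is still limited by the size of that tmpfs.
 * Link with -lrt on older glibc. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *name = "/ompi_shmem_example";   /* made-up segment name */
    size_t size = (size_t)1 << 30;               /* 1 GiB example */

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, (off_t)size) != 0) { perror("ftruncate"); return 1; }

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);                                   /* mapping stays valid */

    printf("mapped %zu bytes at %p\n", size, base);
    munmap(base, size);
    shm_unlink(name);
    return 0;
}
```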

As an alternative, would it be possible to avoid the backing file for large allocations (maybe above a certain threshold) by using anonymous shared-memory mappings on systems that support MAP_ANONYMOUS, and to distribute the resulting mapping among the processes on the node?
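
A minimal sketch of what I have in mind (my own illustration; note that an anonymous MAP_SHARED mapping is only visible to processes forked after the mmap call, so the runtime would have to create it in a common ancestor of the ranks on a node):

```
/* Anonymous shared mapping with no backing file. The mapping is inherited
 * by child processes created after mmap(), which is the only way to share
 * it without a file descriptor or backing object. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t size = (size_t)1 << 30;   /* 1 GiB example, no file in /tmp needed */
    char *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {               /* child sees the same physical pages */
        strcpy(base, "hello from the child");
        return 0;
    }
    wait(NULL);
    printf("parent reads: %s\n", base);
    munmap(base, size);
    return 0;
}
```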

Thanks,
Joseph

On 08/29/2017 06:12 PM, Jeff Hammond wrote:
I don't know of any reason why you shouldn't be able to use IB for intra-node transfers. There are, of course, arguments against doing it in general (e.g., IB/PCI bandwidth is lower than DDR4 bandwidth), but it likely behaves less synchronously than shared memory, since I'm not aware of any MPI RMA library that dispatches intra-node RMA operations to an asynchronous agent (e.g., a communication helper thread).

Regarding 4, 100 GB is roughly 26 million 4 KB pages, so faulting it in 24 s works out to about 1 us per page, which doesn't sound unreasonable to me. You might investigate if/how you can use 2 MB or 1 GB pages instead. It's possible Open MPI already supports this if the underlying system does. You may need to twiddle your OS settings to get hugetlbfs working.
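
A rough way to check the effect of page size on a given node is to time first-touch of an anonymous mapping with default 4 KB pages against one created with MAP_HUGETLB (Linux-only sketch, and it assumes huge pages have already been reserved via vm.nr_hugepages):

```
/* Time first-touch page faults for 4K pages vs. huge pages. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void fault_and_time(const char *label, size_t size, int extra_flags) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
    if (p == MAP_FAILED) { perror(label); return; }
    double t = seconds();
    memset(p, 0, size);              /* touch every page */
    printf("%s: %.3f s\n", label, seconds() - t);
    munmap(p, size);
}

int main(void) {
    size_t size = (size_t)4 << 30;   /* 4 GiB example */
    fault_and_time("4K pages  ", size, 0);
    fault_and_time("huge pages", size, MAP_HUGETLB);
    return 0;
}
```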

Jeff

On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

    Jeff, all,

    Thanks for the clarification. My measurements show that global memory
    allocations do not require the backing file if there is only one
    process per node, for an arbitrary number of processes. So I was
    wondering whether the same allocation scheme could be used even with
    multiple processes per node if there is not enough space available in
    /tmp. However, I am not sure whether the IB devices can be used to
    perform intra-node RMA. At least that would retain the functionality
    on this kind of system (which arguably might be a rare case).

    On a different note, I found over the weekend that Valgrind only
    supports allocations up to 60GB, so my second point reported below may
    be invalid. Number 4 still seems curious to me, though.

    Best
    Joseph

    On 08/25/2017 09:17 PM, Jeff Hammond wrote:

        There's no reason to do anything special for shared memory with
        a single-process job because MPI_Win_allocate_shared(MPI_COMM_SELF)
        ~= MPI_Alloc_mem(). However, it would help debugging if MPI
        implementers at least had an option to take the code path that
        allocates shared memory even when np=1.

        Jeff

        On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart
        <schuch...@hlrs.de> wrote:

             Gilles,

             Thanks for your swift response. On this system, /dev/shm only
             has 256M available, so that is unfortunately not an option.
             I tried disabling both the vader and sm btls via
             `--mca btl ^vader,sm`, but Open MPI still seems to allocate
             the shmem backing file under /tmp. From my point of view,
             losing the performance benefits of file-backed shared memory
             would be acceptable as long as large allocations work, but I
             don't know the implementation details and whether that is
             possible. It seems that the mmap does not happen if there is
             only one process per node.

             Cheers,
             Joseph


             On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:

                 Joseph,

                 The error message suggests that allocating memory with
                 MPI_Win_allocate[_shared] is done by creating a file and
                 then mmap'ing it. How much space do you have in /dev/shm?
                 (This is a tmpfs, i.e. a RAM file system.) There is likely
                 quite some space there, so as a workaround I suggest you
                 use it as the shared-memory backing directory.

                 /* I am afk and do not remember the syntax; ompi_info
                 --all | grep backing is likely to help */

                 Cheers,

                 Gilles

                 On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart
                 <schuch...@hlrs.de> wrote:

                     All,

                     I have been experimenting with large window
                     allocations recently and have made some interesting
                     observations that I would like to share.

                     The system under test:
                         - Linux cluster equipped with IB
                         - Open MPI 2.1.1
                         - 128GB main memory per node
                         - 6GB /tmp filesystem per node

                     My observations:
                     1) Running with 1 process on a single node, I can
                     allocate and write to memory up to ~110 GB through
                     MPI_Alloc_mem, MPI_Win_allocate, and
                     MPI_Win_allocate_shared.

                     2) If running with 1 process per node on 2 nodes,
                     single large allocations succeed, but with the
                     repeated allocate/free cycle in the attached code I
                     see the application reproducibly being killed by the
                     OOM killer at the 25GB allocation with
                     MPI_Win_allocate_shared. When I try to run it under
                     Valgrind, I get an error from MPI_Win_allocate at
                     ~50GB that I cannot make sense of:

                     ```
                     MPI_Alloc_mem:  53687091200 B
                     [n131302:11989] *** An error occurred in MPI_Alloc_mem
                     [n131302:11989] *** reported by process [1567293441,1]
                     [n131302:11989] *** on communicator MPI_COMM_WORLD
                     [n131302:11989] *** MPI_ERR_NO_MEM: out of memory
                     [n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
                     [n131302:11989] ***    and potentially your MPI job)
                     ```

                     3) If running with 2 processes on a node, I get the
                     following error from both MPI_Win_allocate and
                     MPI_Win_allocate_shared:
                     ```
                     --------------------------------------------------------------------------
                     It appears as if there is not enough space for
                     /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
                     (the shared-memory backing file). It is likely that
                     your MPI job will now either abort or experience
                     performance degradation.

                         Local host:  n131702
                         Space Requested: 6710890760 B
                         Space Available: 6433673216 B
                     ```
                     This seems to be related to the size limit of /tmp.
                     MPI_Alloc_mem works as expected, i.e., I can allocate
                     ~50GB per process. I understand that I can set $TMP
                     to a bigger filesystem (such as Lustre), but then I
                     am greeted with a warning on each allocation and
                     performance seems to drop. Is there a way to fall
                     back to the allocation strategy used in case 2)?

                     4) It is also worth noting the time it takes to
                     allocate the memory: while the allocations are in the
                     sub-millisecond range for both MPI_Alloc_mem and
                     MPI_Win_allocate_shared, it takes >24s to allocate
                     100GB using MPI_Win_allocate, and the time increases
                     linearly with the allocation size.

                     Are these issues known? Maybe there is documentation
                     describing work-arounds? (esp. for 3) and 4))

                     I am attaching a small benchmark. Please make sure to
                     adjust the MEM_PER_NODE macro to suit your system
                     before you run it :) I'm happy to provide additional
                     details if needed.

                     Best
                     Joseph















--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/



--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
