Thanks for the clarification. I will go the new btl module route.
I used -btl self,tcp in the past to get things working (when dealing with the exec and fork problems). So at the moment Open MPI runs fine, and we were able to run some test jobs to get preliminary performance measurements. Only the cleanup part was still problematic (and I lack(ed) the Open MPI knowledge to understand how things _should_ work).

BR Justin

On 16. 12. 2015 12:20, Gilles Gouaillardet wrote:
Justin,

knem allows a process to write into the address space of another process, to do zero copy. in the case of osv, threads can simply do a memcpy(), and I doubt knem is even available.
so a new btl that uses memcpy would be optimal on osv.

one option is to start from the vader btl and replace the knem invocation with memcpy()
another option could be to extend the self btl

but once again, this is for performance only; using the tcp btl alone should be enough to get things working.

Cheers,

Gilles

On Wednesday, December 16, 2015, Justin Cinkelj <justin.cink...@xlab.si> wrote:

    Vader is for intra-node communication only, right? So for
    inter-node communication some other mechanism will be used anyway.
    Why would it be even better to write a new btl? To avoid memcpy (knem
    would use it, if I understand you correctly; I guess the code assumes
    that multiple processes on the same node have isolated address spaces).

    Fork + execve was one of the first problems, yes. I replaced that with
    OSv-specific calls (ignore fork, and instead of execve start the given
    binary in a new thread). The global variables required an OSv
    modification - the guys from http://osv.io/ took care of that (I
    was surprised that in the end the patches were really small and
    elegant). So while there are no real processes, a new binary / ELF
    file is loaded at a different address than the rest of the OS - so it
    has separate global variables, and a separate environ too. Other
    resources like file descriptors are still shared.

    BR Justin

    On 15. 12. 2015 14:55, Gilles Gouaillardet wrote:
    Justin,

    at first glance, vader should be symmetric (e.g.
    call opal_shmem_segment_detach() instead of munmap())
    Nathan, can you please comment ?

    using tid instead of pid should also do the trick

    that being said, a more elegant approach would be to create a new
    module in the shmem framework
    basically, create = malloc, attach = return the malloc'ed
    address, detach = noop, destroy = free
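
A minimal sketch of the create/attach/detach/destroy mapping suggested above, assuming a thread-only OS where every "process" shares one heap. The `thread_seg_t` type and the `thread_seg_*` function names are hypothetical and chosen for illustration only; they are not the actual Open MPI shmem module interface, which a real module would have to implement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor; this is NOT the real opal_shmem_ds_t. */
typedef struct {
    void  *base;   /* address returned by malloc, NULL when unused */
    size_t size;   /* segment size in bytes */
} thread_seg_t;

/* create = malloc */
int thread_seg_create(thread_seg_t *seg, size_t size)
{
    seg->base = malloc(size);
    if (seg->base == NULL) {
        seg->size = 0;
        return -1;
    }
    seg->size = size;
    return 0;
}

/* attach = return the malloc'ed address (all threads share the heap) */
void *thread_seg_attach(const thread_seg_t *seg)
{
    return seg->base;
}

/* detach = no-op: nothing was mapped, so there is nothing to unmap */
int thread_seg_detach(thread_seg_t *seg)
{
    (void)seg;
    return 0;
}

/* destroy = free; meant to be called exactly once, by the creator */
int thread_seg_destroy(thread_seg_t *seg)
{
    free(seg->base);
    seg->base = NULL;
    seg->size = 0;
    return 0;
}
```

With this shape, attach returns the same pointer to every caller and detach never touches the memory, so the double-unmap seen during cleanup would not arise from detach at all.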

    and an even better approach would be to write your own btl that
    replaces vader.
    basically, vader can use the knem module to write into another
    process's address space.
    since your os is thread only, the knem invocation would become a
    simple memcpy.
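
To illustrate the point about knem collapsing into memcpy, here is a small sketch, assuming sender and receiver are threads in one address space. The `peer_buf_t` handle and `single_copy_get()` are hypothetical names, not part of vader or knem; they only show that a published address can be read directly once there is no process boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handle a sender publishes so that a peer can reach its
 * buffer. Between real processes this address would be meaningless to
 * the peer without a mechanism such as knem. */
typedef struct {
    const void *addr;  /* sender's buffer address */
    size_t      len;   /* bytes available at addr */
} peer_buf_t;

/* Copy up to max_len bytes from the peer's buffer in a single copy.
 * Valid only because both sides share one address space. */
size_t single_copy_get(void *dst, size_t max_len, const peer_buf_t *src)
{
    size_t n = src->len < max_len ? src->len : max_len;
    memcpy(dst, src->addr, n);
    return n;
}
```

The sketch leaves out everything a btl actually needs around the copy (matching, completion, synchronization between sender and receiver); it covers only the data movement step.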

    makes sense ?


    as a side note,
    ompi uses global variables, and orted forks and execs MPI tasks
    after setting some environment variables. it seems porting ompi
    to this new os was not so painful, and I would have expected some
    issues with the global variables, and some race conditions with
    the environment.
    did you already solve these issues ?

    Cheers,

    Gilles

    On Tuesday, December 15, 2015, Justin Cinkelj
    <justin.cink...@xlab.si> wrote:

        I'm trying to port Open MPI to an OS with threads instead of
        processes. Currently, during MPI_Finalize, I see attempts to
        call munmap, first with address 0x200000c00000 and later with
        0x200000c00008.

        mca_btl_vader_component_close():
          munmap (mca_btl_vader_component.my_segment,
                  mca_btl_vader_component.segment_size)

        mca_btl_vader_component_init():
          if (MCA_BTL_VADER_XPMEM != mca_btl_vader_component.single_copy_mechanism) {
            opal_shmem_segment_create (&component->seg_ds, sm_file, component->segment_size);
            component->my_segment = opal_shmem_segment_attach (&component->seg_ds);
          } else {
            mmap (NULL, component->segment_size, PROT_READ | PROT_WRITE,
                  MAP_ANONYMOUS | MAP_SHARED, -1, 0);
          }

        But opal_shmem_segment_attach (from the mmap module) ends with:
            /* update returned base pointer with an offset that hides our stuff */
            return (ds_buf->seg_base_addr + sizeof(opal_shmem_seg_hdr_t));

        So mca_btl_vader_component_close() should in that case call
        opal_shmem_segment_detach() instead of munmap.
        Or actually, since at that point the shmem_mmap module cleanup code
        has already run, could/should vader just skip the cleanup part?

        Maybe I should first ask how that setup/cleanup works on a
        normal Linux system.
        Is mmap called twice, with vader and the shmem_mmap module each
        using a different address (so that the vader munmap is indeed required
        in that case)?
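
The mismatch described above can be reproduced in isolation. The following is a sketch assuming Linux, where munmap() rejects an address that is not page-aligned with EINVAL. `seg_hdr_t`, `seg_attach()` and `demo_unmap()` are hypothetical stand-ins, not Open MPI code; the header is simply some small struct placed at the start of the mapping, as in the quoted attach code.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for opal_shmem_seg_hdr_t. */
typedef struct { long cpid; } seg_hdr_t;

/* Map a segment and return the pointer that hides the header.
 * The real mapping base is stored in *base_out. NULL on failure. */
void *seg_attach(size_t size, void **base_out)
{
    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }
    *base_out = base;
    return (char *)base + sizeof(seg_hdr_t);
}

/* Returns 0 when munmap() rejects the offset pointer with EINVAL
 * and accepts the real base; -1 if mapping failed; 1 otherwise. */
int demo_unmap(void)
{
    void  *base = NULL;
    size_t size = 4096;
    void  *user = seg_attach(size, &base);
    if (user == NULL) {
        return -1;
    }

    errno = 0;
    int bad       = munmap(user, size);  /* offset pointer, not page-aligned */
    int bad_errno = errno;
    int good      = munmap(base, size);  /* the address mmap returned */

    return (bad == -1 && bad_errno == EINVAL && good == 0) ? 0 : 1;
}
```

This only shows why the pointer returned by an attach routine of this shape cannot be handed to munmap() directly; it does not say which cleanup order Open MPI intends.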

        Second question.
        With two threads in one process, I saw attempts to call
        opal_shmem_segment_detach() and munmap() on the same mmap-ed
        address from both threads. I 'fixed' that by replacing
        "ds_buf->seg_cpid = getpid()" with gettid(), so that each
        thread munmap-s only the address it allocated itself. Is that
        correct? Or is it possible that the second thread might
        still try to access data at that address?
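
The getpid()/gettid() workaround can be pictured as an ownership check on the segment header. This is a sketch assuming Linux, where the thread id is obtained through syscall(SYS_gettid); `owner_hdr_t`, `current_tid()` and `may_unmap()` are hypothetical names, and only the `seg_cpid` field name is taken from the message above.

```c
#define _GNU_SOURCE       /* for syscall() */
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical header; only the creator-id field matters here. */
typedef struct { long seg_cpid; } owner_hdr_t;

/* Linux-specific thread id; in the main thread it equals getpid(). */
long current_tid(void)
{
    return (long)syscall(SYS_gettid);
}

/* Cleanup rule: only the creator recorded in the header may unmap.
 * If seg_cpid holds getpid(), every thread of the process matches;
 * if it holds a thread id, only the creating thread matches. */
int may_unmap(const owner_hdr_t *hdr, long caller_id)
{
    return hdr->seg_cpid == caller_id;
}
```

The check prevents a second munmap of the same address, but it does not by itself answer whether another thread may still be using the segment when the creator unmaps it; that would need some form of synchronization or reference counting, which this sketch does not attempt.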

        BR Justin

        _______________________________________________
        devel mailing list
        de...@open-mpi.org
        Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
        Link to this post:
        http://www.open-mpi.org/community/lists/devel/2015/12/18417.php






