Folks,

mpirun -np 3 --oversubscribe --mca coll basic,libnbc,self --mca pml ob1 --mca btl tcp,self valgrind ./osu_reduce_scatter -m 65536:65536 -i 1 -x 0


issues some warnings such as

==5698== Invalid write of size 2
==5698==    at 0x4C2C0D3: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1018)
==5698==    by 0x633BD0F: opal_convertor_unpack (opal_convertor.c:318)
==5698==    by 0xEE400A4: mca_pml_ob1_recv_request_progress_match (pml_ob1_recvreq.c:865)
==5698==    by 0xEE40E2F: mca_pml_ob1_recv_req_start (pml_ob1_recvreq.c:1237)
==5698==    by 0xEE31720: mca_pml_ob1_recv (pml_ob1_irecv.c:134)
==5698==    by 0xF265303: mca_coll_basic_reduce_scatter_intra (coll_basic_reduce_scatter.c:303)
==5698==    by 0x4EE068C: PMPI_Reduce_scatter (preduce_scatter.c:131)
==5698==    by 0x401B7B: main (osu_reduce_scatter.c:127)
==5698==  Address 0x75cc554 is 0 bytes after a block of size 21,844 alloc'd
==5698==    at 0x4C29C5C: memalign (vg_replace_malloc.c:857)
==5698==    by 0x4C29D21: posix_memalign (vg_replace_malloc.c:1020)
==5698==    by 0x403D31: allocate_buffer (osu_coll.c:808)
==5698==    by 0x401997: main (osu_reduce_scatter.c:88)

This clearly indicates a buffer overflow, which can produce unexpected results: writing past the end of a heap block can corrupt the allocator's metadata, so the failure may only surface later, for example as the crash in free() seen in Emmanuel's environment.

After a bit of digging, I found that the bug is in the benchmark itself (!)
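
For reference, here is a rough back-of-the-envelope sketch of the two sizing formulas for the 3-rank, 64 KiB case from the valgrind trace above (assuming a 4-byte float; this standalone program and its variable names are mine, not part of the benchmark):

#include <stdio.h>

int main(void)
{
    size_t max_message_size = 65536;   /* -m 65536:65536 */
    size_t numprocs = 3;               /* -np 3 */

    /* original benchmark formula: the "+ 1" is swallowed by the
     * integer division by sizeof(float) */
    size_t old_bufsize = sizeof(float) *
        ((max_message_size / numprocs + 1) / sizeof(float));

    /* patched formula: add one whole float after the division */
    size_t new_bufsize = sizeof(float) *
        (max_message_size / numprocs / sizeof(float) + 1);

    /* prints: old = 21844 bytes, new = 21848 bytes */
    printf("old = %zu bytes, new = %zu bytes\n", old_bufsize, new_bufsize);
    return 0;
}

The original formula loses the "+ 1" to the integer division and comes up 4 bytes short of the patched size, which matches the 21,844-byte block reported by valgrind.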


Emmanuel,

Can you please give the attached patch a try?


Cheers,

Gilles

On 3/29/2017 7:13 AM, George Bosilca wrote:
Emmanuel,

I tried with both 2.x and master (they are only syntactically different with regard to reduce_scatter) and I can't reproduce your issue. I ran the OSU test with the following command line:

mpirun -n 97 --mca coll basic,libnbc,self --mca pml ob1 ./osu_reduce_scatter -m 524288: -i 1 -x 0

I used the IB, TCP, vader and self BTLs.

  George.





On Tue, Mar 28, 2017 at 6:21 AM, Howard Pritchard <hpprit...@gmail.com> wrote:

    Hello Emmanuel,

    Which version of Open MPI are you using?

    Howard


    2017-03-28 3:38 GMT-06:00 BRELLE, EMMANUEL <emmanuel.bre...@atos.net>:

        Hi,
        We are working on a portals4 component and we have found a bug
        (causing a segmentation fault) which must be related to the
        coll/basic component.
        Due to a lack of time, we cannot investigate further, but this
        seems to be caused by a "free(disps);" (around line 300 in
        coll_basic_reduce_scatter) in some specific situations. In our
        case it happens on osu_reduce_scatter (from the OSU
        micro-benchmarks) with at least 97 procs for sizes bigger than
        512 KB.
        Steps to reproduce:
        export OMPI_MCA_mtl=^portals4
        export OMPI_MCA_btl=^portals4
        export OMPI_MCA_coll=basic,libnbc,self,tuned
        export OMPI_MCA_osc=^portals4
        export OMPI_MCA_pml=ob1
        mpirun -n 97 osu_reduce_scatter -m 524288:
        (reducing the number of iterations with -i 1 -x 0 should keep
        the bug)
        Our git branch is based on the v2.x branch, and the files
        differ almost only in the portals4 parts.
        Could someone confirm this bug?
        Emmanuel BRELLE


diff -ruN orig/osu-micro-benchmarks-5.3.2/mpi/collective/osu_reduce_scatter.c osu-micro-benchmarks-5.3.2/mpi/collective/osu_reduce_scatter.c
--- orig/osu-micro-benchmarks-5.3.2/mpi/collective/osu_reduce_scatter.c	2016-09-09 03:39:29.000000000 +0900
+++ osu-micro-benchmarks-5.3.2/mpi/collective/osu_reduce_scatter.c	2017-03-29 14:12:04.155264761 +0900
@@ -84,7 +84,7 @@
     }
     set_buffer(sendbuf, options.accel, 1, bufsize);
 
-    bufsize = sizeof(float)*((options.max_message_size/numprocs + 1)/sizeof(float));
+    bufsize = sizeof(float)*(options.max_message_size/numprocs/sizeof(float)+1);
     if (allocate_buffer((void**)&recvbuf, bufsize,
                 options.accel)) {
         fprintf(stderr, "Could Not Allocate Memory [rank %d]\n", rank);
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
