Thank you! The patch indeed fixes my problem. You saved me a lot of time.

Emmanuel
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles Gouaillardet
Sent: Wednesday, March 29, 2017 7:24 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] Segfault during a free in reduce_scatter using basic component

Folks,

mpirun -np 3 --oversubscribe --mca coll basic,libnbc,self --mca pml ob1 --mca btl tcp,self valgrind ./osu_reduce_scatter -m 65536:65536 -i 1 -x 0

issues some warnings such as

==5698== Invalid write of size 2
==5698==    at 0x4C2C0D3: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1018)
==5698==    by 0x633BD0F: opal_convertor_unpack (opal_convertor.c:318)
==5698==    by 0xEE400A4: mca_pml_ob1_recv_request_progress_match (pml_ob1_recvreq.c:865)
==5698==    by 0xEE40E2F: mca_pml_ob1_recv_req_start (pml_ob1_recvreq.c:1237)
==5698==    by 0xEE31720: mca_pml_ob1_recv (pml_ob1_irecv.c:134)
==5698==    by 0xF265303: mca_coll_basic_reduce_scatter_intra (coll_basic_reduce_scatter.c:303)
==5698==    by 0x4EE068C: PMPI_Reduce_scatter (preduce_scatter.c:131)
==5698==    by 0x401B7B: main (osu_reduce_scatter.c:127)
==5698==  Address 0x75cc554 is 0 bytes after a block of size 21,844 alloc'd
==5698==    at 0x4C29C5C: memalign (vg_replace_malloc.c:857)
==5698==    by 0x4C29D21: posix_memalign (vg_replace_malloc.c:1020)
==5698==    by 0x403D31: allocate_buffer (osu_coll.c:808)
==5698==    by 0x401997: main (osu_reduce_scatter.c:88)

This clearly indicates a buffer overflow, which can produce unexpected results (such as the crash in free() in Emmanuel's environment).

After a bit of digging, I found that the bug is in the benchmark itself (!)

Emmanuel, can you please give the attached patch a try?

Cheers,

Gilles

On 3/29/2017 7:13 AM, George Bosilca wrote:

Emmanuel,

I tried with both 2.x and master (they are only syntactically different with regard to reduce_scatter) and I cannot reproduce your issue. I ran the OSU test with the following command line:

mpirun -n 97 --mca coll basic,libnbc,self --mca pml ob1 ./osu_reduce_scatter -m 524288: -i 1 -x 0

I used the IB, TCP, vader, and self BTLs.

George.

On Tue, Mar 28, 2017 at 6:21 AM, Howard Pritchard <hpprit...@gmail.com> wrote:

Hello Emmanuel,

Which version of Open MPI are you using?

Howard

2017-03-28 3:38 GMT-06:00 BRELLE, EMMANUEL <emmanuel.bre...@atos.net>:

Hi,

We are working on a portals4 component and we have found a bug (causing a segmentation fault) which must be related to the coll/basic component. Due to a lack of time we cannot investigate further, but this seems to be caused by a "free(disps);" call (around line 300 in coll_basic_reduce_scatter.c) in some specific situations. In our case it happens in osu_reduce_scatter (from the OSU micro-benchmarks) with at least 97 processes for message sizes larger than 512 KB.

Steps to reproduce:

export OMPI_MCA_mtl=^portals4
export OMPI_MCA_btl=^portals4
export OMPI_MCA_coll=basic,libnbc,self,tuned
export OMPI_MCA_osc=^portals4
export OMPI_MCA_pml=ob1
mpirun -n 97 osu_reduce_scatter -m 524288:

(reducing the number of iterations with -i 1 -x 0 should still reproduce the bug)

Our git branch is based on the v2.x branch and the files differ almost only in the portals4 parts. Could someone confirm this bug?
Emmanuel BRELLE
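For anyone who runs into the same valgrind report: below is a minimal, self-contained sketch of the allocation pattern that produces exactly this kind of "invalid write ... 0 bytes after a block" in MPI_Reduce_scatter. The code is hypothetical and illustrative only, not the OSU source (the actual benchmark fix is in the patch Gilles attached). Each rank must provide a receive buffer of at least recvcounts[rank] elements, so a buffer sized as total/nprocs elements is one element short on the rank that absorbs the remainder; with 3 ranks and 65536 bytes of floats, that short buffer is (16384/3)*4 = 21,844 bytes, matching the block size in the trace above.

/* Hypothetical sketch: an under-sized receive buffer in MPI_Reduce_scatter.
 * Names and sizes are illustrative, chosen to match the valgrind run above
 * (np=3, 65536 bytes of floats); this is not the OSU benchmark code. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int total = 65536 / (int)sizeof(float);      /* 16384 elements */

    /* recvcounts must sum to 'total'; give the remainder to the last rank */
    int *recvcounts = malloc(nprocs * sizeof(int));
    for (int i = 0; i < nprocs; i++)
        recvcounts[i] = total / nprocs;
    recvcounts[nprocs - 1] += total % nprocs;

    float *sendbuf = malloc(total * sizeof(float));
    for (int i = 0; i < total; i++)
        sendbuf[i] = 1.0f;

    /* BUGGY pattern: total/nprocs elements (21,844 bytes with np=3) is one
     * element short on the rank whose recvcount includes the remainder, so
     * the unpack in the PML writes past the end of the block:
     *   float *recvbuf = malloc((total / nprocs) * sizeof(float));
     * Correct pattern: size the buffer from this rank's own recvcount. */
    float *recvbuf = malloc(recvcounts[rank] * sizeof(float));

    MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_FLOAT,
                       MPI_SUM, MPI_COMM_WORLD);

    free(recvbuf);
    free(sendbuf);
    free(recvcounts);
    MPI_Finalize();
    return 0;
}

With the buggy allocation swapped in, running this under valgrind with the mpirun line from Gilles's mail should show the same out-of-bounds write; with the per-rank size it should run clean.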
_______________________________________________ devel mailing list devel@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/devel