Thank you!
The patch indeed fixes my problem.
You saved me a lot of time.

Emmanuel

From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gilles Gouaillardet
Sent: Wednesday, March 29, 2017 7:24 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] Segfault during a free in reduce_scatter using basic component


Folks,



mpirun -np 3 --oversubscribe --mca coll basic,libnbc,self --mca pml ob1 --mca btl tcp,self valgrind ./osu_reduce_scatter -m 65536:65536 -i 1 -x 0

issues some warnings such as:

==5698== Invalid write of size 2
==5698==    at 0x4C2C0D3: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1018)
==5698==    by 0x633BD0F: opal_convertor_unpack (opal_convertor.c:318)
==5698==    by 0xEE400A4: mca_pml_ob1_recv_request_progress_match (pml_ob1_recvreq.c:865)
==5698==    by 0xEE40E2F: mca_pml_ob1_recv_req_start (pml_ob1_recvreq.c:1237)
==5698==    by 0xEE31720: mca_pml_ob1_recv (pml_ob1_irecv.c:134)
==5698==    by 0xF265303: mca_coll_basic_reduce_scatter_intra (coll_basic_reduce_scatter.c:303)
==5698==    by 0x4EE068C: PMPI_Reduce_scatter (preduce_scatter.c:131)
==5698==    by 0x401B7B: main (osu_reduce_scatter.c:127)
==5698==  Address 0x75cc554 is 0 bytes after a block of size 21,844 alloc'd
==5698==    at 0x4C29C5C: memalign (vg_replace_malloc.c:857)
==5698==    by 0x4C29D21: posix_memalign (vg_replace_malloc.c:1020)
==5698==    by 0x403D31: allocate_buffer (osu_coll.c:808)
==5698==    by 0x401997: main (osu_reduce_scatter.c:88)

This clearly indicates a buffer overflow, which can produce unexpected results (such as the crash in free() in Emmanuel's environment).
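As a side note, here is a tiny hypothetical example (my own illustration, not Open MPI or OSU code) of why such heap corruption typically only shows up later, inside a free() of a perfectly valid, unrelated pointer:

/* hypothetical illustration: the overflow corrupts the allocator's
   bookkeeping for a neighbouring block, so the failure surfaces later,
   in free(), far away from the real defect */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *recvbuf = malloc(32);   /* stands in for the receive buffer */
    char *disps   = malloc(32);   /* stands in for the disps array    */

    memset(recvbuf, 0, 40);       /* bug: writes 8 bytes past the end of recvbuf */

    free(recvbuf);                /* glibc may abort or segfault in one of these */
    free(disps);                  /* two calls, depending on the heap layout     */
    return 0;
}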

After a bit of digging, I found that the bug is in the benchmark itself (!).
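For anyone following the trace, here is a minimal, self-contained sketch (again my own illustration, not the OSU benchmark source and not the attached patch) of how such a sizing mismatch can arise: MPI_Reduce_scatter delivers recvcounts[rank] elements to each rank, so the receive buffer has to be sized from this rank's recvcount (which may include a remainder element), not simply from total / numprocs.

/* illustrative sketch only; names and sizes are made up */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    int total = 16384;                    /* total number of floats to reduce */
    int *recvcounts = malloc(numprocs * sizeof(int));
    for (int i = 0; i < numprocs; i++) {
        recvcounts[i] = total / numprocs;
        if (i < total % numprocs)         /* remainder goes to the first ranks */
            recvcounts[i]++;
    }

    float *sendbuf = malloc(total * sizeof(float));
    for (int i = 0; i < total; i++)
        sendbuf[i] = 1.0f;

    /* sizing the receive buffer as (total / numprocs) would overflow on the
       ranks that get a remainder element; it must follow recvcounts[rank] */
    float *recvbuf = malloc(recvcounts[rank] * sizeof(float));

    MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_FLOAT,
                       MPI_SUM, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    free(recvcounts);
    MPI_Finalize();
    return 0;
}

Whether that is exactly what the attached patch changes is for the diff to say; the point is simply that the allocation and the recvcounts array have to agree.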


Emmanuel,

Can you please give the attached patch a try?


Cheers,

Gilles
On 3/29/2017 7:13 AM, George Bosilca wrote:
Emmanuel,

I tried with both 2.x and master (they are only syntactically different with regard to reduce_scatter) and I can't reproduce your issue. I ran the OSU test with the following command line:

mpirun -n 97 --mca coll basic,libnbc,self --mca pml ob1 ./osu_reduce_scatter -m 524288: -i 1 -x 0

I used the IB, TCP, vader and self BTLs.

  George.





On Tue, Mar 28, 2017 at 6:21 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
Hello Emmanuel,

Which version of Open MPI are you using?

Howard


2017-03-28 3:38 GMT-06:00 BRELLE, EMMANUEL <emmanuel.bre...@atos.net>:
Hi,

We are working on a portals4 component and we have found a bug (causing a segmentation fault) which must be related to the coll/basic component. Due to a lack of time we cannot investigate further, but it seems to be caused by a "free(disps);" (around line 300 in coll_basic_reduce_scatter) in some specific situations. In our case it happens in osu_reduce_scatter (from the OSU micro-benchmarks) with at least 97 procs, for sizes bigger than 512 KB.

Steps to reproduce:
export OMPI_MCA_mtl=^portals4
export OMPI_MCA_btl=^portals4
export OMPI_MCA_coll=basic,libnbc,self,tuned
export OMPI_MCA_osc=^portals4
export OMPI_MCA_pml=ob1
mpirun -n 97 osu_reduce_scatter -m 524288:

(Reducing the number of iterations with -i 1 -x 0 should still reproduce the bug.)
Our git branch is based on the v2.x branch, and the files differ almost only in the portals4 parts.

Could someone confirm this bug?

Emmanuel BRELLE




_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
