Folks,
mpirun -np 3 --oversubscribe --mca coll basic,libnbc,self --mca pml ob1
--mca btl tcp,self valgrind ./osu_reduce_scatter -m 65536:65536 -i 1 -x 0
issues some warnings such as
==5698== Invalid write of size 2
==5698== at 0x4C2C0D3: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1018)
==5698== by 0x633BD0F: opal_convertor_unpack (opal_convertor.c:318)
==5698== by 0xEE400A4: mca_pml_ob1_recv_request_progress_match
(pml_ob1_recvreq.c:865)
==5698== by 0xEE40E2F: mca_pml_ob1_recv_req_start
(pml_ob1_recvreq.c:1237)
==5698== by 0xEE31720: mca_pml_ob1_recv (pml_ob1_irecv.c:134)
==5698== by 0xF265303: mca_coll_basic_reduce_scatter_intra
(coll_basic_reduce_scatter.c:303)
==5698== by 0x4EE068C: PMPI_Reduce_scatter (preduce_scatter.c:131)
==5698== by 0x401B7B: main (osu_reduce_scatter.c:127)
==5698== Address 0x75cc554 is 0 bytes after a block of size 21,844 alloc'd
==5698== at 0x4C29C5C: memalign (vg_replace_malloc.c:857)
==5698== by 0x4C29D21: posix_memalign (vg_replace_malloc.c:1020)
==5698== by 0x403D31: allocate_buffer (osu_coll.c:808)
==5698== by 0x401997: main (osu_reduce_scatter.c:88)
This clearly indicates a buffer overflow, which can produce unexpected
results (such as the crash in free() seen in Emmanuel's environment).
After a bit of digging, I found that the bug is in the benchmark itself (!)
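For illustration only, here is a small standalone program (my own sketch, not
the benchmark code; it just repeats the same integer arithmetic) for the
3-process / 65536-byte case from the valgrind trace above:

#include <stdio.h>

int main(void)
{
    size_t max_message_size = 65536;  /* -m 65536:65536 */
    size_t numprocs         = 3;      /* mpirun -np 3   */

    /* original formula: rounds the byte count *down* to a multiple of sizeof(float) */
    size_t old_bufsize = sizeof(float) *
        ((max_message_size / numprocs + 1) / sizeof(float));        /* 21844 */

    /* patched formula: rounds the element count down, then adds one extra float */
    size_t new_bufsize = sizeof(float) *
        (max_message_size / numprocs / sizeof(float) + 1);          /* 21848 */

    /* assuming the remainder elements go to the lowest ranks (which is what the
     * overflow reported by valgrind suggests), a rank can receive up to
     * ceil(total_floats / numprocs) floats */
    size_t total_floats   = max_message_size / sizeof(float);       /* 16384 */
    size_t max_recv_bytes =
        ((total_floats + numprocs - 1) / numprocs) * sizeof(float); /* 21848 */

    printf("old bufsize     = %zu bytes\n", old_bufsize);
    printf("new bufsize     = %zu bytes\n", new_bufsize);
    printf("worst-case recv = %zu bytes\n", max_recv_bytes);
    return 0;
}

The old formula allocates 21,844 bytes (exactly the block size valgrind
reports), while a rank may receive up to 21,848 bytes, hence the write past
the end of the block.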
Emmanuel,
can you please give the attached patch a try?
Cheers,
Gilles
On 3/29/2017 7:13 AM, George Bosilca wrote:
Emmanuel,
I tried with both 2.x and master (they are only syntactically
different with regard to reduce_scatter) and I can't reproduce your
issue. I ran the OSU test with the following command line:
mpirun -n 97 --mca coll basic,libnbc,self --mca pml ob1
./osu_reduce_scatter -m 524288: -i 1 -x 0
I used the IB, TCP, vader and self BTLs.
George.
On Tue, Mar 28, 2017 at 6:21 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
Hello Emmanuel,
Which version of Open MPI are you using?
Howard
2017-03-28 3:38 GMT-06:00 BRELLE, EMMANUEL <emmanuel.bre...@atos.net>:
Hi,
We are working on a portals4 component and we have found a
bug (causing a segmentation fault) which must be related to
the coll/basic component.
Due to a lack of time, we cannot investigate further, but this
seems to be caused by a "free(disps);" (around line 300 in
coll_basic_reduce_scatter) in some specific situations. In our
case it happens on an osu_reduce_scatter run (from the OSU
microbenchmarks) with at least 97 procs for sizes bigger than
512 KB.
Steps to reproduce:
export OMPI_MCA_mtl=^portals4
export OMPI_MCA_btl=^portals4
export OMPI_MCA_coll=basic,libnbc,self,tuned
export OMPI_MCA_osc=^portals4
export OMPI_MCA_pml=ob1
mpirun -n 97 osu_reduce_scatter -m 524288:
(reducing the number of iterations with -i 1 -x 0 should still
reproduce the bug)
Our git branch is based on the v2.x branch and the files
differ almost exclusively in the portals4 parts.
Could someone confirm this bug?
Emmanuel BRELLE
diff -ruN orig/osu-micro-benchmarks-5.3.2/mpi/collective/osu_reduce_scatter.c osu-micro-benchmarks-5.3.2/mpi/collective/osu_reduce_scatter.c
--- orig/osu-micro-benchmarks-5.3.2/mpi/collective/osu_reduce_scatter.c  2016-09-09 03:39:29.000000000 +0900
+++ osu-micro-benchmarks-5.3.2/mpi/collective/osu_reduce_scatter.c  2017-03-29 14:12:04.155264761 +0900
@@ -84,7 +84,7 @@
     }
     set_buffer(sendbuf, options.accel, 1, bufsize);
-    bufsize = sizeof(float)*((options.max_message_size/numprocs + 1)/sizeof(float));
+    bufsize = sizeof(float)*(options.max_message_size/numprocs/sizeof(float)+1);
     if (allocate_buffer((void**)&recvbuf, bufsize, options.accel)) {
         fprintf(stderr, "Could Not Allocate Memory [rank %d]\n", rank);
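As a quick sanity check (my arithmetic, not part of the patch), the two
expressions evaluated for Emmanuel's failing case (97 ranks, 524288-byte
messages, 4-byte floats) give:

    old: sizeof(float)*((524288/97 + 1)/sizeof(float)) = 4*((5405+1)/4) = 5404 bytes (1351 floats)
    new: sizeof(float)*(524288/97/sizeof(float) + 1)   = 4*(5405/4 + 1)  = 5408 bytes (1352 floats)

524288/4 = 131072 floats spread over 97 ranks means some rank can receive
ceil(131072/97) = 1352 floats (assuming the remainder goes to the lower ranks,
as in the 3-rank trace above), so the old buffer is one float short, which is
consistent with the small out-of-bounds write valgrind flags.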
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel