Thanks, George. It works! In addition, the code below (count = 0, and a second call with NULL buffers) also causes a problem. The check for count == 0 should be moved to the beginning of ompi/mpi/c/reduce.c and ireduce.c, or the issue should be fixed in some other way. A rough sketch of the kind of early check I mean is below; the reproducer follows it.
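Something along these lines, as a minimal sketch of the idea. This is illustrative only: it is not the actual parameter-checking code in reduce.c, the wrapper name is invented, and the real change would sit inside the MPI_Reduce/MPI_Ireduce entry points before any buffer or datatype validation.

#include <mpi.h>

/* Sketch only: treat count == 0 as a no-op before any buffer or
   datatype validation runs, so that NULL buffers combined with a
   zero count can no longer trigger an error path or a crash. */
int reduce_zero_count_first(const void *sendbuf, void *recvbuf, int count,
                            MPI_Datatype datatype, MPI_Op op, int root,
                            MPI_Comm comm)
{
    if (0 == count) {
        /* Nothing to reduce: succeed without touching the buffers. */
        return MPI_SUCCESS;
    }
    /* The existing argument checks and the call into the collective
       framework would follow here; this sketch just forwards the call. */
    return MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm);
}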
The reproducer (count = 0; the second call also passes NULL buffers):

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int r[1], s[1];

    MPI_Init(&argc, &argv);

    s[0] = 1;
    r[0] = -1;
    MPI_Reduce(s, r, 0, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    printf("%d\n", r[0]);
    MPI_Reduce(NULL, NULL, 0, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
}

Dahai

On Thu, May 4, 2017 at 9:18 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> I was able to reproduce it (with the correct version of OMPI, aka the
> v2.x branch). The problem seems to be that we are lacking a part of the
> fe68f230991 commit, which removes a free on a statically allocated array.
> Here is the corresponding patch:
>
> diff --git a/ompi/errhandler/errhandler_predefined.c b/ompi/errhandler/errhandler_predefined.c
> index 4d50611c12..54ac63553c 100644
> --- a/ompi/errhandler/errhandler_predefined.c
> +++ b/ompi/errhandler/errhandler_predefined.c
> @@ -15,6 +15,7 @@
>   * Copyright (c) 2010-2011 Oak Ridge National Labs. All rights reserved.
>   * Copyright (c) 2012      Los Alamos National Security, LLC.
>   *                         All rights reserved.
> + * Copyright (c) 2016      Intel, Inc. All rights reserved.
>   * $COPYRIGHT$
>   *
>   * Additional copyrights may follow
> @@ -181,6 +182,7 @@ static void backend_fatal_aggregate(char *type,
>      const char* const unknown_error_code = "Error code: %d (no associated error message)";
>      const char* const unknown_error = "Unknown error";
>      const char* const unknown_prefix = "[?:?]";
> +    bool generated = false;
>
>      // these do not own what they point to; they're
>      // here to avoid repeating expressions such as
> @@ -211,6 +213,8 @@ static void backend_fatal_aggregate(char *type,
>                  err_msg = NULL;
>                  opal_output(0, "%s", "Could not write to err_msg");
>                  opal_output(0, unknown_error_code, *error_code);
> +            } else {
> +                generated = true;
>              }
>          }
>      }
> @@ -256,7 +260,9 @@ static void backend_fatal_aggregate(char *type,
>      }
>
>      free(prefix);
> -    free(err_msg);
> +    if (generated) {
> +        free(err_msg);
> +    }
>  }
>
> George.
>
> On Thu, May 4, 2017 at 10:03 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
>> Can you get a stack trace?
>>
>> > On May 4, 2017, at 6:44 PM, Dahai Guo <dahai....@gmail.com> wrote:
>> >
>> > Hi, George:
>> >
>> > Attached is the ompi_info output. I built it on the Power8 architecture. The configure line is also simple:
>> >
>> >     ../configure --prefix=${installdir} \
>> >         --enable-orterun-prefix-by-default
>> >
>> > Dahai
>> >
>> > On Thu, May 4, 2017 at 4:45 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> > Dahai,
>> >
>> > You are right, the segfault is unexpected. I can't replicate this on my
>> > Mac. What architecture are you seeing this issue on? How was your OMPI
>> > compiled?
>> >
>> > Please post the output of ompi_info.
>> >
>> > Thanks,
>> > George.
>> >
>> > On Thu, May 4, 2017 at 5:42 PM, Dahai Guo <dahai....@gmail.com> wrote:
>> > Those messages are what I would like to see. But there are some other
>> > error messages and a core dump that I don't like, as attached in my
>> > previous email. I think something might be wrong with the errhandler in
>> > Open MPI. A similar thing happens for Bcast, etc.
>> >
>> > Dahai
>> >
>> > On Thu, May 4, 2017 at 4:32 PM, Nathan Hjelm <hje...@me.com> wrote:
>> > By default MPI errors are fatal and abort.
>> > The error message says it all:
>> >
>> > *** An error occurred in MPI_Reduce
>> > *** reported by process [3645440001,0]
>> > *** on communicator MPI_COMM_WORLD
>> > *** MPI_ERR_COUNT: invalid count argument
>> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> > ***    and potentially your MPI job)
>> >
>> > If you want different behavior you have to change the default error
>> > handler on the communicator using MPI_Comm_set_errhandler. You can set it
>> > to MPI_ERRORS_RETURN and check the error code, or you can create your own
>> > function. See MPI 3.1 Chapter 8.
>> >
>> > -Nathan
>> >
>> > On May 04, 2017, at 02:58 PM, Dahai Guo <dahai....@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> Using Open MPI 2.1, the following code resulted in a core dump, although
>> >> only a simple error message was expected. Any idea what is wrong? It
>> >> seems related to the errhandler somewhere.
>> >>
>> >> D.G.
>> >>
>> >> *** An error occurred in MPI_Reduce
>> >> *** reported by process [3645440001,0]
>> >> *** on communicator MPI_COMM_WORLD
>> >> *** MPI_ERR_COUNT: invalid count argument
>> >> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> >> ***    and potentially your MPI job)
>> >> ......
>> >>
>> >> [1,1]<stderr>:1000151c0000-1000151e0000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:1000151e0000-100015250000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015250000-100015270000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015270000-1000152e0000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:1000152e0000-100015300000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015300000-100015510000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015510000-100015530000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015530000-100015740000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015740000-100015760000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015760000-100015970000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015970000-100015990000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015990000-100015ba0000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015ba0000-100015bc0000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015bc0000-100015dd0000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015dd0000-100015df0000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100015df0000-100016000000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100016000000-100016020000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100016020000-100016230000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100016230000-100016250000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100016250000-100016460000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:100016460000-100016470000 rw-p 00000000 00:00 0
>> >> [1,1]<stderr>:3fffd4630000-3fffd46c0000 rw-p 00000000 00:00 0 [stack]
>> >> --------------------------------------------------------------------------
>> >>
>> >> #include <stdlib.h>
>> >> #include <stdio.h>
>> >> #include <mpi.h>
>> >>
>> >> int main(int argc, char** argv)
>> >> {
>> >>     int r[1], s[1];
>> >>
>> >>     MPI_Init(&argc, &argv);
>> >>
>> >>     s[0] = 1;
>> >>     r[0] = -1;
>> >>     MPI_Reduce(s, r, -1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
>> >>     printf("%d\n", r[0]);
>> >>     MPI_Finalize();
>> >> }
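Coming back to Nathan's point about MPI_Comm_set_errhandler quoted above: a minimal sketch, assuming MPI_ERRORS_RETURN is set on MPI_COMM_WORLD, of checking the error code instead of aborting (my own illustration, not code from the thread):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int r[1], s[1], rc, msg_len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* Make errors on MPI_COMM_WORLD return instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    s[0] = 1;
    r[0] = -1;
    rc = MPI_Reduce(s, r, -1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &msg_len);
        printf("MPI_Reduce failed: %s\n", msg);  /* e.g. MPI_ERR_COUNT */
    }

    MPI_Finalize();
    return 0;
}

With the default MPI_ERRORS_ARE_FATAL handler the same call is expected to abort with a short error message; the core dump on top of that is the separate problem addressed by George's patch above.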
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel