I was able to reproduce it (with the correct version of OMPI, i.e., the v2.x branch). The problem seems to be that we are missing part of commit fe68f230991, which removes a free() of a statically allocated array. Below is a minimal sketch of that failure pattern, followed by the corresponding patch.
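The sketch (illustration only, not OMPI code) shows the pattern involved: a message pointer may refer either to heap memory from asprintf() or to static storage, and it must only be free()d in the first case, which is what the "generated" flag in the patch tracks. The variable names and the fallback string below are made up for the example.

/* free_static_sketch.c -- illustration only, not OMPI code. */
#define _GNU_SOURCE
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *const unknown_error = "Unknown error";  /* static storage */
    char *err_msg = NULL;
    bool generated = false;
    int error_code = 2;                  /* made-up error code for the example */

    if (asprintf(&err_msg, "Error code: %d", error_code) < 0) {
        err_msg = (char *) unknown_error;  /* fall back to the static string */
    } else {
        generated = true;                  /* heap-allocated, safe to free */
    }

    fprintf(stderr, "%s\n", err_msg);

    if (generated) {   /* free()ing the static string would crash or corrupt the heap */
        free(err_msg);
    }
    return 0;
}

The patch below adds the same guard so that the free(err_msg) at the end of backend_fatal_aggregate() only runs on memory the function actually allocated: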
diff --git a/ompi/errhandler/errhandler_predefined.c b/ompi/errhandler/errhandler_predefined.c
index 4d50611c12..54ac63553c 100644
--- a/ompi/errhandler/errhandler_predefined.c
+++ b/ompi/errhandler/errhandler_predefined.c
@@ -15,6 +15,7 @@
  * Copyright (c) 2010-2011 Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2012      Los Alamos National Security, LLC.
  *                         All rights reserved.
+ * Copyright (c) 2016      Intel, Inc.  All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -181,6 +182,7 @@ static void backend_fatal_aggregate(char *type,
     const char* const unknown_error_code = "Error code: %d (no associated error message)";
     const char* const unknown_error = "Unknown error";
     const char* const unknown_prefix = "[?:?]";
+    bool generated = false;

     // these do not own what they point to; they're
     // here to avoid repeating expressions such as
@@ -211,6 +213,8 @@ static void backend_fatal_aggregate(char *type,
                 err_msg = NULL;
                 opal_output(0, "%s", "Could not write to err_msg");
                 opal_output(0, unknown_error_code, *error_code);
+            } else {
+                generated = true;
             }
         }
     }
@@ -256,7 +260,9 @@ static void backend_fatal_aggregate(char *type,
     }

     free(prefix);
-    free(err_msg);
+    if (generated) {
+        free(err_msg);
+    }
 }

 /*

George.

On Thu, May 4, 2017 at 10:03 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Can you get a stack trace?
>
> > On May 4, 2017, at 6:44 PM, Dahai Guo <dahai....@gmail.com> wrote:
> >
> > Hi, George:
> >
> > Attached is the ompi_info output. I built it on the Power8 architecture. The configure line is also simple:
> >
> > ../configure --prefix=${installdir} \
> >              --enable-orterun-prefix-by-default
> >
> > Dahai
> >
> > On Thu, May 4, 2017 at 4:45 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> > Dahai,
> >
> > You are right, the segfault is unexpected. I can't replicate this on my Mac. What architecture are you seeing this issue on? How was your OMPI compiled?
> >
> > Please post the output of ompi_info.
> >
> > Thanks,
> > George.
> >
> > On Thu, May 4, 2017 at 5:42 PM, Dahai Guo <dahai....@gmail.com> wrote:
> > Those messages are what I would like to see. But there are some other error messages and a core dump that I don't like, as attached in my previous email. I think something might be wrong with the errhandler in Open MPI. A similar thing happened for Bcast, etc.
> >
> > Dahai
> >
> > On Thu, May 4, 2017 at 4:32 PM, Nathan Hjelm <hje...@me.com> wrote:
> > By default MPI errors are fatal and abort. The error message says it all:
> >
> > *** An error occurred in MPI_Reduce
> > *** reported by process [3645440001,0]
> > *** on communicator MPI_COMM_WORLD
> > *** MPI_ERR_COUNT: invalid count argument
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> >
> > If you want different behavior, you have to change the default error handler on the communicator using MPI_Comm_set_errhandler. You can set it to MPI_ERRORS_RETURN and check the error code, or you can create your own function. See MPI 3.1 Chapter 8.
> >
> > -Nathan
> >
> > On May 04, 2017, at 02:58 PM, Dahai Guo <dahai....@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Using Open MPI 2.1, the following code resulted in a core dump, although only a simple error message was expected. Any idea what is wrong? It seems related to the errhandler somewhere.
> >>
> >> D.G.
> >>
> >>
> >> *** An error occurred in MPI_Reduce
> >> *** reported by process [3645440001,0]
> >> *** on communicator MPI_COMM_WORLD
> >> *** MPI_ERR_COUNT: invalid count argument
> >> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> >> *** and potentially your MPI job)
> >> ......
> >>
> >> [1,1]<stderr>:1000151c0000-1000151e0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:1000151e0000-100015250000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015250000-100015270000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015270000-1000152e0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:1000152e0000-100015300000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015300000-100015510000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015510000-100015530000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015530000-100015740000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015740000-100015760000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015760000-100015970000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015970000-100015990000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015990000-100015ba0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015ba0000-100015bc0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015bc0000-100015dd0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015dd0000-100015df0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015df0000-100016000000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016000000-100016020000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016020000-100016230000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016230000-100016250000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016250000-100016460000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016460000-100016470000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:3fffd4630000-3fffd46c0000 rw-p 00000000 00:00 0       [stack]
> >> --------------------------------------------------------------------------
> >>
> >> #include <stdlib.h>
> >> #include <stdio.h>
> >> #include <mpi.h>
> >>
> >> int main(int argc, char** argv)
> >> {
> >>     int r[1], s[1];
> >>
> >>     MPI_Init(&argc, &argv);
> >>
> >>     s[0] = 1;
> >>     r[0] = -1;
> >>     MPI_Reduce(s, r, -1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
> >>     printf("%d\n", r[0]);
> >>     MPI_Finalize();
> >> }
> >
> > <opmi_info.txt>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
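For completeness, here is a minimal sketch (illustration only, not code from the thread) of the MPI_ERRORS_RETURN approach Nathan describes above, applied to the reproducer: switch MPI_COMM_WORLD away from the default MPI_ERRORS_ARE_FATAL handler and check the return code of the failing call yourself.

/* errs_return_sketch.c -- illustration only */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int s = 1, r = -1;
    char msg[MPI_MAX_ERROR_STRING];
    int msglen, rc;

    MPI_Init(&argc, &argv);

    /* Errors on this communicator now return an error code instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* The invalid count (-1) from the reproducer should now yield MPI_ERR_COUNT. */
    rc = MPI_Reduce(&s, &r, -1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &msglen);
        fprintf(stderr, "MPI_Reduce failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}

The "create your own function" alternative Nathan mentions would go through MPI_Comm_create_errhandler plus MPI_Comm_set_errhandler instead; see MPI 3.1 Chapter 8.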
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel