I was able to reproduce it (with the correct version of OMPI, i.e. the v2.x
branch). The problem seems to be that we are missing part of commit
fe68f230991, which removes a free() of a statically allocated array.
Here is the corresponding patch:

diff --git a/ompi/errhandler/errhandler_predefined.c b/ompi/errhandler/errhandler_predefined.c
index 4d50611c12..54ac63553c 100644
--- a/ompi/errhandler/errhandler_predefined.c
+++ b/ompi/errhandler/errhandler_predefined.c
@@ -15,6 +15,7 @@
  * Copyright (c) 2010-2011 Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2012      Los Alamos National Security, LLC.
  *                         All rights reserved.
+ * Copyright (c) 2016      Intel, Inc.  All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -181,6 +182,7 @@ static void backend_fatal_aggregate(char *type,
     const char* const unknown_error_code = "Error code: %d (no associated error message)";
     const char* const unknown_error = "Unknown error";
     const char* const unknown_prefix = "[?:?]";
+    bool generated = false;

     // these do not own what they point to; they're
     // here to avoid repeating expressions such as
@@ -211,6 +213,8 @@ static void backend_fatal_aggregate(char *type,
                 err_msg = NULL;
                 opal_output(0, "%s", "Could not write to err_msg");
                 opal_output(0, unknown_error_code, *error_code);
+            } else {
+                generated = true;
             }
         }
     }
@@ -256,7 +260,9 @@ static void backend_fatal_aggregate(char *type,
     }

     free(prefix);
-    free(err_msg);
+    if (generated) {
+        free(err_msg);
+    }
 }

 /*
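
The ownership pattern the patch relies on can be sketched in isolation. This
is a minimal, hypothetical example and not Open MPI code: the helper
format_error() and its parameters are assumptions made for illustration. The
message pointer may refer either to static storage or to a heap buffer, and a
flag records which, so that free() is only called on memory that was actually
allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of Open MPI): writes a message for `code`
 * into `out` and returns true if a heap buffer had to be generated. */
static bool format_error(const char *known_msg, int code,
                         char *out, size_t out_len)
{
    static const char fmt[] = "Error code: %d (no associated error message)";
    const char *err_msg = known_msg;  /* may point at static storage */
    char *buf = NULL;                 /* heap buffer, if we create one */
    bool generated = false;

    assert(NULL != out && out_len > 0);

    if (NULL == err_msg) {
        /* No predefined message: try to build one on the heap. */
        int len = snprintf(NULL, 0, fmt, code);
        if (len >= 0) {
            buf = malloc((size_t)len + 1);
        }
        if (NULL != buf) {
            snprintf(buf, (size_t)len + 1, fmt, code);
            err_msg = buf;
            generated = true;
        } else {
            err_msg = "Unknown error";  /* static fallback, never freed */
        }
    }

    snprintf(out, out_len, "%s", err_msg);

    /* Free only what this function allocated; calling free() on a
     * string literal or static array is undefined behavior. */
    if (generated) {
        free(buf);
    }
    return generated;
}
```

Calling format_error("MPI_ERR_COUNT: invalid count argument", 2, buf, sizeof buf)
takes the static path and frees nothing, while passing NULL as the first
argument takes the generated path and releases the buffer before returning.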

  George.



On Thu, May 4, 2017 at 10:03 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Can you get a stack trace?
>
> > On May 4, 2017, at 6:44 PM, Dahai Guo <dahai....@gmail.com> wrote:
> >
> > Hi, George:
> >
> > Attached is the ompi_info output. I built it on the Power8 architecture.
> > The configure line is also simple:
> >
> > ../configure --prefix=${installdir} \
> > --enable-orterun-prefix-by-default
> >
> > Dahai
> >
> > On Thu, May 4, 2017 at 4:45 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> > Dahai,
> >
> > You are right, the segfault is unexpected. I can't replicate this on my
> > Mac. On what architecture are you seeing this issue? How was your OMPI
> > compiled?
> >
> > Please post the output of ompi_info.
> >
> > Thanks,
> > George.
> >
> >
> >
> > On Thu, May 4, 2017 at 5:42 PM, Dahai Guo <dahai....@gmail.com> wrote:
> > Those messages are what I expect to see. However, there are other error
> > messages and a core dump that I did not expect, as attached in my previous
> > email. I think something might be wrong with the errhandler in Open MPI.
> > A similar thing happened with Bcast, etc.
> >
> > Dahai
> >
> > On Thu, May 4, 2017 at 4:32 PM, Nathan Hjelm <hje...@me.com> wrote:
> > By default MPI errors are fatal and abort. The error message says it all:
> >
> > *** An error occurred in MPI_Reduce
> > *** reported by process [3645440001,0]
> > *** on communicator MPI_COMM_WORLD
> > *** MPI_ERR_COUNT: invalid count argument
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> >
> > If you want different behavior you have to change the default error
> > handler on the communicator using MPI_Comm_set_errhandler. You can set it
> > to MPI_ERRORS_RETURN and check the error code or you can create your own
> > function. See MPI 3.1 Chapter 8.
> >
> > -Nathan
> >
> > On May 04, 2017, at 02:58 PM, Dahai Guo <dahai....@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Using Open MPI 2.1, the following code resulted in a core dump,
> >> although only a simple error message was expected. Any idea what is
> >> wrong? It seems to be related to the errhandler somewhere.
> >>
> >>
> >> D.G.
> >>
> >>
> >>  *** An error occurred in MPI_Reduce
> >>  *** reported by process [3645440001,0]
> >>  *** on communicator MPI_COMM_WORLD
> >>  *** MPI_ERR_COUNT: invalid count argument
> >>  *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> >>  ***    and potentially your MPI job)
> >> ......
> >>
> >> [1,1]<stderr>:1000151c0000-1000151e0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:1000151e0000-100015250000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015250000-100015270000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015270000-1000152e0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:1000152e0000-100015300000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015300000-100015510000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015510000-100015530000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015530000-100015740000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015740000-100015760000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015760000-100015970000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015970000-100015990000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015990000-100015ba0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015ba0000-100015bc0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015bc0000-100015dd0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015dd0000-100015df0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015df0000-100016000000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016000000-100016020000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016020000-100016230000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016230000-100016250000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016250000-100016460000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016460000-100016470000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:3fffd4630000-3fffd46c0000 rw-p 00000000 00:00 0                 [stack]
> >> --------------------------------------------------------------------------
> >>
> >> #include <stdlib.h>
> >> #include <stdio.h>
> >> #include <mpi.h>
> >> int main(int argc, char** argv)
> >> {
> >>
> >>     int r[1], s[1];
> >>     MPI_Init(&argc,&argv);
> >>
> >>     s[0] = 1;
> >>     r[0] = -1;
> >>     MPI_Reduce(s,r,-1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
> >>     printf("%d\n",r[0]);
> >>     MPI_Finalize();
> >> }
> >>
> >> _______________________________________________
> >> devel mailing list
> >> devel@lists.open-mpi.org
> >> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> >
> > <opmi_info.txt>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>