Re: [OMPI devel] orterun busted

2017-06-23 Thread r...@open-mpi.org
Odd - I guess my machine is just consistently lucky, as was the CI's when this 
went through. The problem field is actually stale - we haven't used it in 
years - so I simply removed it from orte_process_info.

https://github.com/open-mpi/ompi/pull/3741

Should fix the problem.

> On Jun 23, 2017, at 3:38 AM, George Bosilca wrote:
> 
> Ralph,
> 
> I get consistent segfaults during the infrastructure teardown in orterun 
> (I noticed them on OS X). After digging a little, it turns out that the 
> opal_buffer_t class has already been cleaned up in orte_finalize before 
> orte_proc_info_finalize is called, so the destructor calls end up touching 
> memory with arbitrary contents. If I reorder the teardown to move 
> orte_proc_info_finalize before orte_finalize, things work better, but I still 
> get a very annoying warning about a "Bad file descriptor in select".
> 
> Any better fix?
> 
> George.
> 
> PS: Here is the patch I am currently using to get rid of the segfaults:
> 
> diff --git a/orte/tools/orterun/orterun.c b/orte/tools/orterun/orterun.c
> index 85aba0a0f3..506b931d35 100644
> --- a/orte/tools/orterun/orterun.c
> +++ b/orte/tools/orterun/orterun.c
> @@ -222,10 +222,10 @@ int orterun(int argc, char *argv[])
>   DONE:
>  /* cleanup and leave */
>  orte_submit_finalize();
> -orte_finalize();
> -orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);
>  /* cleanup the process info */
>  orte_proc_info_finalize();
> +orte_finalize();
> +orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);
> 
>  if (orte_debug_flag) {
>  fprintf(stderr, "exiting with status %d\n", orte_exit_status);
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


[OMPI devel] orterun busted

2017-06-23 Thread George Bosilca
Ralph,

I get consistent segfaults during the infrastructure teardown in orterun
(I noticed them on OS X). After digging a little, it turns out that the
opal_buffer_t class has already been cleaned up in orte_finalize before
orte_proc_info_finalize is called, so the destructor calls end up touching
memory with arbitrary contents. If I reorder the teardown to move
orte_proc_info_finalize before orte_finalize, things work better, but I
still get a very annoying warning about a "Bad file descriptor in select".

Any better fix?

George.

PS: Here is the patch I am currently using to get rid of the segfaults:

diff --git a/orte/tools/orterun/orterun.c b/orte/tools/orterun/orterun.c
index 85aba0a0f3..506b931d35 100644
--- a/orte/tools/orterun/orterun.c
+++ b/orte/tools/orterun/orterun.c
@@ -222,10 +222,10 @@ int orterun(int argc, char *argv[])
  DONE:
 /* cleanup and leave */
 orte_submit_finalize();
-orte_finalize();
-orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);
 /* cleanup the process info */
 orte_proc_info_finalize();
+orte_finalize();
+orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);

 if (orte_debug_flag) {
 fprintf(stderr, "exiting with status %d\n", orte_exit_status);
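
For anyone who wants to see the ordering problem in isolation, here is a minimal standalone sketch. It is not the actual OPAL/ORTE code: every name in it (sketch_class_t, sketch_buffer_t, runtime_init, run_teardown, and so on) is hypothetical, and it assumes a class descriptor that owns a heap-allocated destructor table which the runtime teardown frees. An object that outlives that teardown has nothing valid left to destruct it with. The sketch adds a guard so the bad order returns an error instead of crashing.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical class descriptor: owns a NULL-terminated destructor table. */
typedef struct {
    bool initialized;
    void (**destructors)(void *);
} sketch_class_t;

/* Hypothetical buffer object; stands in for a field kept in the process info. */
typedef struct {
    sketch_class_t *cls;
    char *payload;
} sketch_buffer_t;

static sketch_class_t buffer_class;
static sketch_buffer_t proc_info_buffer;

static void buffer_destruct(void *obj)
{
    sketch_buffer_t *buf = obj;
    free(buf->payload);
    buf->payload = NULL;
}

/* Runtime setup: allocate the destructor table for the class. */
static void runtime_init(void)
{
    buffer_class.destructors = malloc(2 * sizeof *buffer_class.destructors);
    if (NULL == buffer_class.destructors) {
        abort();
    }
    buffer_class.destructors[0] = buffer_destruct;
    buffer_class.destructors[1] = NULL;
    buffer_class.initialized = true;
}

/* Runtime teardown: the class and its destructor table go away. */
static void runtime_finalize(void)
{
    free(buffer_class.destructors);
    buffer_class.destructors = NULL;
    buffer_class.initialized = false;
}

static void proc_info_init(void)
{
    proc_info_buffer.cls = &buffer_class;
    proc_info_buffer.payload = malloc(16);
}

/* Returns 0 if the destructors ran, -1 if the class was already torn down. */
static int proc_info_finalize(void)
{
    sketch_class_t *cls = proc_info_buffer.cls;

    if (!cls->initialized) {
        /* Unguarded code would walk a freed table here. The payload is
         * released directly only so that this sketch does not leak. */
        free(proc_info_buffer.payload);
        proc_info_buffer.payload = NULL;
        return -1;
    }
    for (size_t i = 0; NULL != cls->destructors[i]; ++i) {
        cls->destructors[i](&proc_info_buffer);
    }
    return 0;
}

/* Set everything up, then tear it down in the requested order. */
int run_teardown(bool proc_info_first)
{
    int rc;

    runtime_init();
    proc_info_init();
    if (proc_info_first) {
        rc = proc_info_finalize();   /* order used in the patch above */
        runtime_finalize();
    } else {
        runtime_finalize();          /* order before the patch */
        rc = proc_info_finalize();
    }
    return rc;
}
```

In this sketch, run_teardown(true) returns 0 and run_teardown(false) returns -1. Without the guard, the second order would read a freed destructor table, which is undefined behavior and may or may not crash depending on what the allocator has done with that memory.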