Re: [OMPI devel] orterun busted

2017-06-23 Thread r...@open-mpi.org
Odd - I guess my machine is just consistently lucky, as was the CI’s when this 
went through. The problem field is actually stale - we haven’t used it in years - 
so I simply removed it from orte_process_info.

https://github.com/open-mpi/ompi/pull/3741 


Should fix the problem.

> On Jun 23, 2017, at 3:38 AM, George Bosilca  wrote:
> 
> Ralph,
> 
> I get consistent segfaults during infrastructure teardown in orterun (I 
> noticed them on OS X). After digging a little, it turns out that the 
> opal_buffer_t class is cleaned up in orte_finalize before 
> orte_proc_info_finalize is called, so the destructors are invoked on memory 
> whose class information has already been torn down. If I change the teardown 
> order to move orte_proc_info_finalize before orte_finalize, things work 
> better, but I still get a very annoying warning about a "Bad file descriptor 
> in select".
> 
> Any better fix?
> 
> George.
> 
> PS: Here is the patch I am currently using to get rid of the segfaults
> 
> diff --git a/orte/tools/orterun/orterun.c b/orte/tools/orterun/orterun.c
> index 85aba0a0f3..506b931d35 100644
> --- a/orte/tools/orterun/orterun.c
> +++ b/orte/tools/orterun/orterun.c
> @@ -222,10 +222,10 @@ int orterun(int argc, char *argv[])
>   DONE:
>      /* cleanup and leave */
>      orte_submit_finalize();
> -    orte_finalize();
> -    orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);
>      /* cleanup the process info */
>      orte_proc_info_finalize();
> +    orte_finalize();
> +    orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);
> 
>      if (orte_debug_flag) {
>          fprintf(stderr, "exiting with status %d\n", orte_exit_status);
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Abstraction violation!

2017-06-23 Thread Jeff Squyres (jsquyres)
FWIW: mpi.h is created at the end of configure (it's an AC_CONFIG_HEADERS file).
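For reference, the mechanism Jeff is describing: any header listed in AC_CONFIG_HEADERS is instantiated from its .h.in template by config.status when configure finishes, so it does not exist in a fresh tree until configure has completed. A hedged sketch only - this is not Open MPI's actual configure.ac:

```m4
# Sketch (illustrative, not OMPI's real configure.ac): the header named
# here is generated from ompi/include/mpi.h.in by config.status at the
# very end of configure.
AC_INIT([example], [1.0])
AC_CONFIG_HEADERS([ompi/include/mpi.h])
AC_OUTPUT
```

This is why a stale mpi.h found on the include path can silently satisfy a build that should have failed.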


> On Jun 22, 2017, at 9:37 PM, Barrett, Brian via devel 
>  wrote:
> 
> Thanks, Nathan.
> 
> There’s no mpi.h available on the PR builder hosts, so somehow it works out.  
> Haven’t thought through that path, however.
> 
> Brian
> 
>> On Jun 22, 2017, at 6:04 PM, Nathan Hjelm  wrote:
>> 
>> I have a fix I am working on. Will open a PR tomorrow morning.
>> 
>> -Nathan
>> 
>>> On Jun 22, 2017, at 6:11 PM, r...@open-mpi.org wrote:
>>> 
>>> Here’s something even weirder. You cannot build that file unless mpi.h 
>>> already exists, which it won’t until you build the MPI layer. So apparently 
>>> what is happening is that we somehow pick up a pre-existing version of mpi.h 
>>> and use that to build the file?
>>> 
>>> Checking around, I find that all my available machines have an mpi.h 
>>> somewhere in the default path because we always install _something_. I 
>>> wonder if our master would fail in a distro that didn’t have an MPI 
>>> installed...
>>> 
 On Jun 22, 2017, at 5:02 PM, r...@open-mpi.org wrote:
 
 It apparently did come in that way. We just never test -no-ompi and so it 
 wasn’t discovered until a downstream project tried to update. Then...boom.
 
 
> On Jun 22, 2017, at 4:07 PM, Barrett, Brian via devel 
>  wrote:
> 
> I’m confused; looking at history, there’s never been a time when 
> opal/util/info.c hasn’t included mpi.h.  That seems odd, but so does info 
> being in opal.
> 
> Brian
> 
>> On Jun 22, 2017, at 3:46 PM, r...@open-mpi.org wrote:
>> 
>> I don’t understand what someone was thinking, but you CANNOT #include 
>> “mpi.h” in opal/util/info.c. It has broken pretty much every downstream 
>> project.
>> 
>> Please fix this!
>> Ralph
>> 


-- 
Jeff Squyres
jsquy...@cisco.com


[OMPI devel] orterun busted

2017-06-23 Thread George Bosilca
Ralph,

I get consistent segfaults during infrastructure teardown in orterun (I
noticed them on OS X). After digging a little, it turns out that the
opal_buffer_t class is cleaned up in orte_finalize before
orte_proc_info_finalize is called, so the destructors are invoked on memory
whose class information has already been torn down. If I change the teardown
order to move orte_proc_info_finalize before orte_finalize, things work
better, but I still get a very annoying warning about a "Bad file descriptor
in select".

Any better fix?

George.

PS: Here is the patch I am currently using to get rid of the segfaults

diff --git a/orte/tools/orterun/orterun.c b/orte/tools/orterun/orterun.c
index 85aba0a0f3..506b931d35 100644
--- a/orte/tools/orterun/orterun.c
+++ b/orte/tools/orterun/orterun.c
@@ -222,10 +222,10 @@ int orterun(int argc, char *argv[])
  DONE:
     /* cleanup and leave */
     orte_submit_finalize();
-    orte_finalize();
-    orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);
     /* cleanup the process info */
     orte_proc_info_finalize();
+    orte_finalize();
+    orte_session_dir_cleanup(ORTE_JOBID_WILDCARD);
 
     if (orte_debug_flag) {
         fprintf(stderr, "exiting with status %d\n", orte_exit_status);

Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

2017-06-23 Thread Christoph Niethammer
Hi Howard,

You find the pull request under https://github.com/open-mpi/ompi/pull/3739

Best
Christoph

- Original Message -
From: "Howard Pritchard" 
To: "Open MPI Developers" 
Sent: Thursday, June 22, 2017 4:42:14 PM
Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

Hi Chris 

Please go ahead and open a PR for master and I'll open corresponding ones for 
the release branches. 

Howard 

Christoph Niethammer <nietham...@hlrs.de> wrote on Thu, June 22, 2017 at 01:10:


Hi Howard, 

Sorry, missed the new license policy. I added a Sign-off now. 
Shall I open a pull request? 

Best 
Christoph 

- Original Message -
From: "Howard Pritchard" <hpprit...@gmail.com>
To: "Open MPI Developers" <devel@lists.open-mpi.org>
Sent: Wednesday, June 21, 2017 5:57:05 PM
Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

Hi Chris, 

Sorry for being a bit picky, but could you add a sign-off to the commit
message? I'm not supposed to add it for you manually.

Thanks, 

Howard 


2017-06-21 9:45 GMT-06:00 Howard Pritchard <hpprit...@gmail.com>:



Hi Chris, 

Thanks very much for the patch! 

Howard 


2017-06-21 9:43 GMT-06:00 Christoph Niethammer <nietham...@hlrs.de>:


Hello Ralph, 

Thanks for the update on this issue. 

I used the latest master (c38866eb3929339147259a3a46c6fc815720afdb). 

The behaviour is still the same: aborting before MPI_File_close leaves 
/tmp/OMPI_*.sm files. 
These are not removed by your updated orte-clean. 

I tracked down the origin of these files; it seems to be in
ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:154,
where a left-over TODO note a few lines above also mentions the need for a
correct directory.

I would suggest updating the path there to be under the 
 directory which is cleaned by orte-clean, see 

https://github.com/cniethammer/ompi/commit/2aedf6134813299803628e7d6856a3b781542c02

Best 
Christoph 

- Original Message -
From: "Ralph Castain" <r...@open-mpi.org>
To: "Open MPI Developers" <devel@lists.open-mpi.org>
Sent: Wednesday, June 21, 2017 4:33:29 AM
Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O files in /tmp

I updated orte-clean in master, and for v3.0, so it cleans up both current 
and legacy session directory files as well as any pmix artifacts. I don’t see 
any files named OMPI_*.sm, though that might be something from v2.x? I don’t 
recall us ever making files of that name before - anything we make should be 
under the session directory, not directly in /tmp. 

> On May 9, 2017, at 2:10 AM, Christoph Niethammer <nietham...@hlrs.de> wrote: 
> 
> Hi, 
> 
> I am using Open MPI 2.1.0. 
> 
> Best 
> Christoph 
> 
> - Original Message - 
> From: "Ralph Castain" <r...@open-mpi.org> 
> To: "Open MPI Developers" <devel@lists.open-mpi.org> 
> Sent: Monday, May 8, 2017 6:28:42 PM 
> Subject: Re: [OMPI devel] orte-clean not cleaning left over temporary I/O 
> files in /tmp 
> 
> What version of OMPI are you using? 
> 
>> On May 8, 2017, at 8:56 AM, Christoph Niethammer <nietham...@hlrs.de> wrote: 
>> 
>> Hello 
>> 
>> According to the manpage "...orte-clean attempts to clean up any processes 
>> and files left over from Open MPI jobs that were run in the past as well as 
>> any currently running jobs. This includes OMPI infrastructure and helper 
>> commands, any processes that were spawned as part of the job, and any 
>> temporary files...". 
>> 
>> If I now have a program which calls MPI_File_open, MPI_File_write and 
>> MPI_Abort() in order, I get left over files /tmp/OMPI_*.sm. 
>> Running orte-clean does not remove them. 
>> 
>> Is this a bug or a feature? 
>> 
>> Best 
>> Christoph Niet