Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-23 Thread Ralph Castain
Could be - let me investigate this weekend. Thanks for all that parsing!!! > On Oct 23, 2015, at 5:00 PM, Mark Santcroos wrote: > > Is this the culprit? > > 'ACTIVATING PROC [[8679,2],0] STATE IOF COMPLETE PRI 4', > 'state:base:track_procs called for proc

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-23 Thread Mark Santcroos
Is this the culprit? 'ACTIVATING PROC [[8679,2],0] STATE IOF COMPLETE PRI 4', 'state:base:track_procs called for proc [[8679,2],0] state RUNNING', That seems to be out of order for the hanging processes.
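
A minimal sketch of how such an ordering check could be automated, assuming the raw log contains lines shaped exactly like the two quoted above. The regular expressions and the function name find_out_of_order are hypothetical and are not taken from the parser script mentioned in this thread; whether this ordering is really the cause of the hang is still only a hypothesis.

```python
import re

# Patterns modelled on the two log lines quoted in the message above.
# The exact log format beyond those two lines is an assumption.
ACTIVATE_RE = re.compile(
    r"ACTIVATING PROC (\[\[\d+,\d+\],\d+\]) STATE ([A-Z ]+?) PRI \d+"
)
TRACK_RE = re.compile(
    r"state:base:track_procs called for proc (\[\[\d+,\d+\],\d+\]) state ([A-Z ]+)"
)


def find_out_of_order(lines):
    """Return procs whose IOF COMPLETE activation appears in the log
    before track_procs is called for them in state RUNNING."""
    iof_complete_seen = set()
    suspects = []
    for line in lines:
        match = ACTIVATE_RE.search(line)
        if match:
            if match.group(2) == "IOF COMPLETE":
                iof_complete_seen.add(match.group(1))
            continue
        match = TRACK_RE.search(line)
        if match and match.group(2).strip() == "RUNNING":
            proc = match.group(1)
            if proc in iof_complete_seen and proc not in suspects:
                suspects.append(proc)
    return suspects
```

Fed the two lines in the order quoted above, the function reports [[8679,2],0] as a suspect; with the order swapped it reports nothing.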

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-23 Thread Mark Santcroos
> On 21 Oct 2015, at 2:50 , Ralph Castain wrote: > Can you do me a favor? Hi Ralph, It required some parsing-fu, but here you go! :-) Three text files attached. One is the raw log, the second is output from my parser script and the third is the output of pstree after it

Re: [OMPI devel] How is session dir used?

2015-10-23 Thread Ralph Castain
No, you won’t see the change to the daemon-to-proc connection coming to the 1.10 series. It will only be upstream from that one, starting with 2.0 > On Oct 23, 2015, at 9:15 AM, Justin Cinkelj wrote: > > > > - Original Message - >> From: "Justin Cinkelj"

Re: [OMPI devel] How is session dir used?

2015-10-23 Thread Justin Cinkelj
- Original Message - > From: "Justin Cinkelj" > To: "Open MPI Developers" > Sent: Friday, October 23, 2015 5:59:43 PM > Subject: Re: [OMPI devel] How is session dir used? > > Shared memory file is used by mpi_program only, and not by orted,

Re: [OMPI devel] OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread George Bosilca
Each module has the opportunity to provide an ft_event function, which is supposed to be called when a change in the module's behavior is necessary. Thus, it is relatively easy to let the BTL know that a particular destination process will migrate to a new location. George. On Fri,

Re: [OMPI devel] How is session dir used?

2015-10-23 Thread Ralph Castain
The session dir is also used by the shared memory system for its backing file, so you may need it if you plan to run more than one proc in a VM. This has been one of the sticking points for VM/container-based operations. As for the orted: your description is pretty close. The socket you mention

Re: [OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread Jeff Squyres (jsquyres)
On Oct 22, 2015, at 7:17 AM, Gilles Gouaillardet wrote: > > Gianmario, > > there was c/r support in the v1.6 series but it has been removed. To be specific: the C/R support was removed from the v2.x branch because it is stale / not working. The support is

Re: [OMPI devel] mtt-submit, etc.

2015-10-23 Thread Jeff Squyres (jsquyres)
I see the issue in the current code: 1. The current code assumes that if you use the MTT database reporter, you can reach the database. One of the first things it does is ping the server to ensure that it's reachable. The rationale is that you don't want MTT to run for a long time and then

Re: [MTT devel] [OMPI devel] mtt-submit, etc.

2015-10-23 Thread Jeff Squyres (jsquyres)
I see the issue in the current code: 1. The current code assumes that if you use the MTT database reporter, you can reach the database. One of the first things it does is ping the server to ensure that it's reachable. The rationale is that you don't want MTT to run for a long time and then

Re: [OMPI devel] OMPI devel] mtt-submit, etc.

2015-10-23 Thread Gilles Gouaillardet
George, then you cannot use https, otherwise the certificate check will fail. Note that if you have a proxy, you can tunnel to the proxy and that should be fine. The main drawback is that the ssh connection must be active when contacting IU, and if a batch manager is used, no one knows when that will be

Re: [OMPI devel] OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread Gilles Gouaillardet
Gianmario, IIRC, there is one pipe between orted and each child's stderr. stdout is a pty, and stdin is /dev/null, though it might be a pipe on task 0. This is how stdout/stderr from the tasks ends up being printed by mpirun: orted does I/O forwarding (aka IOF). Are you trying to migrate only one

Re: [OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread Federico Reghenzani
Hi Adrian and Gilles, first of all thank you for your responses. I'm working with Gianmario on this ambitious project. 2015-10-22 13:17 GMT+02:00 Gilles Gouaillardet < gilles.gouaillar...@gmail.com>: > Gianmario, > > there was c/r support in the v1.6 series but it has been removed. > the

Re: [OMPI devel] mtt-submit, etc.

2015-10-23 Thread Gilles Gouaillardet
Howard, that has already been raised in http://www.open-mpi.org/community/lists/mtt-users/2014/10/0820.php. In the end, Christoph claimed he could achieve that with mtt-relay (but provided no detail on how ...). You might want to check the full thread and/or ask Christoph directly. Ralph,

Re: [OMPI devel] mtt-submit, etc.

2015-10-23 Thread Ralph Castain
I was thinking about this, and I believe it would require a change to the mtt client to avoid it. I’m working on a new Python-based version of it, and I’ll make sure to deal with this there. In the interim, I’ll have to defer to some old, gray Perl guru to update the current client > On Oct