Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Thanks - that helps! On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland wrote: > Definitely we are targeting ORTED failures here. If an ORTED fails then > any other ORTEDs connected to it will notice and report the failure. Of > course if the failure is an application then the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Josh Hursey
I looked through the patch a bit more today and had a few notes/questions. - orte_errmgr.post_startup() starts the persistent RML message. There does not seem to be a shutdown version of this (to deregister the RML message at orte_finalize time). Was this intentional, or just missed? - in the

[OMPI devel] MPI application hangs after a checkpoint

2011-06-07 Thread Kishor Kharbas
Hello, I am trying to use the checkpoint-restart functionality of OpenMPI. Most of the time, checkpointing of an MPI application behaves correctly, but in some situations the MPI application hangs indefinitely after the checkpoint is taken. Ompi-checkpoint terminates without error and I do get the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
Definitely we are targeting ORTED failures here. If an ORTED fails then any other ORTEDs connected to it will notice and report the failure. Of course if the failure is an application then the ORTED on that node will be the only one to detect it. Also, if an ORTED is lost, all of the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Quick question: could you please clarify this statement: ...because more than one ORTED could (and often will) detect the failure. > I don't understand how this can be true, except for detecting an ORTED failure. Only one orted can detect an MPI process failure, unless you have now involved

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Ah - thanks! That really helped clarify things. Much appreciated. Will look at the patch in this light... On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland wrote: > > Perhaps it would help if you folks could provide a little explanation about > how you use epoch? While the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
> > Perhaps it would help if you folks could provide a little explanation about > how you use epoch? While the value sounds similar, your explanations are > beginning to sound very different from what we are doing and/or had > envisioned. > > I'm not sure how you can talk about an epoch

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca wrote: > > On Jun 7, 2011, at 12:14 , Ralph Castain wrote: > > > But the epoch is process-unique - i.e., it is the number of times that > this specific process has been started, which differs per proc since we > don't restart

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 10:35 AM, Wesley Bland wrote: > > On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote: > > > > On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote: > > To address your concerns about putting the epoch in the process name >

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread George Bosilca
On Jun 7, 2011, at 12:14 , Ralph Castain wrote: > But the epoch is process-unique - i.e., it is the number of times that this > specific process has been started, which differs per proc since we don't > restart all the procs every time one fails. Yes the epoch is per process, but it is

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote: > > > On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote: > > To address your concerns about putting the epoch in the process name > > structure, putting it in there rather than in

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
To address your concerns about putting the epoch in the process name structure, putting it in there rather than in a separately maintained list simplifies things later. For example, during communication you need to attach the epoch to each of your messages so they can be tracked later. If a
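
For illustration only (this is not taken from the patch): a minimal C sketch of the idea in the message above, where the epoch travels inside the process name, so every message carrying a sender name also carries the sender's epoch. The names proc_name_t, message_t, and message_is_current are hypothetical and are not ORTE's actual definitions; the field layout is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical process name (NOT the real ORTE structure):
 * job id, rank within the job, and a per-process restart counter. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    uint32_t epoch;  /* number of times this specific process has been started */
} proc_name_t;

/* Hypothetical message: because the sender's name is attached,
 * the sender's epoch comes along without a separate lookup table. */
typedef struct {
    proc_name_t sender;
    const char *payload;
} message_t;

/* Returns true when the message was sent by the incarnation of the peer
 * that the receiver currently knows about; a differing epoch means the
 * message belongs to another incarnation of the same process. */
static bool message_is_current(const message_t *msg, const proc_name_t *known_peer)
{
    assert(msg != NULL && known_peer != NULL);
    return msg->sender.jobid == known_peer->jobid &&
           msg->sender.vpid == known_peer->vpid &&
           msg->sender.epoch == known_peer->epoch;
}
```

With the epoch stored in the name, a receiver can compare the whole name in one place instead of consulting a separately maintained epoch list, which is the simplification the message argues for.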

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Thanks for the explanation - as I said, I won't have time to really review the patch this week, but appreciate the info. I don't really expect to see a conflict as George had discussed this with me previously. I know I'll have merge conflicts with my state machine branch, which would be ready for

Re: [OMPI devel] VT support for 1.5

2011-06-07 Thread Jeff Squyres
You might want to try a new checkout, just in case there's something in there that is svn:ignored...? (yes, I'm grasping at straws here, but I'm able to build ok with a clean checkout...?) On Jun 7, 2011, at 10:38 AM, George Bosilca wrote: > My 'svn status' indicates no differences. I always

Re: [OMPI devel] VT support for 1.5

2011-06-07 Thread George Bosilca
My 'svn status' indicates no differences. I always build using a VPATH, and in this case I did remove the build directory. However, the issue persisted. george. On Jun 7, 2011, at 10:31 , Jeff Squyres wrote: > I've seen VT builds get confused sometimes. I'm not sure of the exact cause, >

Re: [OMPI devel] VT support for 1.5

2011-06-07 Thread Jeff Squyres
I've seen VT builds get confused sometimes. I'm not sure of the exact cause, but if I get a new checkout, all the problems seem to go away. I've never had the time to track it down. Can you get a clean / new checkout and see if that fixes the problem? On Jun 7, 2011, at 10:27 AM, George

[OMPI devel] VT support for 1.5

2011-06-07 Thread George Bosilca
I can't compile 1.5 if I do not disable VT. Using the following configure line: ../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug --enable-mpirun-prefix-by-default --with-knem=/usr/local/knem --with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug I get: ar:

[OMPI devel] Nightly tarball problem: fixed

2011-06-07 Thread Jeff Squyres
FYI: Terry discovered yesterday that the nightlies hadn't been made in a while for v1.4 and trunk. There was a filesystem permissions issue on the build server that has been fixed -- there are new nightly tarballs today for v1.4, v1.5, and trunk. -- Jeff Squyres jsquy...@cisco.com For

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
I'm on travel this week, but will look this over when I return. From the description, it sounds nearly identical to what we did in ORCM, so I expect there won't be many issues. You do get some race conditions that the new state machine code should help resolve. Only difference I can quickly see

Re: [OMPI devel] openib error for message size 1.5 GB

2011-06-07 Thread Sebastian Rinke
Worked. Thanks a lot! On Jun 7, 2011, at 6:43 AM, Mike Dubman wrote: > > Please try with "--mca mpi_leave_pinned 0" > > On Mon, Jun 6, 2011 at 4:16 PM, Sebastian Rinke wrote: > Dear all, > > While trying to send a message of size 1610612736 B (1.5 GB), I get the >

Re: [OMPI devel] openib error for message size 1.5 GB

2011-06-07 Thread Mike Dubman
Please try with "--mca mpi_leave_pinned 0" On Mon, Jun 6, 2011 at 4:16 PM, Sebastian Rinke wrote: > Dear all, > > While trying to send a message of size 1610612736 B (1.5 GB), I get the > following error: > >