Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Thanks - that helps! On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland wrote: > Definitely we are targeting ORTED failures here. If an ORTED fails then > any other ORTEDs connected to it will notice and report the failure. Of > course if the failure is an application then the ORTED on that node wil

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Josh Hursey
I looked through the patch a bit more today and had a few notes/questions. - orte_errmgr.post_startup() starts the persistent RML message. There does not seem to be a shutdown version of this (to deregister the RML message at orte_finalize time). Was this intentional, or just missed? - in the orte_e

[OMPI devel] MPI application hangs after a checkpoint

2011-06-07 Thread Kishor Kharbas
Hello, I am trying to use the checkpoint-restart functionality of OpenMPI. Most of the time, checkpointing of an MPI application behaves correctly, but in some situations the MPI application hangs indefinitely after the checkpoint is taken. Ompi-checkpoint terminates without error and I do get the snapsh

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
Definitely we are targeting ORTED failures here. If an ORTED fails then any other ORTEDs connected to it will notice and report the failure. Of course if the failure is an application then the ORTED on that node will be the only one to detect it. Also, if an ORTED is lost, all of the applicatio

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Quick question: could you please clarify this statement: ...because more than one ORTED could (and often will) detect the failure. > I don't understand how this can be true, except for detecting an ORTED failure. Only one orted can detect an MPI process failure, unless you have now involved orted

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Ah - thanks! That really helped clarify things. Much appreciated. Will look at the patch in this light... On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland wrote: > > Perhaps it would help if you folks could provide a little explanation about > how you use epoch? While the value sounds similar, your

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
> > Perhaps it would help if you folks could provide a little explanation about > how you use epoch? While the value sounds similar, your explanations are > beginning to sound very different from what we are doing and/or had > envisioned. > > I'm not sure how you can talk about an epoch being

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca wrote: > > On Jun 7, 2011, at 12:14 , Ralph Castain wrote: > > > But the epoch is process-unique - i.e., it is the number of times that > this specific process has been started, which differs per proc since we > don't restart all the procs every time

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 10:35 AM, Wesley Bland wrote: > > On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote: > > > > On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote: > > To address your concerns about putting the epoch in the process name > structure, putting it in there rather than i

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread George Bosilca
On Jun 7, 2011, at 12:14 , Ralph Castain wrote: > But the epoch is process-unique - i.e., it is the number of times that this > specific process has been started, which differs per proc since we don't > restart all the procs every time one fails. Yes the epoch is per process, but it is distrib

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote: > > > On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote: > > To address your concerns about putting the epoch in the process name > > structure, putting it in there rather than in a separately maintained

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote: > To address your concerns about putting the epoch in the process name > structure, putting it in there rather than in a separately maintained list > simplifies things later. > Not really concerned - I was just noting we had done it a tad diffe

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
To address your concerns about putting the epoch in the process name structure, putting it in there rather than in a separately maintained list simplifies things later. For example, during communication you need to attach the epoch to each of your messages so they can be tracked later. If a pro

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
Thanks for the explanation - as I said, I won't have time to really review the patch this week, but appreciate the info. I don't really expect to see a conflict as George had discussed this with me previously. I know I'll have merge conflicts with my state machine branch, which would be ready for

Re: [OMPI devel] VT support for 1.5

2011-06-07 Thread Jeff Squyres
You might want to try a new checkout, just in case there's something in there that is svn:ignored...? (yes, I'm grasping at straws here, but I'm able to build ok with a clean checkout...?) On Jun 7, 2011, at 10:38 AM, George Bosilca wrote: > My 'svn status' indicates no differences. I always

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Josh Hursey
I briefly looked over the patch. Excluding the epochs (which we don't need now, but will soon) it looks similar to what I have setup on my MPI run-through stabilization branch - so it should support that work nicely. I'll try to test it this week and send back any other comments. Good work. Thank

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Wesley Bland
This could certainly work alongside another ORCM or any other fault detection/prediction/recovery mechanism. Most of the code is just dedicated to keeping the epoch up to date and tracking the status of the processes. The underlying idea was to provide a way for the application to decide what it

Re: [OMPI devel] VT support for 1.5

2011-06-07 Thread George Bosilca
My 'svn status' indicates no differences. I always build using a VPATH, and in this case I did remove the build directory. However, the issue persisted. george. On Jun 7, 2011, at 10:31 , Jeff Squyres wrote: > I've seen VT builds get confused sometimes. I'm not sure of the exact cause, > bu

Re: [OMPI devel] VT support for 1.5

2011-06-07 Thread Jeff Squyres
I've seen VT builds get confused sometimes. I'm not sure of the exact cause, but if I get a new checkout, all the problems seem to go away. I've never had the time to track it down. Can you get a clean / new checkout and see if that fixes the problem? On Jun 7, 2011, at 10:27 AM, George Bosi

[OMPI devel] VT support for 1.5

2011-06-07 Thread George Bosilca
I can't compile the 1.5 if I do not disable VT. Using the following configure line: ../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug --enable-mpirun-prefix-by-default --with-knem=/usr/local/knem --with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug I get: ar: /home/bosilca

[OMPI devel] Nightly tarball problem: fixed

2011-06-07 Thread Jeff Squyres
FYI: Terry discovered yesterday that the nightlies hadn't been made in a while for v1.4 and trunk. There was a filesystem permissions issue on the build server that has been fixed -- there are new nightly tarballs today for v1.4, v1.5, and trunk. -- Jeff Squyres jsquy...@cisco.com For corpora

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-07 Thread Ralph Castain
I'm on travel this week, but will look this over when I return. From the description, it sounds nearly identical to what we did in ORCM, so I expect there won't be many issues. You do get some race conditions that the new state machine code should help resolve. Only difference I can quickly see is

Re: [OMPI devel] openib error for message size 1.5 GB

2011-06-07 Thread Sebastian Rinke
Worked. Thanks a lot! On Jun 7, 2011, at 6:43 AM, Mike Dubman wrote: > > Please try with "--mca mpi_leave_pinned 0" > > On Mon, Jun 6, 2011 at 4:16 PM, Sebastian Rinke wrote: > Dear all, > > While trying to send a message of size 1610612736 B (1.5 GB), I get the > following error: > > [[52

Re: [OMPI devel] openib error for message size 1.5 GB

2011-06-07 Thread Mike Dubman
Please try with "--mca mpi_leave_pinned 0" On Mon, Jun 6, 2011 at 4:16 PM, Sebastian Rinke wrote: > Dear all, > > While trying to send a message of size 1610612736 B (1.5 GB), I get the > following error: > > [[52363,1],1][../../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_