Thanks - that helps!
On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland wrote:
> Definitely we are targeting ORTED failures here. If an ORTED fails then
> any other ORTEDs connected to it will notice and report the failure. Of
> course if the failure is an application then the ORTED on that node will
> be the only one to detect it.
I looked through the patch a bit more today and had a few notes/questions.
- orte_errmgr.post_startup() starts the persistent RML message. There
does not seem to be a shutdown version of this (to deregister the RML
message at orte_finalize time). Was this intentional, or just missed?
- in the orte_e
Hello,
I am trying to use checkpoint-restart functionality of OpenMPI. Most of the
times checkpointing of MPI application behaves correctly, but in some
situations the MPI application hangs indefinitely after the checkpoint is
taken. Ompi-checkpoint terminates without error and I do get the snapsh
Definitely we are targeting ORTED failures here. If an ORTED fails then any
other ORTEDs connected to it will notice and report the failure. Of course if
the failure is an application then the ORTED on that node will be the only one
to detect it.
Also, if an ORTED is lost, all of the applicatio
Quick question: could you please clarify this statement:
...because more than one ORTED could (and often will) detect the failure.
>
I don't understand how this can be true, except for detecting an ORTED
failure. Only one orted can detect an MPI process failure, unless you have
now involved orted
Ah - thanks! That really helped clarify things. Much appreciated.
Will look at the patch in this light...
On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland wrote:
>
> Perhaps it would help if you folks could provide a little explanation about
> how you use epoch? While the value sounds similar, your explanations are
> beginning to sound very different from what we are doing and/or had
> envisioned.
>
> I'm not sure how you can talk about an epoch being
On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca wrote:
>
> On Jun 7, 2011, at 12:14, Ralph Castain wrote:
>
> > But the epoch is process-unique - i.e., it is the number of times that
> this specific process has been started, which differs per proc since we
> don't restart all the procs every time
On Tue, Jun 7, 2011 at 10:35 AM, Wesley Bland wrote:
>
> On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:
>
>
>
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote:
>
> To address your concerns about putting the epoch in the process name
> structure, putting it in there rather than in a separately maintained
> list simplifies things later.
On Jun 7, 2011, at 12:14, Ralph Castain wrote:
> But the epoch is process-unique - i.e., it is the number of times that this
> specific process has been started, which differs per proc since we don't
> restart all the procs every time one fails.
Yes the epoch is per process, but it is distrib
On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:
>
>
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
> > To address your concerns about putting the epoch in the process name
> > structure, putting it in there rather than in a separately maintained
> > list simplifies things later.
On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote:
> To address your concerns about putting the epoch in the process name
> structure, putting it in there rather than in a separately maintained list
> simplifies things later.
>
Not really concerned - I was just noting we had done it a tad differently.
To address your concerns about putting the epoch in the process name structure,
putting it in there rather than in a separately maintained list simplifies
things later.
For example, during communication you need to attach the epoch to each of your
messages so they can be tracked later. If a pro
Thanks for the explanation - as I said, I won't have time to really review
the patch this week, but appreciate the info. I don't really expect to see a
conflict as George had discussed this with me previously.
I know I'll have merge conflicts with my state machine branch, which would
be ready for
You might want to try a new checkout, just in case there's something in there
that is svn:ignored...?
(yes, I'm grasping at straws here, but I'm able to build ok with a clean
checkout...?)
On Jun 7, 2011, at 10:38 AM, George Bosilca wrote:
> My 'svn status' indicates no differences. I always
I briefly looked over the patch. Excluding the epochs (which we don't
need now, but will soon) it looks similar to what I have setup on my
MPI run-through stabilization branch - so it should support that work
nicely. I'll try to test it this week and send back any other
comments.
Good work.
Thanks.
This could certainly work alongside another ORCM or any other fault
detection/prediction/recovery mechanism. Most of the code is just dedicated to
keeping the epoch up to date and tracking the status of the processes. The
underlying idea was to provide a way for the application to decide what it
My 'svn status' indicates no differences. I always build using a VPATH, and in
this case I did remove the build directory. However, the issue persisted.
george.
On Jun 7, 2011, at 10:31, Jeff Squyres wrote:
> I've seen VT builds get confused sometimes. I'm not sure of the exact cause,
> bu
I've seen VT builds get confused sometimes. I'm not sure of the exact cause,
but if I get a new checkout, all the problems seem to go away. I've never had
the time to track it down.
Can you get a clean / new checkout and see if that fixes the problem?
On Jun 7, 2011, at 10:27 AM, George Bosi
I can't compile the 1.5 if I do not disable VT. Using the following configure
line:
../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug
--enable-mpirun-prefix-by-default --with-knem=/usr/local/knem
--with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug
I get:
ar: /home/bosilca
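For reference, the usual way to take VT out of the picture in this era of Open MPI was the `--enable-contrib-no-build=vt` configure option (stated from memory of the 1.5 build system; worth double-checking against `configure --help`). A sketch reusing the original configure line:

```shell
# Same configure invocation, with the VampirTrace contrib excluded
../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug \
    --enable-mpirun-prefix-by-default --with-knem=/usr/local/knem \
    --with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug \
    --enable-contrib-no-build=vt
```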
FYI: Terry discovered yesterday that the nightlies hadn't been made in a while
for v1.4 and trunk. There was a filesystem permissions issue on the build
server that has been fixed -- there are new nightly tarballs today for v1.4,
v1.5, and trunk.
--
Jeff Squyres
jsquy...@cisco.com
I'm on travel this week, but will look this over when I return. From the
description, it sounds nearly identical to what we did in ORCM, so I expect
there won't be many issues. You do get some race conditions that the new
state machine code should help resolve.
The only difference I can quickly see is
Worked.
Thanks a lot!
On Jun 7, 2011, at 6:43 AM, Mike Dubman wrote:
>
> Please try with "--mca mpi_leave_pinned 0"
>
> On Mon, Jun 6, 2011 at 4:16 PM, Sebastian Rinke wrote:
> Dear all,
>
> While trying to send a message of size 1610612736 B (1.5 GB), I get the
> following error:
>
> [[52
Please try with "--mca mpi_leave_pinned 0"
On Mon, Jun 6, 2011 at 4:16 PM, Sebastian Rinke wrote:
> Dear all,
>
> While trying to send a message of size 1610612736 B (1.5 GB), I get the
> following error:
>
> [[52363,1],1][../../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_