Thanks - that helps!
On Tue, Jun 7, 2011 at 1:25 PM, Wesley Bland wrote:
> Definitely we are targeting ORTED failures here. If an ORTED fails, then
> any other ORTEDs connected to it will notice and report the failure. Of
> course, if the failure is an application, then the
I looked through the patch a bit more today and had a few notes/questions.
- orte_errmgr.post_startup() starts the persistent RML message. There
does not seem to be a shutdown version of this (to deregister the RML
message at orte_finalize time). Was this intentional, or just missed?
- in the
Hello,
I am trying to use the checkpoint-restart functionality of Open MPI. Most of the
time, checkpointing of an MPI application behaves correctly, but in some
situations the MPI application hangs indefinitely after the checkpoint is
taken. Ompi-checkpoint terminates without error and I do get the
Definitely we are targeting ORTED failures here. If an ORTED fails, then any
other ORTEDs connected to it will notice and report the failure. Of course, if
the failure is an application, then the ORTED on that node will be the only one
to detect it.
Also, if an ORTED is lost, all of the
Quick question: could you please clarify this statement:
...because more than one ORTED could (and often will) detect the failure.
>
I don't understand how this can be true, except for detecting an ORTED
failure. Only one orted can detect an MPI process failure, unless you have
now involved
Ah - thanks! That really helped clarify things. Much appreciated.
Will look at the patch in this light...
On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland wrote:
>
> Perhaps it would help if you folks could provide a little explanation about
> how you use epoch? While the value sounds similar, your explanations are
> beginning to sound very different from what we are doing and/or had
> envisioned.
>
> I'm not sure how you can talk about an epoch
On Tue, Jun 7, 2011 at 10:37 AM, George Bosilca wrote:
>
> On Jun 7, 2011, at 12:14, Ralph Castain wrote:
>
> > But the epoch is process-unique - i.e., it is the number of times that
> this specific process has been started, which differs per proc since we
> don't restart
On Tue, Jun 7, 2011 at 10:35 AM, Wesley Bland wrote:
>
> On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:
>
>
>
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote:
>
> To address your concerns about putting the epoch in the process name
>
On Jun 7, 2011, at 12:14, Ralph Castain wrote:
> But the epoch is process-unique - i.e., it is the number of times that this
> specific process has been started, which differs per proc since we don't
> restart all the procs every time one fails.
Yes, the epoch is per process, but it is
On Tuesday, June 7, 2011 at 12:14 PM, Ralph Castain wrote:
>
>
> On Tue, Jun 7, 2011 at 9:45 AM, Wesley Bland wrote:
> > To address your concerns about putting the epoch in the process name
> > structure, putting it in there rather than in
To address your concerns about putting the epoch in the process name structure,
putting it in there rather than in a separately maintained list simplifies
things later.
For example, during communication you need to attach the epoch to each of your
messages so they can be tracked later. If a
Thanks for the explanation - as I said, I won't have time to really review
the patch this week, but appreciate the info. I don't really expect to see a
conflict as George had discussed this with me previously.
I know I'll have merge conflicts with my state machine branch, which would
be ready for
You might want to try a new checkout, just in case there's something in there
that is svn:ignored...?
(yes, I'm grasping at straws here, but I'm able to build ok with a clean
checkout...?)
On Jun 7, 2011, at 10:38 AM, George Bosilca wrote:
> My 'svn status' indicates no differences. I always
My 'svn status' indicates no differences. I always build using a VPATH, and in
this case I did remove the build directory. However, the issue persisted.
george.
On Jun 7, 2011, at 10:31, Jeff Squyres wrote:
> I've seen VT builds get confused sometimes. I'm not sure of the exact cause,
>
I've seen VT builds get confused sometimes. I'm not sure of the exact cause,
but if I get a new checkout, all the problems seem to go away. I've never had
the time to track it down.
Can you get a clean / new checkout and see if that fixes the problem?
On Jun 7, 2011, at 10:27 AM, George
I can't compile the 1.5 branch if I do not disable VT. Using the following configure
line:
../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug
--enable-mpirun-prefix-by-default --with-knem=/usr/local/knem
--with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug
I get:
ar:
FYI: Terry discovered yesterday that the nightlies hadn't been made in a while
for v1.4 and trunk. There was a filesystem permissions issue on the build
server that has been fixed -- there are new nightly tarballs today for v1.4,
v1.5, and trunk.
--
Jeff Squyres
jsquy...@cisco.com
For
I'm on travel this week, but will look this over when I return. From the
description, it sounds nearly identical to what we did in ORCM, so I expect
there won't be many issues. You do get some race conditions that the new
state machine code should help resolve.
Only difference I can quickly see
Worked.
Thanks a lot!
On Jun 7, 2011, at 6:43 AM, Mike Dubman wrote:
>
> Please try with "--mca mpi_leave_pinned 0"
>
> On Mon, Jun 6, 2011 at 4:16 PM, Sebastian Rinke wrote:
> Dear all,
>
> While trying to send a message of size 1610612736 B (1.5 GB), I get the
>
Please try with "--mca mpi_leave_pinned 0"
On Mon, Jun 6, 2011 at 4:16 PM, Sebastian Rinke wrote:
> Dear all,
>
> While trying to send a message of size 1610612736 B (1.5 GB), I get the
> following error:
>
>
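The suggested workaround can be passed on the mpirun command line; a sketch (the application name and process count here are hypothetical):

```
# Disable the leave-pinned optimization for this run; ./large_send is a
# hypothetical application sending the 1.5 GB message.
mpirun --mca mpi_leave_pinned 0 -np 2 ./large_send
```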