Hi Ralph,
I'm entirely convinced that MPI doesn't have to save power in a normal
scenario. The idea is just that if an MPI process is blocked (i.e. has not
performed progress for, say, 5 minutes, the default in my implementation), we
stop busy polling and have the process drop from 100% CPU usage
Most of the IB protocols used by MPI target a LID. There is no
existing notification path I know of that can replace LID-xyz with
LID-123. The subnet manager might be able to do this, but that raises
security issues.
Interesting problem.
It is not exactly correct. For migration between port
On Mon, 8 Jun 2009, NiftyOMPI Tom Mitchell wrote:
??? dual rail does double the number of switch ports. If you want to
address switch failure each rail must connect to a different switch.
If you do not want to have isolated fabrics you must have some
additional ports on all switches to connect
Sylvain Jeaugey wrote:
Hi Ralph,
I'm entirely convinced that MPI doesn't have to save power in a normal
scenario. The idea is just that if an MPI process is blocked (i.e. has
not performed progress for, say, 5 minutes, the default in my
implementation), we stop busy polling and have the process d
I'd be in favor of bringing this to v1.3. Are there other
dependencies / would it be difficult?
Begin forwarded message:
From: "Open MPI"
Date: June 8, 2009 11:31:20 AM PDT
Cc:
Subject: Re: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails after
~120 spawns
#1927: v1.3 COMM_SPAWN loop
I don't think it would be very hard - I would have to create a patch
for it, but the fix is completely contained in one file and location.
I would like to have someone else test it, though, before we move it
across. It worked for me, but since it is a race condition, that isn't
entirely con
My concern with any form of sleep is with the impact on the proc -
since opal_progress might not be running in a separate thread, won't
the sleep apply to the process as a whole? In that case, the process
isn't free to continue computing.
I can envision applications that might call down int
I understand your point of view, and mostly share it.
I think the biggest point in my example is that sleep occurs only after (I
was wrong in my previous e-mail) 10 minutes of inactivity, and this value
is fully configurable. I didn't intend to call sleep after 2 seconds.
Plus, as said before,
On Mon, 2009-06-08 at 17:50 +0200, Sylvain Jeaugey wrote:
> Principle
> =========
>
> opal_progress() ensures the progression of MPI communication. The current
> algorithm is a loop calling progress on all registered components. If the
> program is blocked, the loop will busy-poll indefinitely.
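
The loop described in the quoted text, together with the proposal to stop busy-polling after a long idle period, can be sketched roughly as follows. This is only an illustration in Python, not Open MPI code: the `ProgressEngine` class, its `register` and `progress` methods, and the `nap` parameter are hypothetical names, and the 600-second default is taken from the 10-minute figure mentioned in this thread.

```python
import time
from typing import Callable, List


class ProgressEngine:
    """Hypothetical stand-in for a progress loop; not Open MPI's actual code."""

    def __init__(self, idle_timeout: float = 600.0, nap: float = 0.001,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.components: List[Callable[[], int]] = []
        self.idle_timeout = idle_timeout  # seconds without events before yielding
        self.nap = nap                    # assumed length of each yield, in seconds
        self._clock = clock
        self._sleep = sleep
        self._last_event = clock()

    def register(self, component: Callable[[], int]) -> None:
        """Add a callback that returns the number of events it completed."""
        self.components.append(component)

    def progress(self) -> int:
        """Poll every registered component once and return the event count."""
        events = sum(component() for component in self.components)
        now = self._clock()
        if events > 0:
            # Something happened, so the idle timer starts over.
            self._last_event = now
        elif now - self._last_event >= self.idle_timeout:
            # Idle for too long: give up the CPU briefly instead of spinning.
            self._sleep(self.nap)
        return events
```

A caller would invoke `progress()` in a loop while waiting for a request to complete. Until the idle timeout expires the loop behaves as a plain busy-poll. Note that the sleep in this sketch blocks the calling thread, so it has the same limitation raised elsewhere in this thread when the progress loop is not running in a separate thread.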
Tested -- seems to work for me. I say we now let MTT sort it out
(i.e., see if others hit this race condition) and apply to v1.3.
On Jun 9, 2009, at 4:46 AM, Ralph Castain wrote:
I don't think it would be very hard - I would have to create a patch
for it, but the fix is completely contained i
Open MPI currently needs to have connected fabrics, but maybe that's
something we would like to change in the future, having two separate
rails. (Btw Pasha, will your current work enable this?)
I do not completely understand what you mean here by two separate
rails ...
Already today you
Couple of other things to help stimulate the thinking:
1. it isn't that OMPI -couldn't- receive a message, but rather that it
-didn't- receive a message. This may or may not indicate that there is a
problem. Could just be an application that doesn't need to communicate for
a while, as per my exampl
On Tue, 9 Jun 2009, Ralph Castain wrote:
2. instead of putting things to sleep or even adjusting the loop rate, you
might want to consider using the orte_notifier
capability and notify the system that the job may be stalled. Or perhaps adding
an API to the orte_errmgr framework to
notify it th
I'll throw in my random $0.02. I'm at the Forum this week, so my
latency on replies here will likely be large.
1. Ashley is correct that we shouldn't sleep. A better solution would
be to block waiting for something to happen (rather than spin). As
Terry mentioned, we pretty much know how
On Jun 9, 2009, at 8:31 AM, Jeff Squyres (jsquyres) wrote:
4. Note, too, that opal_progress() doesn't see *all* progress - the
openib BTL doesn't use opal_progress to know when OpenFabrics messages
arrive, for example.
Wait, I lied -- sorry.
opal_progress will call the bml progress, which t
Hi folks
As mentioned in today's telecon, we at LANL are continuing to see hangs when
running even small jobs that involve shared memory in collective operations.
This has been the topic of discussion before, but I bring it up again
because (a) the problem is beginning to become epidemic across ou