Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey

Hi Ralph,

I'm entirely convinced that MPI doesn't have to save power in a normal 
scenario. The idea is just that if an MPI process is blocked (i.e. has not 
made progress for, say, 5 minutes, the default in my implementation), we 
stop busy polling and have the process drop from 100% CPU usage to 0%.


I do not call sleep() but usleep(). The result is quite the same, but it 
hurts performance less in case of an (unexpected) restart.


However, the goal of my RFC was also to know if there was a cleaner way 
to achieve my goal, and from what I read, I guess I should look at the 
"tick" rate instead of trying to do my own delaying.


Don't worry, I was quite expecting the configure-in requirement. However, 
I don't think my patch is good for inclusion; it is only an example to 
describe what I want to achieve.


Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote:

I'm not entirely convinced this actually achieves your goals, but I can see 
some potential benefits. I'm also not sure that power consumption is that big 
of an issue that MPI needs to begin chasing "power saver" modes of operation, 
but that can be a separate debate some day.


I'm assuming you don't mean that you actually call "sleep()" as this would be 
very bad - I'm assuming you just change the opal_progress "tick" rate 
instead. True? If not, and you really call "sleep", then I would have to 
oppose adding this to the code base pending discussion with others who can 
corroborate that this won't cause problems.


Either way, I could live with this so long as it was done as a "configure-in" 
capability. Just having the params default to a value that causes the system 
to behave similarly to today isn't enough - we still wind up adding logic 
into a very critical timing loop for no reason. A simple configure option of 
--enable-mpi-progress-monitoring would be sufficient to protect the code.


HTH
Ralph


On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:

What: when nothing has been received for a very long time - e.g. 5 
minutes, stop busy polling in opal_progress and switch to a usleep-based 
one.


Why: when we have long waits, and especially when an application is 
deadlocked, detecting it is not easy and a lot of power is wasted until 
the end of the time slice (if there is one).


Where: an example of how it could be implemented is available at 
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/


Principle
=========

opal_progress() ensures the progression of MPI communication. The current 
algorithm is a loop calling progress on all registered components. If the 
program is blocked, the loop will busy-poll indefinitely.


Going to sleep after a certain amount of time with nothing received is 
interesting for two things:
- Administrators can easily detect whether a job is deadlocked: all the 
processes are in sleep(). Currently, all processors are using 100% CPU and 
it is very hard to know whether progression is still happening or not.

- When there is nothing to receive, power usage is greatly reduced.

However, it could hurt performance in some cases, typically if we go to 
sleep just before the message arrives. This depends heavily on the 
parameters you give to the sleep mechanism.


At first, we can start with the following assumption: if the sleep takes T 
usec, then sleeping only after 10000xT usec should slow down receives by 
less than 0.01 %, since the added delay is at most T.


However, other processes may suffer from you being late, and be delayed by 
T usec (which may represent more than 0.01% for them).


So, the goal of this mechanism is mainly to detect far-too-long waits; it 
should almost never be used in normal MPI jobs. It could also trigger a 
warning message when starting to sleep, or at least a trace in the 
notifier.


Details of Implementation
=========================

Three parameters fully control the behaviour of this mechanism:
* opal_progress_sleep_count: number of unsuccessful opal_progress() calls 
before we start the timer (to prevent latency impact). It defaults to -1, 
which completely deactivates the sleep (and is therefore equivalent to the 
former code). A value of 1000 can be thought of as a starting point to 
enable this mechanism.
* opal_progress_sleep_trigger: time to wait before going to 
low-pressure-powersave mode. Default: 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration: time we sleep at each further unsuccessful 
call to opal_progress(). Default: 1000 (in us) = 1 ms.


The duration is big enough to make the process show 0% CPU in top, but low 
enough to preserve a good trigger/duration ratio.


The trigger is voluntarily high, to keep a good trigger/duration ratio. 
Indeed, to prevent delays from causing chain reactions, the trigger should 
be higher than duration * numprocs.
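
As an illustration, here is a minimal sketch of what such a low-pressure
check could look like at the end of each opal_progress() pass. It is only a
sketch of the idea under the parameter semantics above - names and structure
are simplified, and it is not the actual patch (which lives at the bitbucket
URL above):

#include <time.h>
#include <unistd.h>

/* Sketch of the low-pressure logic; the three knobs mirror the MCA
 * parameters described above. */
static int    sleep_count    = -1;    /* opal_progress_sleep_count (-1 = off) */
static int    sleep_trigger  = 600;   /* opal_progress_sleep_trigger, seconds */
static int    sleep_duration = 1000;  /* opal_progress_sleep_duration, usec   */

static int    idle_calls = 0;         /* unsuccessful calls since last event  */
static time_t idle_since = 0;         /* when the trigger timer was armed     */

void low_pressure_check(int events)
{
    if (events > 0 || sleep_count < 0) {
        idle_calls = 0;               /* progress was made, or mechanism off  */
        idle_since = 0;
        return;
    }
    if (++idle_calls < sleep_count) {
        return;                       /* not enough idle calls yet            */
    }
    if (0 == idle_since) {
        idle_since = time(NULL);      /* arm the trigger timer                */
    } else if (time(NULL) - idle_since >= sleep_trigger) {
        usleep(sleep_duration);       /* low-pressure mode: ~0% CPU in top    */
    }
}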


Possible Improvements & Pitfalls
================================

* Trigger could be set automatically at max(trigger, duration * numprocs * 
2).


* poll_start and poll_count could be fields 

Re: [OMPI devel] Multi-rail on openib

2009-06-09 Thread Pavel Shamis (Pasha)




Most of the IB protocols used by MPI target a LID.   There is no
existing notification path I know of that can replace LID-xyz with
LID-123.  The subnet manager might be able to do this, but that begs
security issues.

Interesting problem.
  

That is not exactly correct. For migration between ports
on the same HCA you may use the IB APM (Automatic Path Migration) feature.
It is already implemented in Open MPI 1.3.x.

In case of port migration between different HCAs, we need to do it in 
software. And I guess that is what Sylvain is doing.
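
For background, arming APM with libibverbs looks roughly like the
following. This is a generic sketch of the verbs calls involved, with error
handling and alternate-path discovery omitted - it is not the Open MPI 1.3
code:

#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Load an alternate path into a connected QP and re-arm path migration,
 * so the HCA can fail over to the alternate port on error. */
int arm_apm(struct ibv_qp *qp, uint16_t alt_dlid, uint8_t alt_port)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.alt_ah_attr.dlid     = alt_dlid;  /* peer LID on the alternate path */
    attr.alt_ah_attr.port_num = alt_port;
    attr.alt_port_num         = alt_port;  /* our local alternate port       */
    attr.alt_timeout          = 14;
    attr.path_mig_state       = IBV_MIG_REARM;

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);
}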

Pasha.



Re: [OMPI devel] Multi-rail on openib

2009-06-09 Thread Sylvain Jeaugey

On Mon, 8 Jun 2009, NiftyOMPI Tom Mitchell wrote:

Dual rail does double the number of switch ports. If you want to 
address switch failure, each rail must connect to a different switch. 
If you do not want to have isolated fabrics, you must have some 
additional ports on all switches to connect the two fabrics, and enough 
of them to maintain sufficient bandwidth and connectivity when a switch 
fails. Thus, you are doubling the fabric unless I am missing something.
Well, it is pretty much research for now. But yes, we want each port to be 
connected to a different switch so that both cable and switch failures can 
be survived.


Open MPI currently needs to have connected fabrics, but maybe that's 
something we would like to change in the future, having two separate rails. 
(Btw Pasha, will your current work enable this?)


Is your second set of switches so minimally connected that the second 
tree can be installed with a small switch count?
That's the idea, yes. For example, you could have a primary QDR fat-tree 
network and a failover non fat-tree DDR one (potentially recycled from a 
previous machine).



What are the odds, when port 1 fails, that port 2 is going to
be live?  Cable/connector errors would be the most likely
case where port 2 would be live.  In general, if port 1 fails
I would expect port 2 to have issues too.
Well, depending on the errors you want to be able to survive, you may have 
2 cards, in which case there is no reason why port1 failure would cause 
port2 to fail too. But in all cases, switches and cable errors are a 
concern to us.


Sylvain


Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Terry Dontje

Sylvain Jeaugey wrote:

Hi Ralph,

I'm entirely convinced that MPI doesn't have to save power in a normal 
scenario. The idea is just that if an MPI process is blocked (i.e. has 
not made progress for, say, 5 minutes, the default in my 
implementation), we stop busy polling and have the process drop from 
100% CPU usage to 0%.


I do not call sleep() but usleep(). The result is quite the same, but 
it hurts performance less in case of an (unexpected) restart.


However, the goal of my RFC was also to know if there was a cleaner 
way to achieve my goal, and from what I read, I guess I should look at 
the "tick" rate instead of trying to do my own delaying.


One way around this is to make all blocked communications (even SM) 
use poll to block for incoming messages.  Jeff and I have discussed this 
and had many false starts on it.  The biggest issue is coming up with a 
way to have blocks on the SM BTL converted to the system poll call 
without requiring a socket write for every packet.
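
A rough sketch of that idea, as I understand it: each receiver owns an fd
it can block on with poll(), and a sender writes a wakeup byte only when
the receiver has flagged itself as blocked, keeping the common fast path
write-free. This is hypothetical code, and it deliberately ignores the hard
part - the race between a sender checking the flag and the receiver going
to sleep - which is exactly where the false starts come from:

#include <poll.h>
#include <unistd.h>

/* 'blocked' lives in shared memory where senders can read it; wake_fd is
 * a pipe or socket the receiver can poll() on. */
void recv_wait(volatile int *blocked, int wake_fd)
{
    struct pollfd pfd = { .fd = wake_fd, .events = POLLIN };
    char buf;

    *blocked = 1;                    /* advertise: wake me via the fd    */
    poll(&pfd, 1, -1);               /* sleep until a sender writes      */
    *blocked = 0;
    read(wake_fd, &buf, 1);          /* drain the wakeup byte            */
}

void send_notify(volatile int *peer_blocked, int peer_wake_fd)
{
    /* ... deposit the message in the shared-memory queue first ... */
    if (*peer_blocked) {
        char c = 1;
        write(peer_wake_fd, &c, 1);  /* slow path only: peer is asleep   */
    }
}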


The usleep solution works but is kind of ugly IMO.  I think when I 
looked at doing that, the overhead increased significantly for certain 
communications.  Maybe not for toy benchmarks, but for less synchronized 
processes I saw the usleep adding overhead where I didn't want it to.


--td
Don't worry, I was quite expecting the configure-in requirement. 
However, I don't think my patch is good for inclusion; it is only an 
example to describe what I want to achieve.


Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote:

I'm not entirely convinced this actually achieves your goals, but I 
can see some potential benefits. I'm also not sure that power 
consumption is that big of an issue that MPI needs to begin chasing 
"power saver" modes of operation, but that can be a separate debate 
some day.


I'm assuming you don't mean that you actually call "sleep()" as this 
would be very bad - I'm assuming you just change the opal_progress 
"tick" rate instead. True? If not, and you really call "sleep", then 
I would have to oppose adding this to the code base pending 
discussion with others who can corroborate that this won't cause 
problems.


Either way, I could live with this so long as it was done as a 
"configure-in" capability. Just having the params default to a value 
that causes the system to behave similarly to today isn't enough - we 
still wind up adding logic into a very critical timing loop for no 
reason. A simple configure option of --enable-mpi-progress-monitoring 
would be sufficient to protect the code.


HTH
Ralph


On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:

What: when nothing has been received for a very long time - e.g. 5 
minutes, stop busy polling in opal_progress and switch to a 
usleep-based one.


Why: when we have long waits, and especially when an application is 
deadlocked, detecting it is not easy and a lot of power is wasted 
until the end of the time slice (if there is one).


Where: an example of how it could be implemented is available at 
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/


Principle
=========

opal_progress() ensures the progression of MPI communication. The 
current algorithm is a loop calling progress on all registered 
components. If the program is blocked, the loop will busy-poll 
indefinitely.


Going to sleep after a certain amount of time with nothing received 
is interesting for two things:
- Administrators can easily detect whether a job is deadlocked: all 
the processes are in sleep(). Currently, all processors are using 
100% CPU and it is very hard to know whether progression is still 
happening or not.

- When there is nothing to receive, power usage is greatly reduced.

However, it could hurt performance in some cases, typically if we go 
to sleep just before the message arrives. This depends heavily on 
the parameters you give to the sleep mechanism.


At first, we can start with the following assumption: if the sleep 
takes T usec, then sleeping only after 10000xT usec should slow down 
receives by less than 0.01 %, since the added delay is at most T.


However, other processes may suffer from you being late, and be 
delayed by T usec (which may represent more than 0.01% for them).


So, the goal of this mechanism is mainly to detect 
far-too-long waits; it should almost never be used in normal MPI 
jobs. It could also trigger a warning message when starting to 
sleep, or at least a trace in the notifier.


Details of Implementation
=========================

Three parameters fully control the behaviour of this mechanism:
* opal_progress_sleep_count: number of unsuccessful opal_progress() 
calls before we start the timer (to prevent latency impact). It 
defaults to -1, which completely deactivates the sleep (and is 
therefore equivalent to the former code). A value of 1000 can be 
thought of as a starting point to enable this mechanism.
* opal_progress_sleep_trigger : time to wait before going to 
low-pressure-powersave mode. Default : 600 (in 

[OMPI devel] Fwd: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns

2009-06-09 Thread Jeff Squyres
I'd be in favor of bringing this to v1.3.  Are there other  
dependencies / would it be difficult?



Begin forwarded message:


From: "Open MPI" 
Date: June 8, 2009 11:31:20 AM PDT
Cc: 
Subject: Re: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails after  
~120 spawns


#1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
-------------------------+---------------------------------
 Reporter:  jsquyres     |      Owner:  rhc
     Type:  defect       |     Status:  closed
 Priority:  critical     |  Milestone:  Open MPI 1.3.4
  Version:  1.3 branch   | Resolution:  fixed
 Keywords:               |
-------------------------+---------------------------------

Changes (by rhc):

  * status:  new => closed
  * resolution:  => fixed


Comment:

 This was due to a very tight loop on comm_spawn not giving enough time for
 the prior proc to completely terminate (and thus free its file
 descriptors) before the next proc was launched. Eventually, we built up a
 backlog of terminations to process and ran out of fd's.

 We introduced a check-and-delay in the code that detects we don't have
 enough fd's to launch another proc, and then waits a second to see if
 enough become free before aborting.

 Fixed in trunk - can see if we want to bring it to 1.3.
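
 For the curious, the check-and-delay can be sketched roughly as follows.
 This is an illustration of the idea only, with made-up names - not the
 actual trunk fix:

 #include <unistd.h>

 /* Probe whether 'needed' fds are still available by dup'ing stdin,
  * then release the probes. */
 static int fds_available(int needed)
 {
     int probe[needed], got = 0, i;
     for (i = 0; i < needed; i++) {
         probe[i] = dup(0);
         if (probe[i] < 0) break;
         got++;
     }
     for (i = 0; i < got; i++) close(probe[i]);
     return got == needed;
 }

 int launch_with_fd_check(int fds_needed)
 {
     if (!fds_available(fds_needed)) {
         sleep(1);                   /* let pending terminations free fds */
         if (!fds_available(fds_needed)) {
             return -1;              /* still starved: abort the launch   */
         }
     }
     /* ... fork/exec the new proc here ... */
     return 0;
 }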

--
Ticket URL: 
Open MPI 





--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Fwd: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns

2009-06-09 Thread Ralph Castain
I don't think it would be very hard - I would have to create a patch  
for it, but the fix is completely contained in one file and location.


I would like to have someone else test it, though, before we move it  
across. It worked for me, but since it is a race condition, that isn't  
entirely convincing.



On Jun 9, 2009, at 5:41 AM, Jeff Squyres wrote:

I'd be in favor of bringing this to v1.3.  Are there other  
dependencies / would it be difficult?



Begin forwarded message:


From: "Open MPI" 
Date: June 8, 2009 11:31:20 AM PDT
Cc: 
Subject: Re: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails  
after ~120 spawns


#1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
-------------------------+---------------------------------
 Reporter:  jsquyres     |      Owner:  rhc
     Type:  defect       |     Status:  closed
 Priority:  critical     |  Milestone:  Open MPI 1.3.4
  Version:  1.3 branch   | Resolution:  fixed
 Keywords:               |
-------------------------+---------------------------------

Changes (by rhc):

 * status:  new => closed
 * resolution:  => fixed


Comment:

This was due to a very tight loop on comm_spawn not giving enough time for
the prior proc to completely terminate (and thus free its file
descriptors) before the next proc was launched. Eventually, we built up a
backlog of terminations to process and ran out of fd's.

We introduced a check-and-delay in the code that detects we don't have
enough fd's to launch another proc, and then waits a second to see if
enough become free before aborting.

Fixed in trunk - can see if we want to bring it to 1.3.

--
Ticket URL: 

Open MPI 





--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Ralph Castain
My concern with any form of sleep is with the impact on the proc -  
since opal_progress might not be running in a separate thread, won't  
the sleep apply to the process as a whole? In that case, the process  
isn't free to continue computing.


I can envision applications that might call down into the MPI library  
and have opal_progress not find anything, but there is nothing wrong.  
The application could continue computations just fine. I would hate to  
see us put the process to sleep just because the MPI library wasn't  
busy enough.


Hence my suggestion to just change the tick rate. It would definitely  
cause a higher latency for the first message that arrived while in  
this state, which is bothersome, but would meet the stated objective  
without interfering with the process itself.


LANL has also been looking at this problem of stalled jobs, but from a  
different approach. We monitor (using a separate job) progress in  
terms of output files changing in size plus other factors as specified  
by the user. If we don't see any progress in those terms over some  
time, then we kill the job. We chose that path because of the concerns  
expressed above - e.g., on our RR machine, intense computations can be  
underway on the Cell blades while the Opteron MPI processes wait for  
us to reach a communication point. We -want- those processes spinning  
away so that, when the comm starts, it can proceed as quickly as  
possible.


Just some thoughts...
Ralph


On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:


Sylvain Jeaugey wrote:

Hi Ralph,

I'm entirely convinced that MPI doesn't have to save power in a 
normal scenario. The idea is just that if an MPI process is blocked 
(i.e. has not made progress for, say, 5 minutes, the default in my 
implementation), we stop busy polling and have the process drop 
from 100% CPU usage to 0%.


I do not call sleep() but usleep(). The result is quite the same, 
but it hurts performance less in case of an (unexpected) restart.


However, the goal of my RFC was also to know if there was a 
cleaner way to achieve my goal, and from what I read, I guess I 
should look at the "tick" rate instead of trying to do my own 
delaying.


One way around this is to make all blocked communications (even SM) 
use poll to block for incoming messages.  Jeff and I have 
discussed this and had many false starts on it.  The biggest issue 
is coming up with a way to have blocks on the SM BTL converted to 
the system poll call without requiring a socket write for every 
packet.


The usleep solution works but is kind of ugly IMO.  I think when I 
looked at doing that, the overhead increased significantly for certain 
communications.  Maybe not for toy benchmarks, but for less 
synchronized processes I saw the usleep adding overhead where I 
didn't want it to.


--td
Don't worry, I was quite expecting the configure-in requirement. 
However, I don't think my patch is good for inclusion; it is only 
an example to describe what I want to achieve.


Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote:

I'm not entirely convinced this actually achieves your goals, but  
I can see some potential benefits. I'm also not sure that power  
consumption is that big of an issue that MPI needs to begin  
chasing "power saver" modes of operation, but that can be a  
separate debate some day.


I'm assuming you don't mean that you actually call "sleep()" as  
this would be very bad - I'm assuming you just change the  
opal_progress "tick" rate instead. True? If not, and you really  
call "sleep", then I would have to oppose adding this to the code  
base pending discussion with others who can corroborate that this  
won't cause problems.


Either way, I could live with this so long as it was done as a  
"configure-in" capability. Just having the params default to a  
value that causes the system to behave similarly to today isn't  
enough - we still wind up adding logic into a very critical timing  
loop for no reason. A simple configure option of --enable-mpi- 
progress-monitoring would be sufficient to protect the code.


HTH
Ralph


On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:

What: when nothing has been received for a very long time - e.g. 
5 minutes, stop busy polling in opal_progress and switch to a 
usleep-based one.


Why: when we have long waits, and especially when an application 
is deadlocked, detecting it is not easy and a lot of power is 
wasted until the end of the time slice (if there is one).


Where: an example of how it could be implemented is available at 
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/

Principle
=

opal_progress() ensures the progression of MPI communication. The  
current algorithm is a loop calling progress on all registered  
components. If the program is blocked, the loop will busy-poll  
indefinitely.


Going to sleep after a certain amount of time with nothing  
received is in

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey

I understand your point of view, and mostly share it.

I think the biggest point in my example is that sleep occurs only after (I 
was wrong in my previous e-mail) 10 minutes of inactivity, and this value 
is fully configurable. I didn't intend to call sleep after 2 seconds. 
Plus, as said before, I planned to have the library do show_help() when 
this happens (something like : "Open MPI couldn't receive a message for 10 
minutes, lowering pressure") so that the application that really needs 
more than 10 minutes to receive a message can increase it.


Looking at the tick rate code, I couldn't see how changing it would make 
CPU usage drop. If I understand correctly your e-mail, you block in the 
kernel using poll(), is that right? So, you may well lose 10 us because 
of that kernel call, but this is a lot less than the 1 ms I'm currently 
losing with usleep. This makes sense - although it is hard to implement, 
since all BTLs must have this ability.


Thanks for your comments, I will continue to think about it.

Sylvain

On Tue, 9 Jun 2009, Ralph Castain wrote:

My concern with any form of sleep is with the impact on the proc - since 
opal_progress might not be running in a separate thread, won't the sleep 
apply to the process as a whole? In that case, the process isn't free to 
continue computing.


I can envision applications that might call down into the MPI library and 
have opal_progress not find anything, but there is nothing wrong. The 
application could continue computations just fine. I would hate to see us put 
the process to sleep just because the MPI library wasn't busy enough.


Hence my suggestion to just change the tick rate. It would definitely cause a 
higher latency for the first message that arrived while in this state, which 
is bothersome, but would meet the stated objective without interfering with 
the process itself.


LANL has also been looking at this problem of stalled jobs, but from a 
different approach. We monitor (using a separate job) progress in terms of 
output files changing in size plus other factors as specified by the user. If 
we don't see any progress in those terms over some time, then we kill the 
job. We chose that path because of the concerns expressed above - e.g., on 
our RR machine, intense computations can be underway on the Cell blades while 
the Opteron MPI processes wait for us to reach a communication point. We 
-want- those processes spinning away so that, when the comm starts, it can 
proceed as quickly as possible.


Just some thoughts...
Ralph


On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:


Sylvain Jeaugey wrote:

Hi Ralph,

I'm entirely convinced that MPI doesn't have to save power in a normal 
scenario. The idea is just that if an MPI process is blocked (i.e. has not 
made progress for, say, 5 minutes, the default in my implementation), we 
stop busy polling and have the process drop from 100% CPU usage to 0%.


I do not call sleep() but usleep(). The result is quite the same, but it 
hurts performance less in case of an (unexpected) restart.


However, the goal of my RFC was also to know if there was a cleaner way 
to achieve my goal, and from what I read, I guess I should look at the 
"tick" rate instead of trying to do my own delaying.


One way around this is to make all blocked communications (even SM) use 
poll to block for incoming messages.  Jeff and I have discussed this and 
had many false starts on it.  The biggest issue is coming up with a way to 
have blocks on the SM BTL converted to the system poll call without 
requiring a socket write for every packet.


The usleep solution works but is kind of ugly IMO.  I think when I looked 
at doing that, the overhead increased significantly for certain 
communications.  Maybe not for toy benchmarks, but for less synchronized 
processes I saw the usleep adding overhead where I didn't want it to.


--td
Don't worry, I was quite expecting the configure-in requirement. However, 
I don't think my patch is good for inclusion; it is only an example to 
describe what I want to achieve.


Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote:

I'm not entirely convinced this actually achieves your goals, but I can 
see some potential benefits. I'm also not sure that power consumption is 
that big of an issue that MPI needs to begin chasing "power saver" modes 
of operation, but that can be a separate debate some day.


I'm assuming you don't mean that you actually call "sleep()" as this 
would be very bad - I'm assuming you just change the opal_progress "tick" 
rate instead. True? If not, and you really call "sleep", then I would 
have to oppose adding this to the code base pending discussion with 
others who can corroborate that this won't cause problems.


Either way, I could live with this so long as it was done as a 
"configure-in" capability. Just having the params default to a value that 
causes the system to behave similarly to today isn't enough - we still 

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Ashley Pittman
On Mon, 2009-06-08 at 17:50 +0200, Sylvain Jeaugey wrote:
> Principle
> =
> 
> opal_progress() ensures the progression of MPI communication. The current 
> algorithm is a loop calling progress on all registered components. If the 
> program is blocked, the loop will busy-poll indefinitely.

I have some experience here due to implementing this feature (blocking
waits) on Quadrics hardware.  You're right that it can have benefits and
yielding the CPU when "idle" is a good thing in the general case.

The "correct" way for a process to relinquish the cpu is to block in a
select() or poll() call until data is received, whereupon it can wake up
and continue working. The major problem each and every MPI
implementation has is that select() only works for tcp/ip and not for
shared memory or any of the more exotic networks.  IMHO it would be much
preferred to solve this problem properly and block in the wakeable
select() rather than usleep().
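
As a minimal illustration of that pattern, assuming a TCP-like transport
where there is an fd to block on (purely a sketch, not OMPI code):

#include <sys/select.h>

/* Relinquish the CPU by blocking in select() until data arrives on the
 * transport's fd, instead of spinning or sleeping for a fixed period. */
void wait_for_message(int sock_fd)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(sock_fd, &rfds);

    /* Sleeps in the kernel at 0% CPU and returns as soon as data is
     * ready, so the wakeup cost is one syscall rather than the
     * remainder of a usleep() period. */
    select(sock_fd + 1, &rfds, NULL, NULL, NULL);

    /* ... now progress the receive path ... */
}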

In my experience, when done correctly, the performance is affected;
however, surprisingly, it can often lead to increased performance. We had
full coverage, though, so we were able to sleep early and wake up in a
timely manner on receiving any message.  Yielding even one cpu per node
from the application occasionally gives any background/os processing a
chance to run without impacting the performance of the application, so
enabling blocking waits can lead to quicker runtimes.

> Going to sleep after a certain amount of time with nothing received is 
> interesting for two things:
>
>   - Administrators can easily detect whether a job is deadlocked: all the 
> processes are in sleep(). Currently, all processors are using 100% CPU and 
> it is very hard to know whether progression is still happening or not.

This is a valuable thing to know, however I don't view the proposed
solution as the correct one. If this were the problem you were aiming to
solve, I'd recommend a different approach, more like the LANL solution
that Ralph described.

Yours,

Ashley Pittman.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI devel] Fwd: [Open MPI] #1927: v1.3 COMM_SPAWN loop testfails after ~120 spawns

2009-06-09 Thread Jeff Squyres
Tested -- seems to work for me.  I say we now let MTT sort it out  
(i.e., see if others hit this race condition) and apply to v1.3.


On Jun 9, 2009, at 4:46 AM, Ralph Castain wrote:


I don't think it would be very hard - I would have to create a patch
for it, but the fix is completely contained in one file and location.

I would like to have someone else test it, though, before we move it
across. It worked for me, but since it is a race condition, that isn't
entirely convincing.


On Jun 9, 2009, at 5:41 AM, Jeff Squyres wrote:

> I'd be in favor of bringing this to v1.3.  Are there other
> dependencies / would it be difficult?
>
>
> Begin forwarded message:
>
>> From: "Open MPI" 
>> Date: June 8, 2009 11:31:20 AM PDT
>> Cc: 
>> Subject: Re: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails
>> after ~120 spawns
>>
>> #1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
>> -------------------------+---------------------------------
>>  Reporter:  jsquyres     |      Owner:  rhc
>>      Type:  defect       |     Status:  closed
>>  Priority:  critical     |  Milestone:  Open MPI 1.3.4
>>   Version:  1.3 branch   | Resolution:  fixed
>>  Keywords:               |
>> -------------------------+---------------------------------
>> Changes (by rhc):
>>
>>  * status:  new => closed
>>  * resolution:  => fixed
>>
>>
>> Comment:
>>
>> This was due to a very tight loop on comm_spawn not giving enough
>> time for the prior proc to completely terminate (and thus free its file
>> descriptors) before the next proc was launched. Eventually, we built up a
>> backlog of terminations to process and ran out of fd's.
>>
>> We introduced a check-and-delay in the code that detects we don't have
>> enough fd's to launch another proc, and then waits a second to see if
>> enough become free before aborting.
>>
>> Fixed in trunk - can see if we want to bring it to 1.3.
>>
>> --
>> Ticket URL: 
>> Open MPI 
>>
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Multi-rail on openib

2009-06-09 Thread Pavel Shamis (Pasha)


Open MPI currently needs to have connected fabrics, but maybe that's 
something we would like to change in the future, having two separate 
rails. (Btw Pasha, will your current work enable this?)
I do not completely understand what you mean here by two separate 
rails...
Already today you may connect each port to a different subnet, and ports 
in the same subnet may talk to each other.

Pasha.


Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Ralph Castain
Couple of other things to help stimulate the thinking:

1. it isn't that OMPI -couldn't- receive a message, but rather that it
-didn't- receive a message. This may or may not indicate that there is a
problem. Could just be an application that doesn't need to communicate for
awhile, as per my example. I admit, though, that 10 minutes is a tad
long...but I've seen some bizarre apps around here :-)

2. instead of putting things to sleep or even adjusting the loop rate, you
might want to consider using the orte_notifier capability and notify the
system that the job may be stalled. Or perhaps adding an API to the
orte_errmgr framework to notify it that nothing has been received for
awhile, and let people implement different strategies for detecting what
might be "wrong" and what they want to do about it.

My point with this second bullet is that there are other response options
than hardwiring putting the process to sleep. You could let someone know so
a human can decide what, if anything, to do about it, or provide a hook so
that people can explore/utilize different response strategies...or both!

HTH
Ralph


On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey wrote:

> I understand your point of view, and mostly share it.
>
> I think the biggest point in my example is that sleep occurs only after (I
> was wrong in my previous e-mail) 10 minutes of inactivity, and this value is
> fully configurable. I didn't intend to call sleep after 2 seconds. Plus, as
> said before, I planned to have the library do show_help() when this happens
> (something like : "Open MPI couldn't receive a message for 10 minutes,
> lowering pressure") so that the application that really needs more than 10
> minutes to receive a message can increase it.
>
> Looking at the tick rate code, I couldn't see how changing it would make
> CPU usage drop. If I understand correctly your e-mail, you block in the
> kernel using poll(), is that right? So, you may well lose 10 us because of
> that kernel call, but this is a lot less than the 1 ms I'm currently losing
> with usleep. This makes sense - although it is hard to implement since all
> BTLs must have this ability.
>
> Thanks for your comments, I will continue to think about it.
>
> Sylvain
>
>
> On Tue, 9 Jun 2009, Ralph Castain wrote:
>
>  My concern with any form of sleep is with the impact on the proc - since
>> opal_progress might not be running in a separate thread, won't the sleep
>> apply to the process as a whole? In that case, the process isn't free to
>> continue computing.
>>
>> I can envision applications that might call down into the MPI library and
>> have opal_progress not find anything, but there is nothing wrong. The
>> application could continue computations just fine. I would hate to see us
>> put the process to sleep just because the MPI library wasn't busy enough.
>>
>> Hence my suggestion to just change the tick rate. It would definitely
>> cause a higher latency for the first message that arrived while in this
>> state, which is bothersome, but would meet the stated objective without
>> interfering with the process itself.
>>
>> LANL has also been looking at this problem of stalled jobs, but from a
>> different approach. We monitor (using a separate job) progress in terms of
>> output files changing in size plus other factors as specified by the user.
>> If we don't see any progress in those terms over some time, then we kill the
>> job. We chose that path because of the concerns expressed above - e.g., on
>> our RR machine, intense computations can be underway on the Cell blades
>> while the Opteron MPI processes wait for us to reach a communication point.
>> We -want- those processes spinning away so that, when the comm starts, it
>> can proceed as quickly as possible.
>>
>> Just some thoughts...
>> Ralph
>>
>>
>> On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:
>>
>>  Sylvain Jeaugey wrote:
>>>
 Hi Ralph,

 I'm entirely convinced that MPI doesn't have to save power in a normal
 scenario. The idea is just that if an MPI process is blocked (i.e. has not
 made progress for, say, 5 minutes, the default in my implementation), we
 stop busy polling and have the process drop from 100% CPU usage to 0%.

 I do not call sleep() but usleep(). The result is quite the same, but it
 hurts performance less in case of an (unexpected) restart.

 However, the goal of my RFC was also to know if there was a cleaner
 way to achieve my goal, and from what I read, I guess I should look at the
 "tick" rate instead of trying to do my own delaying.

  One way around this is to make all blocked communications (even SM)
>>> use poll to block for incoming messages.  Jeff and I have discussed this and
>>> had many false starts on it.  The biggest issue is coming up with a way to
>>> have blocks on the SM BTL converted to the system poll call without
>>> requiring a socket write for every packet.
>>>
>>> The usleep solution works but is

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey

On Tue, 9 Jun 2009, Ralph Castain wrote:


2. instead of putting things to sleep or even adjusting the loop rate, you 
might want to consider using the orte_notifier
capability and notify the system that the job may be stalled. Or perhaps adding 
an API to the orte_errmgr framework to
notify it that nothing has been received for awhile, and let people implement 
different strategies for detecting what might
be "wrong" and what they want to do about it.
Great remark. What is really needed here is the information that "nothing 
has been received for X minutes". Just having that information somewhere 
should be sufficient. We often see users asking whether their application 
is still progressing, and this should answer their questions. It would 
also address the need of administrators to stop deadlocked runs during 
the night.


I guess I'll redirect my work on this and couple it with our current 
effort on logging and administration tools.


Thanks a lot guys !

Sylvain


My point with this second bullet is that there are other response options than 
hardwiring putting the process to sleep. You
could let someone know so a human can decide what, if anything, to do about it, 
or provide a hook so that people can
explore/utilize different response strategies...or both!

HTH
Ralph


On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey  
wrote:
  I understand your point of view, and mostly share it.

  I think the biggest point in my example is that sleep occurs only after 
(I was wrong in my previous e-mail) 10
  minutes of inactivity, and this value is fully configurable. I didn't 
intend to call sleep after 2 seconds.
  Plus, as said before, I planned to have the library do show_help() when this 
happens (something like : "Open
  MPI couldn't receive a message for 10 minutes, lowering pressure") so 
that the application that really needs
  more than 10 minutes to receive a message can increase it.

  Looking at the tick rate code, I couldn't see how changing it would make 
CPU usage drop. If I understand
  correctly your e-mail, you block in the kernel using poll(), is that 
right? So, you may well lose 10 us
  because of that kernel call, but this is a lot less than the 1 ms I'm 
currently losing with usleep. This makes
  sense - although it is hard to implement since all BTLs must have this 
  ability.

  Thanks for your comments, I will continue to think about it.

  Sylvain


On Tue, 9 Jun 2009, Ralph Castain wrote:

  My concern with any form of sleep is with the impact on the proc - since 
opal_progress might not be
  running in a separate thread, won't the sleep apply to the process as a 
whole? In that case, the process
  isn't free to continue computing.

  I can envision applications that might call down into the MPI library and 
have opal_progress not find
  anything, but there is nothing wrong. The application could continue 
computations just fine. I would hate
  to see us put the process to sleep just because the MPI library wasn't 
busy enough.

  Hence my suggestion to just change the tick rate. It would definitely 
cause a higher latency for the
  first message that arrived while in this state, which is bothersome, but 
would meet the stated objective
  without interfering with the process itself.

  LANL has also been looking at this problem of stalled jobs, but from a 
different approach. We monitor
  (using a separate job) progress in terms of output files changing in size 
plus other factors as specified
  by the user. If we don't see any progress in those terms over some time, 
then we kill the job. We chose
  that path because of the concerns expressed above - e.g., on our RR 
machine, intense computations can be
  underway on the Cell blades while the Opteron MPI processes wait for us 
to reach a communication point.
  We -want- those processes spinning away so that, when the comm starts, it 
can proceed as quickly as
  possible.

  Just some thoughts...
  Ralph


  On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:

Sylvain Jeaugey wrote:
  Hi Ralph,

  I'm entirely convinced that MPI doesn't have to save power in a normal
  scenario. The idea is just that if an MPI process is blocked (i.e. has
  not made progress for, say, 5 minutes, the default in my
  implementation), we stop busy polling and have the process drop from
  100% CPU usage to 0%.

  I do not call sleep() but usleep(). The result is quite the same, but
  it hurts performance less in case of an (unexpected) restart.

  However, the goal of my RFC was also to know if there was a cleaner
  way to achieve my goal, and from what I read, I guess I should look at
  the "tick" rate instead of trying to do my own delaying.

One way a

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Jeff Squyres
I'll throw in my random $0.02.  I'm at the Forum this week, so my  
latency on replies here will likely be large.


1. Ashley is correct that we shouldn't sleep.  A better solution would  
be to block waiting for something to happen (rather than spin).  As  
Terry mentioned, we pretty much know how to do this -- it's just that  
no one has done it yet.  The full solution would then be: if we spin  
for a while (probably MCA-param settable) with nothing happening,  
switch to the blocking mode and continue waiting.  I'm happy to pass  
on the information on how we've imagined that this should be done, if  
you want.


2. Note that your solution presupposes that one MPI process can detect  
that the entire job is deadlocked.  This is not quite correct.  What  
exactly do you want to detect -- that one process may be imbalanced on  
its receives (waiting for long periods of time without doing  
anything), or that the entire job is deadlocked?  The former may be ok  
-- it depends on the app.  If the latter, it requires a bit more work  
-- e.g., if one process detects that nothing has happened for a long  
time, it can initiate a collective/distributed deadlock detection  
algorithm with all the other MPI processes in the job.  Only if *all*  
processes agree, then you can say "this job is deadlocked, we might as  
well abort."  IIRC, there are some 3rd party tools / libraries that do  
this kind of stuff...?  (although it might be cool / useful to  
incorporate some of this technology into OMPI itself)
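
A toy sketch of the agreement step in point 2 (glossing over the obvious
bootstrap problem that a truly deadlocked job may never complete a
collective - real detectors work out-of-band or build wait-for graphs):

#include <mpi.h>

/* Each process votes on whether it has seen recent progress; only if
 * *all* processes report being stalled do we declare the job deadlocked. */
int job_looks_deadlocked(int i_am_stalled, MPI_Comm comm)
{
    int all_stalled = 0;
    MPI_Allreduce(&i_am_stalled, &all_stalled, 1,
                  MPI_INT, MPI_LAND, comm);
    return all_stalled;
}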


3. As Ralph noted, how exactly do you know when "nothing happens for a  
long time" is a bad thing?  a) some codes are structured that way --  
that they'll have no MPI activity for a long time, even if they have  
pending non-blocking receives pre-posted.  b) are you looking within  
the scope of *one* MPI blocking call?  I.e., if nothing happens  
*within the span of one blocking MPI call*, or are you looking if  
nothing happens across successive calls to opal_progress() (which may  
be few and far between after OMPI hits steady state when using non-TCP  
networks)?  It seems like there would need to be a [thread safe]  
"reset" at some point -- indicating that something has happened.  That  
either would be when something has happened, or that a blocking MPI  
call has exited, or ?  Need to make sure that that "reset" doesn't  
get expensive.
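
For the "reset" in point 3, the aim would presumably be a single word-sized
store per event; a hypothetical sketch (all names made up):

#include <stdint.h>

/* opal_progress() would bump a counter whenever any component reports an
 * event; the idle detector just watches the counter change. A plain
 * word-sized store keeps the fast path cheap (no locks), at the cost of a
 * slightly stale view, which is fine for this purpose. */
static volatile uint32_t progress_events = 0;

static void progress_mark_activity(void)
{
    progress_events++;                    /* the cheap "reset"           */
}

static int progress_was_idle(uint32_t *last_seen)
{
    uint32_t now  = progress_events;
    int      idle = (now == *last_seen);  /* no events since last check  */
    *last_seen = now;
    return idle;
}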


4. Note, too, that opal_progress() doesn't see *all* progress - the  
openib BTL doesn't use opal_progress to know when OpenFabrics messages  
arrive, for example.



On Jun 9, 2009, at 6:43 AM, Ralph Castain wrote:


Couple of other things to help stimulate the thinking:

1. it isn't that OMPI -couldn't- receive a message, but rather that  
it -didn't- receive a message. This may or may not indicate that  
there is a problem. Could just be an application that doesn't need  
to communicate for awhile, as per my example. I admit, though, that  
10 minutes is a tad long...but I've seen some bizarre apps around  
here :-)


2. instead of putting things to sleep or even adjusting the loop  
rate, you might want to consider using the orte_notifier capability  
and notify the system that the job may be stalled. Or perhaps adding  
an API to the orte_errmgr framework to notify it that nothing has  
been received for awhile, and let people implement different  
strategies for detecting what might be "wrong" and what they want to  
do about it.


My point with this second bullet is that there are other response  
options than hardwiring putting the process to sleep. You could let  
someone know so a human can decide what, if anything, to do about  
it, or provide a hook so that people can explore/utilize different  
response strategies...or both!


HTH
Ralph


On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey wrote:

I understand your point of view, and mostly share it.

I think the biggest point in my example is that sleep occurs only  
after (I was wrong in my previous e-mail) 10 minutes of inactivity,  
and this value is fully configurable. I didn't intend to call sleep  
after 2 seconds. Plus, as said before, I planned to have the library  
do show_help() when this happens (something like : "Open MPI  
couldn't receive a message for 10 minutes, lowering pressure") so  
that the application that really needs more than 10 minutes to  
receive a message can increase it.


Looking at the tick rate code, I couldn't see how changing it would  
make CPU usage drop. If I understand correctly your e-mail, you  
block in the kernel using poll(), is that right? So, you may well  
lose 10 us because of that kernel call, but this is a lot less than  
the 1 ms I'm currently losing with usleep. This makes sense -  
although it is hard to implement since all BTLs must have this ability.


Thanks for your comments, I will continue to think about it.

Sylvain


On Tue, 9 Jun 2009, Ralph Castain wrote:

My concern with any form of sleep is with the impact on the proc -  
since opal

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Jeff Squyres

On Jun 9, 2009, at 8:31 AM, Jeff Squyres (jsquyres) wrote:


4. Note, too, that opal_progress() doesn't see *all* progress - the
openib BTL doesn't use opal_progress to know when OpenFabrics messages
arrive, for example.




Wait, I lied -- sorry.

opal_progress will call the bml progress, which then calls each of the  
btl progress functions (or we changed it so that opal_progress directly  
calls each btl progress function -- I forget which).  So technically  
opal_progress will see that "something" happened.  But what that  
"something" is is unknown -- it could be a control message, or somesuch.


(I don't recall offhand what the openib btl's progress function returns)

--
Jeff Squyres
Cisco Systems



[OMPI devel] Hang in collectives involving shared memory

2009-06-09 Thread Ralph Castain
Hi folks

As mentioned in today's telecon, we at LANL are continuing to see hangs when
running even small jobs that involve shared memory in collective operations.
This has been the topic of discussion before, but I bring it up again
because (a) the problem is beginning to become epidemic across our
application codes, and (b) repeated testing provides more info and (most
importantly) confirms that this problem -does not- occur under 1.2.x - it is
strictly a 1.3.2 (we haven't checked to see if it is in 1.3.0 or 1.3.1)
problem.

The condition is caused when the application performs a loop over collective
operations such as MPI_Allgather, MPI_Reduce, and MPI_Bcast. This list is
not intended to be exhaustive, but only represents the ones for which we
have solid and repeatable data. The symptoms are a "hanging" job, typically
(but not always!) associated with fully-consumed memory. The loops do not
have to involve substantial amounts of memory (the Bcast loop hangs after
moving a whole 32Mbytes, total), nor involve high loop counts. They only
have to repeatedly call the collective.
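
For illustration, the problematic pattern is essentially the following (a
generic reproducer sketch matching the 32-Mbyte figure above, not one of
our application codes):

#include <mpi.h>

/* A tight loop over a rooted collective: 1000 broadcasts of 32 Kbytes
 * each, i.e. 32 Mbytes total. With the sm BTL, slower ranks can pile up
 * unexpected-message buffers until the job appears to hang. */
int main(int argc, char **argv)
{
    static int buf[8192];            /* 8192 ints = 32 Kbytes per Bcast */
    int i;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 1000; i++) {
        MPI_Bcast(buf, 8192, MPI_INT, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}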

Disabling the shared memory BTL is enough to completely resolve the problem.
However, this creates an undesirable performance penalty we would like to
avoid, if possible.

Our current solution is to use the "sync" collective to occasionally insert
an MPI_Barrier into the code "behind the scenes" - i.e., to add an
MPI_Barrier call every N number of calls to "problem" collectives. The
argument in favor of this was that the hang is caused by consuming memory
due to "unexpected messages", caused principally by the root process in the
collective running slower than other procs. Thus, the notion goes, the root
process continues to fall further and further behind, consuming ever more
memory until it simply cannot progress. Adding the barrier operation forced
the other procs to "hold" until the root process could catch up, thereby
relieving the memory backlog.
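
For reference, the workaround is driven through MCA parameters, i.e.
something like:

  mpirun --mca coll_sync_barrier_before 100 ./app

which inserts a barrier before every 100th "problem" collective (the value
100 here is purely illustrative).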

The sync collective has worked for us, but we are now finding a very
disconcerting behavior - namely, that the precise value of N required to
avoid hanging (a) is very, very sensitive and can still let the app hang
even when changing the value by small amounts, (b) fluctuates between runs
on an unpredictable basis, and (c) can be different for different
collectives.

These new problems surfaced this week when we found that a job that
previously ran fine with one value of coll_sync_barrier_before suddenly hung
when a loop over MPI_Bcast was added to the code. Further investigation has
found that the value of N required to make the new loop work is
significantly different from the prior value that made Allgather work,
forcing an exhaustive search for a "sweet spot" for N.

Clearly, as codes grow in complexity, this simply is not going to work.

It seems to me that we have to begin investigating -why- the 1.3.2 code is
encountering this problem whereas the 1.2.x code is not. From our rough
measurements, there is some speed difference between the two releases, so
perhaps we are now getting fast enough to create the problem - I don't think
we know enough yet to really claim this is true. At this time, we really
don't know -why- one process is running slow, or even if it is -always- the
root process that is doing so...nor have we confirmed (to my knowledge) that
our original analysis of the problem is correct!

We would appreciate any help with this problem. I gathered from today's
telecon that others are also encountering this, so perhaps there is enough
general pain to stimulate a team effort to resolve it!

Thanks
Ralph