Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-08 Thread Ralph Castain
I'm not entirely convinced this actually achieves your goals, but I  
can see some potential benefits. I'm also not sure that power  
consumption is that big of an issue that MPI needs to begin chasing  
"power saver" modes of operation, but that can be a separate debate  
some day.


I'm assuming you don't mean that you actually call "sleep()" as this  
would be very bad - I'm assuming you just change the opal_progress  
"tick" rate instead. True? If not, and you really call "sleep", then I  
would have to oppose adding this to the code base pending discussion  
with others who can corroborate that this won't cause problems.


Either way, I could live with this so long as it was done as a  
"configure-in" capability. Just having the params default to a value  
that causes the system to behave similarly to today isn't enough - we  
still wind up adding logic into a very critical timing loop for no  
reason. A simple configure option of --enable-mpi-progress-monitoring  
would be sufficient to protect the code.
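For illustration, one minimal sketch of such a guard (the configure symbol and macro names below are hypothetical, not existing OMPI code) could be:

#if OPAL_ENABLE_PROGRESS_MONITORING
/* compiled in only when --enable-mpi-progress-monitoring was given */
#define OPAL_PROGRESS_IDLE_TICK(events)                      \
    do {                                                     \
        if (0 == (events)) {                                 \
            /* count idle progress calls, maybe usleep() */  \
        }                                                    \
    } while (0)
#else
/* default build: the critical loop stays exactly as it is today */
#define OPAL_PROGRESS_IDLE_TICK(events)  do { } while (0)
#endif

opal_progress() would then call OPAL_PROGRESS_IDLE_TICK(events) once per iteration, and the default build would pay nothing for it.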


HTH
Ralph


On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:

What: when nothing has been received for a very long time (e.g. 5 minutes), stop busy-polling in opal_progress and switch to a usleep-based loop.


Why: when we have long waits, and especially when an application is deadlocked, detecting it is not easy and a lot of power is wasted until the end of the time slice (if there is one).


Where: an example of how it could be implemented is available at
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/

Principle
=========

opal_progress() ensures the progression of MPI communication. The current algorithm is a loop calling progress on all registered components. If the program is blocked, the loop will busy-poll indefinitely.


Going to sleep after a certain amount of time with nothing received is interesting for two reasons:
- Administrators can easily detect whether a job is deadlocked: all the processes are in sleep(). Currently, all processors are using 100% CPU and it is very hard to know whether progression is still happening or not.

- When there is nothing to receive, power usage is highly reduced.

However, it could hurt performance in some cases, typically if we go to sleep just before a message arrives. This will depend heavily on the parameters given to the sleep mechanism.


At first, we can start with the following assumption: if the sleep takes T usec, then sleeping only after 10,000 x T of inactivity should slow down receives by at most about 0.01% (since T / (10,000 x T) = 10^-4).


However, other processes may suffer from you being late, and be  
delayed by T usec (which may represent more than 0.01% for them).


So, the goal of this mechanism is mainly to detect far-too-long waits; it should almost never trigger in normal MPI jobs. It could also emit a warning message when starting to sleep, or at least a trace in the notifier.


Details of Implementation
=========================

Three parameters fully control the behaviour of this mechanism:
* opal_progress_sleep_count: number of unsuccessful opal_progress() calls before we start the timer (to prevent latency impact). It defaults to -1, which completely deactivates the sleep (and is therefore equivalent to the former code). A value of 1000 can be thought of as a starting point to enable this mechanism.
* opal_progress_sleep_trigger: time to wait before going to low-pressure-powersave mode. Default: 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration: time we sleep at each further unsuccessful call to opal_progress(). Default: 1000 (in us) = 1 ms.


The duration is big enough to make the process show 0% CPU in top,  
but low enough to preserve a good trigger/duration ratio.


The trigger is voluntarily high to keep a good trigger/duration ratio. Indeed, to prevent delays from causing chain reactions, the trigger should be higher than duration x numprocs (e.g. with the 1 ms default duration, a 10,000-process job gives duration x numprocs = 10 s, well below the 600 s default trigger).
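To make the interaction of the three parameters concrete, here is a minimal sketch, assuming simplified names (poll_count, poll_start, callbacks[]) and ignoring progress threads - this is not the actual patch from the URL above:

#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>
#include <unistd.h>

typedef int (*progress_cb_t)(void);
static progress_cb_t callbacks[16];              /* registered progress components */
static size_t        callbacks_len = 0;

static int opal_progress_sleep_count    = -1;    /* MCA param: -1 disables the mechanism */
static int opal_progress_sleep_trigger  = 600;   /* MCA param: seconds idle before sleeping */
static int opal_progress_sleep_duration = 1000;  /* MCA param: usleep() length in microseconds */

static int64_t        poll_count = 0;            /* consecutive unsuccessful progress calls */
static struct timeval poll_start;                /* when the idle timer was armed */

void opal_progress(void)
{
    int events = 0;
    size_t i;

    for (i = 0; i < callbacks_len; ++i) {
        events += callbacks[i]();                /* poll every registered component */
    }

    if (events > 0 || opal_progress_sleep_count < 0) {
        poll_count = 0;                          /* activity seen, or feature disabled */
        return;
    }

    poll_count++;
    if (poll_count < opal_progress_sleep_count) {
        return;                                  /* too few idle calls yet: protect latency */
    }
    if (poll_count == opal_progress_sleep_count) {
        gettimeofday(&poll_start, NULL);         /* arm the idle timer */
        return;
    }

    {
        struct timeval now;
        gettimeofday(&now, NULL);
        if (now.tv_sec - poll_start.tv_sec >= opal_progress_sleep_trigger) {
            usleep(opal_progress_sleep_duration);   /* low-pressure mode: ~0% CPU in top */
        }
    }
}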


Possible Improvements & Pitfalls
================================


* Trigger could be set automatically at max(trigger, duration *  
numprocs * 2).


* poll_start and poll_count could be fields of the opal_condition_t struct (a sketch is given after this list).


* The sleep section could be factored into a #define and replicated in all the progress paths (I'm not sure my patch is correct for progress threads, for example).
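A sketch of the struct idea, assuming a simplified opal_condition_t layout; the added fields and their names are illustrative only:

#include <stdint.h>
#include <sys/time.h>
#include "opal/class/opal_object.h"   /* for opal_object_t */

struct opal_condition_t {
    opal_object_t   super;         /* existing: OPAL object header */
    volatile int    c_waiting;     /* existing: waiters on this condition */
    volatile int    c_signaled;    /* existing: pending signals */
    /* proposed additions for per-condition idle tracking: */
    int64_t         poll_count;    /* unsuccessful progress calls since last wakeup */
    struct timeval  poll_start;    /* when the idle timer was armed */
};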





Re: [OMPI devel] problem in the ORTE notifier framework

2009-06-08 Thread Ralph Castain
I believe the concern here was that we aren't entirely sure just where  
you plan to do this. If we are talking about reporting errors, then  
there is less concern about adding cycles. For example, we already  
check to see if the IB driver has exceeded the limit on retries -  
adding more logic to the code that executes when that test is positive  
is of little concern.


However, if we are talking about adding warnings that are not in the  
error paths, then there is concern because that code will execute  
every time, even when there isn't a problem. There is no issue with  
using likely() directives, but I'm not sure there is general agreement  
with your analysis regarding the potential impact of adding such code,  
and the belief that it only adds one cycle doesn't appear to be  
supported by our experience to date. Hence the cautions from other  
developers.


Regardless, it has been our general policy to add this kind of  
capability on a "configure-in" basis so that those who do not want it  
are not impacted by it. My proposed method would allow for that  
policy. Whether you use that approach, or devise your own, I do  
believe the "configure-in" policy really needs to be used for this  
capability.


Working on a tmp branch will give developers a chance to evaluate the  
overall impact and help people in deciding whether or not to enable  
this capability. I suspect (based on prior similar proposals) that  
many will choose -not- to enable it (e.g., research clusters in  
universities), while some (e.g., large production clusters) may well  
do so, depending on exactly what you are reporting.


HTH
Ralph



On Jun 8, 2009, at 4:57 AM, Sylvain Jeaugey wrote:


Ralph,

Sorry for answering on this old thread, but it seems that my answer  
was blocked in the "postponed" folder.


About the if-then, I thought it was 1 cycle. I mean, if you don't break the pipeline, i.e. use likely() or __builtin_expect() or something like that to be sure that the compiler will generate assembly in the right way, it shouldn't be more than 1 cycle, perhaps less on some architectures like Itanium [however, my multi-architecture view is somewhat limited to x86 and ia64, so I may be wrong].


So, in these if-then cases where we know which branch is the more  
likely to be used, I don't think that 1 CPU cycle is really a  
problem, especially if we are already in a slow code path.


Is there a multi-compiler, multi-arch, multi-OS reason not to use likely() directives?


Sylvain

On Wed, 27 May 2009, Ralph Castain wrote:

While that is a good way of minimizing the impact of the counter, you still have to do an "if-then" to check if the counter exceeds the threshold. This "if-then" also has to get executed every time, and generally consumes more than a few cycles.

To be clear: it isn't the output that is the concern. The output only occurs as an exception case, essentially equivalent to dealing with an error, so it can be "slow". The concern is with the impact of testing to see if the output needs to be generated, as this testing occurs every time we transit the code.

I think Jeff and I are probably closer to agreement on design than it might seem, and may be close to what you might also have had in mind. Basically, I was thinking of a macro like this:

ORTE_NOTIFIER_VERBOSE(api, counter, threshold, ...)

#if WANT_NOTIFIER_VERBOSE
    opal_atomic_increment(counter);
    if (counter > threshold) {
        orte_notifier.api(...);
    }
#endif

You would set the specific thresholds for each situation via MCA params, so this could be tuned to fit specific needs. Those who don't want the penalty can just build normally - those who want this level of information can enable it.

We can then see just how much penalty is involved in real-world situations. My guess is that it won't be that big, but it's hard to know without seeing how frequently we actually insert this code.

Hope that makes sense
Ralph
On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey  wrote:
  About performance, I may miss something, but our first goal was to track already slow paths.

  We imagined that it could be possible to add at the beginning (or end) of this "bad path" just one line that would basically do an atomic inc. So, in terms of CPU cycles, something like 1 for the inc and maybe 1 jump before. Are a couple of cycles really an issue in slow paths (which take at least hundreds of cycles), or do you fear out-of-cache memory accesses - or something else?

  As for outputs, they indeed are slow (and can considerably slow down an application if not synchronized), but aggregation on the head node should solve our problems. And if not, we can also disable outputs at runtime.

  So, in my opinion, no application should notice a difference (unless you tune the framework to output every warning).

  Sylvain
On Tue, 26 May 2009, Jeff Squyres wrote:

 Nadia --


Re: [OMPI devel] problem in the ORTE notifier framework

2009-06-08 Thread Sylvain Jeaugey

Ralph,

Sorry for answering on this old thread, but it seems that my answer was 
blocked in the "postponed" folder.


About the if-then, I thought it was 1 cycle. I mean, if you don't break 
the pipeline, i.e. use likely() or __builtin_expect() or something like that 
to be sure that the compiler will generate assembly in the right way, it 
shouldn't be more than 1 cycle, perhaps less on some architectures like 
Itanium [however, my multi-architecture view is somewhat limited to x86 
and ia64, so I may be wrong].


So, in these if-then cases where we know which branch is the more likely 
to be used, I don't think that 1 CPU cycle is really a problem, especially 
if we are already in a slow code path.


Is there a multi-compiler, multi-arch, multi-OS reason not to use likely() 
directives?
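For reference, a portability shim along these lines (a sketch, not OMPI's actual macros) degrades to a no-op on compilers without __builtin_expect:

#if defined(__GNUC__)
#define likely(x)    __builtin_expect(!!(x), 1)
#define unlikely(x)  __builtin_expect(!!(x), 0)
#else
#define likely(x)    (x)    /* no hint available: plain condition */
#define unlikely(x)  (x)
#endif

/* usage: the threshold test stays predicted "not taken" on the fast path */
if (unlikely(counter > threshold)) {
    /* slow path: hand the event to the notifier */
}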


Sylvain

On Wed, 27 May 2009, Ralph Castain wrote:


While that is a good way of minimizing the impact of the counter, you still have to do an 
"if-then" to check if the counter
exceeds the threshold. This "if-then" also has to get executed every time, and 
generally consumes more than a few cycles.

To be clear: it isn't the output that is the concern. The output only occurs as 
an exception case, essentially equivalent
to dealing with an error, so it can be "slow". The concern is with the impact 
of testing to see if the output needs to be
generated as this testing occurs every time we transit the code.

I think Jeff and I are probably closer to agreement on design than it might 
seem, and may be close to what you might also
have had in mind. Basically, I was thinking of a macro like this:

ORTE_NOTIFIER_VERBOSE(api, counter, threshold, ...)

#if WANT_NOTIFIER_VERBOSE
    opal_atomic_increment(counter);
    if (counter > threshold) {
        orte_notifier.api(...);
    }
#endif
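
To make the intended use concrete, a hypothetical call site could look like the following (every name here is invented for illustration, not existing OMPI code):

/* somewhere in an error path, e.g. after an IB completion reports
 * that the retry count was exceeded: */
static int32_t ib_retry_exceeded_count = 0;

ORTE_NOTIFIER_VERBOSE(log,                               /* which notifier API to invoke */
                      ib_retry_exceeded_count,           /* per-site counter */
                      mca_btl_openib_warn_threshold,     /* MCA-tunable threshold */
                      "openib BTL: %d retry-exceeded completions seen",
                      ib_retry_exceeded_count);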

You would set the specific thresholds for each situation via MCA params, so 
this could be tuned to fit specific needs.
Those who don't want the penalty can just build normally - those who want this 
level of information can enable it.

We can then see just how much penalty is involved in real world situations. My 
guess is that it won't be that big, but it's
hard to know without seeing how frequently we actually insert this code.

Hope that makes sense
Ralph


On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey  
wrote:
  About performance, I may miss something, but our first goal was to track 
already slow paths.

  We imagined that it could be possible to add at the beginning (or end) of this 
"bad path" just one line that
  would basically do an atomic inc. So, in terms of CPU cycles, something 
like 1 for the inc and maybe 1 jump
  before. Are a couple of cycles really an issue in slow paths (which take 
at least hundreds of cycles), or do
  you fear out-of-cache memory accesses - or something else ?

  As for outputs, they indeed are slow (and can considerably slow down an 
application if not synchronized), but
  aggregation on the head node should solve our problems. And if not, we 
can also disable outputs at runtime.

  So, in my opinion, no application should notice a difference (unless you 
tune the framework to output every
  warning).

  Sylvain


On Tue, 26 May 2009, Jeff Squyres wrote:

  Nadia --

  Sorry I didn't get to jump in on the other thread earlier.

  We have made considerable changes to the notifier framework in a branch to better 
support "SOS"
  functionality:

   https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos

  Cisco and Indiana U. have been working on this branch for a while.  A 
description of the SOS stuff is
  here:

   https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

  As for setting up an external web server with hg, don't bother -- just 
get an account at bitbucket.org.
   They're free and allow you to host hg repositories there.  I've used 
bitbucket to collaborate on code
  before it hits OMPI's SVN trunk with both internal and external OMPI 
developers.

  We can certainly move the opal-sos repo to bitbucket (or branch again off 
opal-sos to bitbucket --
  whatever makes more sense) to facilitate collaborating with you.

  Back on topic...

  I'd actually suggest a combination of what has been discussed in the 
other thread.  The notifier can be
  the mechanism that actually sends the output message, but it doesn't have 
to be the mechanism that tracks
  the stats and decides when to output a message.  That can be separate 
logic, and therefore be more
  fine-grained (and potentially even specific to the MPI layer).

  The Big Question will be how to do this with zero performance impact when it 
is not being used. This has
  always been the difficult issue when trying to implement any kind of 
monitoring inside the core OMPI
  performance-sensitive paths.  Even adding individual branches has met 
with resistance (in
  performance-critical code 

Re: [OMPI devel] Multi-rail on openib

2009-06-08 Thread Sylvain Jeaugey

Hi Tom,

Yes, there is a goal in mind, and it is definitely not performance: we are 
working on device failover, i.e. when a network adapter or switch fails, 
use the remaining one. We don't intend to improve performance with 
multi-rail (which, as you said, will not happen unless you have a DDR card 
with PCI Express x8 Gen2 and very nice routing - and money to pay for the 
doubled network :)).


The goal here is to use port 1 of each card as the primary way of 
communication over a fat tree, and port 2 as a failover path over a 
very light network, just to avoid aborting the MPI app, or at least to reach a 
checkpoint.


Don't worry, another team is working on opensm, so that routing stays 
optimal.


Thanks for your warnings, however; it's true that a lot of people see these 
"double port IB cards" as "doubled performance".


Sylvain

On Fri, 5 Jun 2009, Nifty Tom Mitchell wrote:


On Fri, Jun 05, 2009 at 09:52:39AM -0400, Jeff Squyres wrote:


See this FAQ entry for a description:

http://www.open-mpi.org/faq/?category=openfabrics#ofa-port-wireup

Right now, there's no way to force a particular connection pattern on
the openib btl at run-time.  The startup sequence has gotten
sufficiently complicated / muddied over the years that it would be quite
difficult to do so.  Pasha is in the middle of revamping parts of the
openib startup (see http://bitbucket.org/pasha/ompi-ofacm/); it *may* be
desirable to fully clean up the openib btl startup sequence when
he's all finished.


On Jun 5, 2009, at 9:48 AM, Mouhamed Gueye wrote:


Hi all,

I am working on  multi-rail IB and I was wondering how connections are
established between ports.  I have two hosts, each with 2 ports on a
same IB card, connected to the same switch.



Is there a goal in mind?

In general multi-rail cards run into bandwidth and congestion issues
with the host bus.  If your card's system side interface cannot support
the bandwidth of twin IB links then it is possible that bandwidth would
be reduced by the interaction.

If the host bus and memory system are fast enough, then it is worth
working with the vendor.

In addition to system bandwidth, the subnet manager may need to be enhanced
to be multi-port-card aware. Since IB fabric routes are static, it is possible
to route or use pairs of links in an identical enough way that there is
little bandwidth gain when multiple switches are involved.

Your two-host case may be simple enough to explore
and/or to generate illuminating or misleading results.
It is a good place to start.

Start with a look at opensm and the fabric, then watch how Open MPI
or your applications use the resulting LIDs.  If you are using IB directly
and not MPI, the list of protocol choices grows dramatically but still
centers on LIDs as assigned by the subnet manager (see opensm).

How many CPU cores (ranks) are you working with?

Do be specific about the IB hardware and associated firmware; there are
multiple choices out there, and the vendor may be able to help...

--
T o m  M i t c h e l l
Found me a new hat, now what?
