Re: [OMPI devel] Multi-rail on openib

2009-06-08 Thread Sylvain Jeaugey

Hi Tom,

Yes, there is a goal in mind, and it is definitely not performance: we are 
working on device failover, i.e. when a network adapter or switch fails, 
use the remaining one. We don't intend to improve performance with 
multi-rail (which, as you said, will not happen unless you have a DDR card 
with PCI Express 8x Gen2, very nice routing - and money to pay for the 
doubled network :)).


The goal here is to use port 1 of each card as a primary way of 
communication with a fat tree and port 2 as a failover solution with a 
very light network, just to avoid aborting the MPI app or at least reach a 
checkpoint.
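
For reference, on builds where the openib BTL provides the 
btl_openib_if_include MCA parameter, traffic can already be pinned to one 
port per HCA by hand; a hypothetical invocation (the device name mlx4_0 is 
only an example) could look like:

# keep the openib BTL on port 1 of each HCA; port 2 stays free for the
# (future) failover path
mpirun --mca btl openib,sm,self \
       --mca btl_openib_if_include mlx4_0:1 \
       -np 16 ./my_mpi_app

The failover switch to port 2 is of course the part that does not exist yet.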


Don't worry, another team is working on opensm, so that routing stays 
optimal.


Thanks for your warnings, however; it's true that a lot of people see these 
"double port IB cards" as "doubled performance".


Sylvain

On Fri, 5 Jun 2009, Nifty Tom Mitchell wrote:


On Fri, Jun 05, 2009 at 09:52:39AM -0400, Jeff Squyres wrote:


See this FAQ entry for a description:

http://www.open-mpi.org/faq/?category=openfabrics#ofa-port-wireup

Right now, there's no way to force a particular connection pattern on
the openib btl at run-time.  The startup sequence has gotten
sufficiently complicated / muddied over the years that it would be quite
difficult to do so.  Pasha is in the middle of revamping parts of the
openib startup (see http://bitbucket.org/pasha/ompi-ofacm/); it *may* be
desirable to fully clean up the full openib btl startup sequence when
he's all finished.


On Jun 5, 2009, at 9:48 AM, Mouhamed Gueye wrote:


Hi all,

I am working on multi-rail IB and I was wondering how connections are
established between ports.  I have two hosts, each with 2 ports on the
same IB card, connected to the same switch.



Is there a goal in mind?

In general multi-rail cards run into bandwidth and congestion issues
with the host bus.  If your card's system side interface cannot support
the bandwidth of twin IB links then it is possible that bandwidth would
be reduced by the interaction.

If the host bus and memory system are fast enough, then
work with the vendor.

In addition to system bandwidth the subnet manager may need to be enhanced
to be multi-port card aware.   Since IB fabric routes are static it is possible
to route or use pairs of links in an identical enough way that there is
little bandwidth gain when multiple switches are involved.

Your two-host case may be simple enough to explore
and/or generate illuminating or misleading results.
It is a good place to start.

Start with a look at opensm and the fabric then watch how Open MPI
or your applications use the resulting LIDs.  If you are using IB directly
and not MPI then the list of protocol choices grows dramatically but still
centers on LIDs as assigned by the subnet manager (see opensm).

How many CPU cores (ranks) are you working with?

Do be specific about the IB hardware and associated firmware;
there are multiple choices out there and the vendor may be able to help...

--
T o m  M i t c h e l l
Found me a new hat, now what?





Re: [OMPI devel] problem in the ORTE notifier framework

2009-06-08 Thread Sylvain Jeaugey

Ralph,

Sorry for answering on this old thread, but it seems that my answer was 
blocked in the "postponed" folder.


About the if-then, I thought it was 1 cycle. I mean, if you don't break 
the pipeline, i.e. if you use likely() or __builtin_expect() or something 
like that to be sure that the compiler will generate the assembly in the 
right way, it shouldn't be more than 1 cycle, perhaps less on some 
architectures like Itanium [however, my multi-architecture view is 
somewhat limited to x86 and ia64, so I may be wrong].


So, in these if-then cases where we know which branch is the more likely 
to be used, I don't think that 1 CPU cycle is really a problem, especially 
if we are already in a slow code path.
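
For illustration, a minimal sketch of the kind of hint I mean (the UNLIKELY 
name, the fallback and the surrounding variables are only placeholders; 
__builtin_expect() is the GCC-family builtin):

#if defined(__GNUC__)
#  define UNLIKELY(cond) __builtin_expect(!!(cond), 0)  /* hint: cond is rarely true */
#else
#  define UNLIKELY(cond) (cond)                         /* no-op fallback */
#endif

extern void report_excessive_retries(void);  /* hypothetical slow-path handler */

/* hypothetical hot-path fragment: the compiler keeps the common case as
 * straight-line code and moves the warning branch out of the fall-through */
void check_retries(int retry_count, int retry_threshold)
{
    if (UNLIKELY(retry_count > retry_threshold)) {
        report_excessive_retries();   /* cold path: its cost does not matter */
    }
}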


Is there a multi-compiler, multi-arch, multi-OS reason not to use likely() 
directives ?


Sylvain

On Wed, 27 May 2009, Ralph Castain wrote:


While that is a good way of minimizing the impact of the counter, you still have to do an 
"if-then" to check if the counter
exceeds the threshold. This "if-then" also has to get executed every time, and 
generally consumes more than a few cycles.

To be clear: it isn't the output that is the concern. The output only occurs as 
an exception case, essentially equivalent
to dealing with an error, so it can be "slow". The concern is with the impact 
of testing to see if the output needs to be
generated as this testing occurs every time we transit the code.

I think Jeff and I are probably closer to agreement on design than it might 
seem, and may be close to what you might also
have had in mind. Basically, I was thinking of a macro like this:

ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)

#if WANT_NOTIFIER_VERBOSE
opal_atomic_increment(counter);
if (counter > threshold) {
    orte_notifier.api(...)
}
#endif
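
As a rough expansion into real C, the above might look like the following; 
the do/while(0) wrapper and the empty compiled-out variant are just the 
usual macro conventions, not a committed implementation:

#if WANT_NOTIFIER_VERBOSE
#define ORTE_NOTIFIER_VERBOSE(api, counter, threshold, ...)        \
    do {                                                           \
        /* as in the pseudo-code above; e.g. opal_atomic_add_32() */ \
        opal_atomic_increment(&(counter));                         \
        if ((counter) > (threshold)) {                             \
            orte_notifier.api(__VA_ARGS__);                        \
        }                                                          \
    } while (0)
#else
/* compiled out: no counter update, no branch, zero cost */
#define ORTE_NOTIFIER_VERBOSE(api, counter, threshold, ...)
#endif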

You would set the specific thresholds for each situation via MCA params, so 
this could be tuned to fit specific needs.
Those who don't want the penalty can just build normally - those who want this 
level of information can enable it.

We can then see just how much penalty is involved in real world situations. My 
guess is that it won't be that big, but it's
hard to know without seeing how frequently we actually insert this code.

Hope that makes sense
Ralph


On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey  
wrote:
  About performance, I may be missing something, but our first goal was to 
track already slow paths.

  We imagined that it could be possible to add at the beginning (or end) of this 
"bad path" just one line that
  would basically do an atomic inc. So, in terms of CPU cycles, something 
like 1 for the inc and maybe 1 jump
   before. Are a couple of cycles really an issue in slow paths (which take 
at least hundreds of cycles), or do
   you fear out-of-cache memory accesses - or something else ?

  As for outputs, they are indeed slow (and can considerably slow down an 
application if not synchronized), but
  aggregation on the head node should solve our problems. And if not, we 
can also disable outputs at runtime.

  So, in my opinion, no application should notice a difference (unless you 
tune the framework to output every
  warning).

  Sylvain


On Tue, 26 May 2009, Jeff Squyres wrote:

  Nadia --

  Sorry I didn't get to jump in on the other thread earlier.

  We have made considerable changes to the notifier framework in a branch to better 
support "SOS"
  functionality:

   https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos

  Cisco and Indiana U. have been working on this branch for a while.  A 
description of the SOS stuff is
  here:

   https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

  As for setting up an external web server with hg, don't bother -- just 
get an account at bitbucket.org.
   They're free and allow you to host hg repositories there.  I've used 
bitbucket to collaborate on code
  before it hits OMPI's SVN trunk with both internal and external OMPI 
developers.

  We can certainly move the opal-sos repo to bitbucket (or branch again off 
opal-sos to bitbucket --
  whatever makes more sense) to facilitate collaborating with you.

  Back on topic...

  I'd actually suggest a combination of what has been discussed in the 
other thread.  The notifier can be
  the mechanism that actually sends the output message, but it doesn't have 
to be the mechanism that tracks
  the stats and decides when to output a message.  That can be separate 
logic, and therefore be more
  fine-grained (and potentially even specific to the MPI layer).

  The Big Question will be how to do this with zero performance impact when it 
is not being used. This has
  always been the difficult issue when trying to implement any kind of 
monitoring inside the core OMPI
  performance-sensitive paths.  Even adding individual branches has met 
with resistance (in
  performance-critical code paths)...



  On May 26, 2009

Re: [OMPI devel] problem in the ORTE notifier framework

2009-06-08 Thread Ralph Castain
I believe the concern here was that we aren't entirely sure just where  
you plan to do this. If we are talking about reporting errors, then  
there is less concern about adding cycles. For example, we already  
check to see if the IB driver has exceeded the limit on retries -  
adding more logic to the code that executes when that test is positive  
is of little concern.


However, if we are talking about adding warnings that are not in the  
error paths, then there is concern because that code will execute  
every time, even when there isn't a problem. There is no issue with  
using likely() directives, but I'm not sure there is general agreement  
with your analysis regarding the potential impact of adding such code,  
and the belief that it only adds one cycle doesn't appear to be  
supported by our experience to date. Hence the cautions from other  
developers.


Regardless, it has been our general policy to add this kind of  
capability on a "configure-in" basis so that those who do not want it  
are not impacted by it. My proposed method would allow for that  
policy. Whether you use that approach, or devise your own, I do  
believe the "configure-in" policy really needs to be used for this  
capability.


Working on a tmp branch will give developers a chance to evaluate the  
overall impact and help people in deciding whether or not to enable  
this capability. I suspect (based on prior similar proposals) that  
many will choose -not- to enable it (e.g., research clusters in  
universities), while some (e.g., large production clusters) may well  
do so, depending on exactly what you are reporting.


HTH
Ralph



On Jun 8, 2009, at 4:57 AM, Sylvain Jeaugey wrote:


Ralph,

Sorry for answering on this old thread, but it seems that my answer  
was blocked in the "postponed" folder.


About the if-then, I thought it was 1 cycle. I mean, if you don't  
break the pipeline, i.e. if you use likely() or __builtin_expect() or  
something like that to be sure that the compiler will generate the  
assembly in the right way, it shouldn't be more than 1 cycle,  
perhaps less on some architectures like Itanium [however, my  
multi-architecture view is somewhat limited to x86 and ia64, so I may be  
wrong].


So, in these if-then cases where we know which branch is the more  
likely to be used, I don't think that 1 CPU cycle is really a  
problem, especially if we are already in a slow code path.


Is there a multi-compiler, multi-arch, multi-OS reason not to use  
likely() directives ?


Sylvain

On Wed, 27 May 2009, Ralph Castain wrote:

While that is a good way of minimizing the impact of the counter,  
you still have to do an "if-then" to check if the counter
exceeds the threshold. This "if-then" also has to get executed  
every time, and generally consumes more than a few cycles.
To be clear: it isn't the output that is the concern. The output  
only occurs as an exception case, essentially equivalent
to dealing with an error, so it can be "slow". The concern is with  
the impact of testing to see if the output needs to be

generated as this testing occurs every time we transit the code.
I think Jeff and I are probably closer to agreement on design than  
it might seem, and may be close to what you might also

have had in mind. Basically, I was thinking of a macro like this:
ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)
#if WANT_NOTIFIER_VERBOSE
opal_atomic_increment(counter);
if (counter > threshold) {
orte_notifier.api(...)
}
#endif
You would set the specific thresholds for each situation via MCA  
params, so this could be tuned to fit specific needs.
Those who don't want the penalty can just build normally - those  
who want this level of information can enable it.
We can then see just how much penalty is involved in real world  
situations. My guess is that it won't be that big, but it's
hard to know without seeing how frequently we actually insert this  
code.

Hope that makes sense
Ralph
On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey wrote:
 About performance, I may be missing something, but our first goal  
was to track already slow paths.


 We imagined that it could be possible to add at the beginning  
(or end) of this "bad path" just one line that
 would basically do an atomic inc. So, in terms of CPU cycles,  
something like 1 for the inc and maybe 1 jump
 before. Are a couple of cycles really an issue in slow paths  
(which take at least hundreds of cycles), or do

 you fear out-of-cache memory accesses - or something else ?

 As for outputs, they are indeed slow (and can considerably  
slow down an application if not synchronized), but
 aggregation on the head node should solve our problems. And if  
not, we can also disable outputs at runtime.


 So, in my opinion, no application should notice a difference  
(unless you tune the framework to output every

 warning).

 Sylvain
On Tue, 26 May 2009, Jeff Squyres wrote:

 Nadia --

 Sorry I didn't get to 

[OMPI devel] [RFC] Low pressure OPAL progress

2009-06-08 Thread Sylvain Jeaugey
What : when nothing has been received for a very long time (e.g. 5 
minutes), stop busy-polling in opal_progress and switch to a usleep-based 
one.


Why : when we have long waits, and especially when an application is 
deadlocked, detecting it is not easy and a lot of power is wasted until 
the end of the time slice (if there is one).


Where : an example of how it could be implemented is available at 
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/


Principle
=

opal_progress() ensures the progression of MPI communication. The current 
algorithm is a loop calling progress on all registered components. If the 
program is blocked, the loop will busy-poll indefinitely.


Going to sleep after a certain amount of time with nothing received is 
interesting for two things :
 - Administrators can easily detect whether a job is deadlocked: all the 
processes are in sleep(). Currently, all processors are using 100% CPU and 
it is very hard to know if progression is still happening or not.

 - When there is nothing to receive, power usage is highly reduced.

However, it could hurt performance in some cases, typically if we go to 
sleep just before the message arrives. This will highly depend on the 
parameters you give to the sleep mechanism.


At first, we can start with the following assumption : if the sleep takes 
T usec, then sleeping only after 10000xT of fruitless polling should slow 
down receives by a factor of less than 0.01 % (the added delay is at most 
T, over a wait that already lasted at least 10000xT).


However, other processes may suffer from you being late, and be delayed by 
T usec (which may represent more than 0.01% for them).


So, the goal of this mechanism is mainly to detect far-too-long waits and 
it should almost never kick in during normal MPI jobs. It could also 
trigger a warning message when starting to sleep, or at least a trace in 
the notifier.


Details of Implementation
=

Three parameters fully control the behaviour of this mechanism :
 * opal_progress_sleep_count : number of unsuccessful opal_progress() 
calls before we start the timer (to prevent latency impact). It defaults 
to -1, which completely deactivates the sleep (and is therefore equivalent 
to the former code). A value of 1000 can be thought of as a starting point 
to enable this mechanism.
 * opal_progress_sleep_trigger : time to wait before going to 
low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
 * opal_progress_sleep_duration : time we sleep at each further 
unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.
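
A minimal sketch of how these three parameters could drive the progress 
loop (names follow the RFC; this is illustrative C, not the actual patch 
from the repository above):

#include <time.h>
#include <unistd.h>

static int    sleep_count    = -1;    /* opal_progress_sleep_count  (-1 = disabled)    */
static int    sleep_trigger  = 600;   /* opal_progress_sleep_trigger, in seconds       */
static int    sleep_duration = 1000;  /* opal_progress_sleep_duration, in microseconds */

static long   idle_calls = 0;
static time_t idle_since = 0;

/* called at the end of each opal_progress() iteration */
static void maybe_sleep(int events_completed)
{
    if (events_completed > 0 || sleep_count < 0) {
        idle_calls = 0;
        idle_since = 0;               /* activity (or feature disabled): keep busy-polling */
        return;
    }
    if (++idle_calls < sleep_count) {
        return;                       /* not idle long enough to even start the timer */
    }
    if (0 == idle_since) {
        idle_since = time(NULL);      /* start counting idle time */
    } else if (time(NULL) - idle_since >= sleep_trigger) {
        usleep(sleep_duration);       /* low-pressure mode: show ~0% CPU in top */
    }
}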


The duration is big enough to make the process show 0% CPU in top, but low 
enough to preserve a good trigger/duration ratio.


The trigger is deliberately high to keep a good trigger/duration ratio. 
Indeed, to prevent delays from causing chain reactions, the trigger should 
be higher than duration * numprocs.


Possible Improvements & Pitfalls


* Trigger could be set automatically at max(trigger, duration * numprocs * 
2).


* poll_start and poll_count could be fields of the opal_condition_t 
struct.


* The sleep section may be extracted into a #define and reused in all the 
progress paths (I'm not sure my patch is right for progress threads, for 
example)


Re: [OMPI devel] Multi-rail on openib

2009-06-08 Thread NiftyOMPI Tom Mitchell
On 6/8/09, Sylvain Jeaugey  wrote:
> Hi Tom,
>
> Yes, there is a goal in mind, and it is definitely not performance: we are
> working on device failover, i.e. when a network adapter or switch fails,
> use the remaining one. We don't intend to improve performance with
> multi-rail (which, as you said, will not happen unless you have a DDR card
> with PCI Express 8x Gen2, very nice routing - and money to pay for the
> doubled network :)).

??? Dual rail does double the number of switch ports.
If you want to address switch failure, each rail must connect to
a different switch. If you do not want isolated fabrics,
you must have some additional ports on all switches to
connect the two fabrics, and enough of them to maintain sufficient
bandwidth and connectivity when a switch fails. Thus, you are doubling
the fabric unless I am missing something. Or is your second set
of switches so minimally connected that the second tree can
be installed with a small switch count?

What are the odds, when port 1 fails, that port 2 is going to
be live? Cable/connector errors would be the most likely
case where port 2 would still be live. In general, if port 1 fails
I would expect port 2 to have issues too.

>
> The goal here is to use port 1 of each card as a primary way of
> communication with a fat tree and port 2 as a failover solution with a
> very light network, just to avoid aborting the MPI app or at least reach a
> checkpoint.

Most of the IB protocols used by MPI target a LID.  There is no
existing notification path I know of that can replace LID-xyz with
LID-123.  The subnet manager might be able to do this, but that raises
security issues.

Interesting problem.

> Don't worry, another team is working on opensm, so that routing stays
> optimal.

Could be fun, but I would hope that this does not become an incompatible fork.


> Thanks for your warnings, however; it's true that a lot of people see these
> "double port IB cards" as "doubled performance".
>
> Sylvain
>
> On Fri, 5 Jun 2009, Nifty Tom Mitchell wrote:
>
>> On Fri, Jun 05, 2009 at 09:52:39AM -0400, Jeff Squyres wrote:
>>>
>>> See this FAQ entry for a description:
>>>
>>> http://www.open-mpi.org/faq/?category=openfabrics#ofa-port-wireup
>>>
>>> Right now, there's no way to force a particular connection pattern on
>>> the openib btl at run-time.  The startup sequence has gotten
>>> sufficiently complicated / muddied over the years that it would be quite
>>> difficult to do so.  Pasha is in the middle of revamping parts of the
>>> openib startup (see http://bitbucket.org/pasha/ompi-ofacm/); it *may* be
>>> desirable to fully clean up the full openib btl startup sequence when
>>> he's all finished.
>>>
>>>
>>> On Jun 5, 2009, at 9:48 AM, Mouhamed Gueye wrote:
>>>
 Hi all,

 I am working on multi-rail IB and I was wondering how connections are
 established between ports.  I have two hosts, each with 2 ports on the
 same IB card, connected to the same switch.

>>
>> Is there a goal in mind?
>>
>> In general multi-rail cards run into bandwidth and congestion issues
>> with the host bus.  If your card's system side interface cannot support
>> the bandwidth of twin IB links then it is possible that bandwidth would
>> be reduced by the interaction.
>>
>> If the host bus and memory system are fast enough, then
>> work with the vendor.
>>
>> In addition to system bandwidth the subnet manager may need to be enhanced
>> to be multi-port card aware.   Since IB fabric routes are static it is
>> possible
>> to route or use pairs of links in an identical enough way that there is
>> little bandwidth gain when multiple switches are involved.
>>
>> Your two-host case may be simple enough to explore
>> and/or generate illuminating or misleading results.
>> It is a good place to start.
>>
>> Start with a look at opensm and the fabric then watch how Open MPI
>> or your applications use the resulting LIDs.  If you are using IB directly
>> and not MPI then the list of protocol choices grows dramatically but still
>> centers on LIDs as assigned by the subnet manager (see opensm).
>>
>> How many CPU cores (ranks) are you working with?
>>
>> Do be specific about the IB hardware and associated firmware;
>> there are multiple choices out there and the vendor may be able to
>> help...
>>
>> --
>>  T o m  M i t c h e l l
>>  Found me a new hat, now what?
>>


-- 
NiftyOMPI
T o m   M i t c h e l l


Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-08 Thread Ralph Castain
I'm not entirely convinced this actually achieves your goals, but I  
can see some potential benefits. I'm also not sure that power  
consumption is that big of an issue that MPI needs to begin chasing  
"power saver" modes of operation, but that can be a separate debate  
some day.


I'm assuming you don't mean that you actually call "sleep()" as this  
would be very bad - I'm assuming you just change the opal_progress  
"tick" rate instead. True? If not, and you really call "sleep", then I  
would have to oppose adding this to the code base pending discussion  
with others who can corroborate that this won't cause problems.


Either way, I could live with this so long as it was done as a  
"configure-in" capability. Just having the params default to a value  
that causes the system to behave similarly to today isn't enough - we  
still wind up adding logic into a very critical timing loop for no  
reason. A simple configure option of --enable-mpi-progress-monitoring  
would be sufficient to protect the code.
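
In other words, something like the following (the preprocessor symbol and 
the function name are hypothetical, standing in for whatever the configure 
option would actually define and whatever the RFC's check is called):

#if OMPI_WANT_PROGRESS_MONITORING   /* set only when configured with
                                       --enable-mpi-progress-monitoring */
    low_pressure_check();           /* the new logic from the RFC */
#endif                              /* not configured in: not even a branch remains */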


HTH
Ralph


On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:

What : when nothing has been received for a very long time (e.g. 5  
minutes), stop busy-polling in opal_progress and switch to a  
usleep-based one.


Why : when we have long waits, and especially when an application is  
deadlocked, detecting it is not easy and a lot of power is wasted  
until the end of the time slice (if there is one).


Where : an example of how it could be implemented is available at 
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/

Principle
=

opal_progress() ensures the progression of MPI communication. The  
current algorithm is a loop calling progress on all registered  
components. If the program is blocked, the loop will busy-poll  
indefinitely.


Going to sleep after a certain amount of time with nothing received  
is interesting for two things :
- Administrators can easily detect whether a job is deadlocked: all  
the processes are in sleep(). Currently, all processors are using  
100% CPU and it is very hard to know if progression is still  
happening or not.

- When there is nothing to receive, power usage is highly reduced.

However, it could hurt performance in some cases, typically if we go  
to sleep just before the message arrives. This will highly depend on  
the parameters you give to the sleep mechanism.


At first, we can start with the following assumption : if the sleep  
takes T usec, then sleeping only after 10000xT of fruitless polling  
should slow down receives by a factor of less than 0.01 % (the added  
delay is at most T, over a wait that already lasted at least 10000xT).


However, other processes may suffer from you being late, and be  
delayed by T usec (which may represent more than 0.01% for them).


So, the goal of this mechanism is mainly to detect far-too-long  
waits and it should almost never kick in during normal MPI jobs. It  
could also trigger a warning message when starting to sleep, or at  
least a trace in the notifier.


Details of Implementation
=

Three parameters fully control the behaviour of this mechanism :
* opal_progress_sleep_count : number of unsuccessful opal_progress()  
calls before we start the timer (to prevent latency impact). It  
defaults to -1, which completely deactivates the sleep (and is  
therefore equivalent to the former code). A value of 1000 can be  
thought of as a starting point to enable this mechanism.
* opal_progress_sleep_trigger : time to wait before going to low- 
pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration : time we sleep at each further  
unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.


The duration is big enough to make the process show 0% CPU in top,  
but low enough to preserve a good trigger/duration ratio.


The trigger is deliberately high to keep a good trigger/duration ratio.  
Indeed, to prevent delays from causing chain reactions, the trigger  
should be higher than duration * numprocs.


Possible Improvements & Pitfalls


* Trigger could be set automatically at max(trigger, duration *  
numprocs * 2).


* poll_start and poll_count could be fields of the opal_condition_t  
struct.


* The sleep section may be extracted into a #define and reused in all  
the progress paths (I'm not sure my patch is right for progress  
threads, for example)
