Re: [OMPI devel] RFC: convert send to ssend

2009-08-25 Thread Patrick Geoffray

Ralph Castain wrote:
Not quite that simple, Patrick. Think of things like MPI_Sendrecv, where 
the "send" call is below that of the user's code.


You have a point, Ralph. Although that would only be 8 more lines to add 
to the user's MPI code to define an MPI_Sendrecv macro :-)
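
Something along these lines, a rough sketch of such a macro on top of 
MPI_Irecv/MPI_Ssend/MPI_Wait (no argument checking, placed after 
#include <mpi.h>):

/* Post the receive first so a pairwise exchange cannot deadlock, then
 * do a synchronous send.  Roughly the "8 lines" in question. */
#define MPI_Sendrecv(sbuf, scount, stype, dest, stag,                  \
                     rbuf, rcount, rtype, source, rtag, comm, status)  \
    do {                                                               \
        MPI_Request _sr_req;                                           \
        MPI_Irecv((rbuf), (rcount), (rtype), (source), (rtag),         \
                  (comm), &_sr_req);                                   \
        MPI_Ssend((sbuf), (scount), (stype), (dest), (stag), (comm));  \
        MPI_Wait(&_sr_req, (status));                                  \
    } while (0)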


Seriously, this particular proposal is not the most flagrant example of 
OpenMPI doing too much or going too far. I personally thought that the 
discussion about affinity was much more revealing in itself, especially 
the part about, in effect, replacing the OS scheduler.


Patrick


Re: [OMPI devel] RFC: convert send to ssend

2009-08-24 Thread Patrick Geoffray

George Bosilca wrote:
I know the approach "because we can". We develop an MPI library, and we 
should keep it that way. Our main focus should not diverge to provide 


I would join George in the minority on this one. "Because we can" is a 
slippery slope; there is value in keeping things simple, with fewer 
knobs, bells, and whistles.


On this particular whistle, the user could add one line to his MPI code 
to define MPI_Send to MPI_Ssend and be done with it. If he does not have 
the code in the first place, there is nothing he can do about it anyway. 
So, it's just a matter of convenience for a lazy user.
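
That one line being, roughly (a sketch, placed after #include <mpi.h>):

/* force every standard send in the user's code to be a synchronous send */
#define MPI_Send MPI_Ssend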


Patrick


Re: [OMPI devel] Heads up on new feature to 1.3.4

2009-08-17 Thread Patrick Geoffray

Jeff,



Jeff Squyres wrote:
ignored it whenever presenting competitive data.  The 1,000,000th time I 
saw this, I gave up arguing that our competitors were not being fair and 
simply changed our defaults to always leave memory pinned for 
OpenFabrics-based networks.


Instead, you should have told them that caching memory registrations is 
unsafe and asked them why they don't care that their customers may not 
get the right answer. And then you could follow up by asking if they 
actually have a way to check that there is no data corruption. It's not 
really FUD, it's tit for tat :-)


2. Even if you tag someone in public for not being fair, they always say 
the same thing, "Oh sorry, my mistake" (regardless of whether they 
actually forgot or did it intentionally).  I told several competitors 
*many times* that they had to use leave_pinned, but in all public 
comparison numbers, they never did.  Hence, they always looked better.


Looked better on what, micro-benchmarks? The same micro-benchmarks that 
have already been manipulated to death, like OSU using a stream-based 
bandwidth test to hide the start-up overhead? If the option improves 
real applications at large, then it should be on by default and there is 
no debate (users should never have to know about knobs). If it is only 
for micro-benchmarks, stand your ground and do the right thing. It does 
not do the community any good if MPI implementations are tuned for a 
broken micro-benchmark penis contest. If you want to play that game, at 
least write your own micro-benchmarks.


Believe me, I know what it is like to hear technical atrocities from 
these marketing idiots. There is nothing you can do; they are paid to 
talk and you are not. In the end, HPC gets what HPC deserves, and people 
should do their homework.


For applications at large, performance gains due to core binding are 
suspect. Memory binding may have more merit, but the OS should already 
be able to do a good job with NUMA allocation and page migration.


- The Linux scheduler does not / cannot optimize well for many HPC apps; 
binding definitely helps in many scenarios (not just benchmarks).


Then fix the Linux scheduler. Only the OS scheduler can do a meaningful 
resource allocation, because it sees everything and you don't.




Patrick


Re: [OMPI devel] SM init failures

2009-03-30 Thread Patrick Geoffray

Jeff Squyres wrote:
Why not?  The "owning" process can do the touch; then it'll be 
affinity'ed properly.  Right?


Yes, that's what I meant by forcing allocation. From the thread, it 
looked like nobody touched the pages of the mapped file. If it's already 
done, there is no need to write to the whole file.
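
For reference, the kind of touch I mean: a minimal sketch where the 
owning process mmap()s the backing file and writes one byte per page so 
that first-touch places the pages locally (file name and error handling 
are simplified):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void *map_and_touch(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (base == MAP_FAILED)
        return NULL;

    /* one write per page is enough: first touch allocates the page on
     * the NUMA node of the touching process */
    long page = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < len; off += (size_t)page)
        base[off] = 0;

    return base;
}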


Patrick


Re: [OMPI devel] SM init failures

2009-03-30 Thread Patrick Geoffray

George Bosilca wrote:
performance hit on the startup time. And second, we will have to find a 
pretty smart way to do this or we will completely break the memory 
affinity stuff.


I didn't look at the code, but I sure hope that the SM init code does 
touch each page to force allocation; otherwise there is no memory 
affinity at all...


Patrick


Re: [OMPI devel] 1.3.1 fails with GM

2009-03-20 Thread Patrick Geoffray

Hi Christian,

Christian Siebert wrote:
I just gave the new release 1.3.1 a go. While Ethernet and InfiniBand 
seem to work properly, I noticed that Myrinet/GM compiles fine but gives 
a segmentation violation in the first attempt to communicate (MPI_Send 
in a simple "hello world" application). Is GM not supported anymore or 
is it just too old so that nobody tested it?


GM itself is supported and maintenance releases are tested (there are no 
more development releases), but Open-MPI/GM is not tested at the moment. 
GM does not run on Myri-10G NICs, so we have to use a smaller pool of 
machines with Myrinet 2000 NICs in them. Human usage and MTT runs for 
Open-MPI/MX have priority, and MTT for Open-MPI/GM has not run for a 
while :-(


We will try to resume MTT testing with Open-MPI/GM when we have the 
resources. In the meantime, we'll look into the segfault.


Patrick


Re: [OMPI devel] RFC: sm Latency

2009-01-21 Thread Patrick Geoffray

Eugene Loh wrote:
Possibly, you meant to ask how one does directed polling with a wildcard 
source MPI_ANY_SOURCE.  If that was your question, the answer is we 
punt.  We report failure to the ULP, which reverts to the standard code 
path.


Sorry, I meant ANY_SOURCE. If you poll only the queue that corresponds 
to a posted receive, you only optimize micro-benchmarks, until they 
start using ANY_SOURCE. So, is recvi() a one-time shot? I.e., do you 
poll the right queue only once and, if that fails, fall back on polling 
all queues? If yes, then it's unobtrusive, but I don't think it would 
help much. If you poll the right queue many times, then you have to 
decide when to fall back on polling all queues, and that's not trivial.
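
To make sure we are talking about the same thing, a hypothetical sketch 
of the one-shot variant (all names are invented; src < 0 stands for 
ANY_SOURCE):

typedef struct fragment fragment_t;

fragment_t *sm_recvi(int src, int nprocs, fragment_t *(*fifo_poll)(int peer))
{
    fragment_t *frag;

    if (src >= 0) {
        frag = fifo_poll(src);          /* directed poll, tried only once */
        if (frag != NULL)
            return frag;
    }
    /* fallback: scan all incoming queues so no sender gets starved */
    for (int peer = 0; peer < nprocs; peer++) {
        frag = fifo_poll(peer);
        if (frag != NULL)
            return frag;
    }
    return NULL;
}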



How do you ensure you check all incoming queues from time to time to avoid 
triggering flow control (especially if the queues are small for scaling)?
There are a variety of choices here.  Further, I'm afraid we ultimately 
have to expose some of those choices to the user (MCA parameters or 
something).


In the vast majority of cases, users don't know how to turn the knobs. 
The problem is that as the local np goes up, queue sizes go down fast 
(square root), and you will have to poll all the queues more often. 
Using more memory for the queues just pushes the scalability wall a 
little bit further.


congestion.  What if then the user code posts a rather specific request 
(receive a message with a particular tag on a particular communicator 
from a particular source) and with high urgency (blocking request... "I 
ain't going anywhere until you give me what I'm asking for").  A good 
servant would drop whatever else s/he is doing to oblige the boss.


If you poll only one queue, then stuff can pile up on another and a 
sender is now blocked. At best, you have a synchronization point. At 
worst, a deadlock.


So, let's say there's a standard MPI_Recv.  Let's say there's also some 
congestion starting to build.  What should the MPI implementation do?  


The MPI implementation cannot trust the user/app to indicate where the 
messages will come from. So, if you have N incoming queues, you need to 
poll them all eventually. If you do, polling time increases linearly. If 
you try to limit the polling space with whatever heuristic (like the 
queue corresponding to the current blocking receive), then you take the 
risk of not consuming another queue fast enough. And usually, the 
heuristics quickly fall apart (ANY_SOURCE, multiple asynchronous 
receives, etc.).


Really, only single-queue solves that.

Yes, and you could toss the receive-side optimizations as well.  So, one 
could say, "Our np=2 latency remains 2x slower than Scali's, but at 
least we no longer have that hideous scaling with large np."  Maybe 
that's where we want to end up.


I think all the optimizations except recvi() are fine and worth using. 
I am just saying that the recvi() optimization is dubious as it is, and 
that the single queue is potentially the bigger win on the recv side: it 
could still be fast (a spinlock or an atomic to manage the shared 
receive queue), keeping the np=2 latency low, and it would scale well 
with large np. No tuning needed, no special cases, smaller memory 
footprint.
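
A minimal sketch of what I mean by a single receive queue managed with 
an atomic (C11 atomics, and pointers instead of the shared-memory 
offsets a real implementation would use):

#include <stdatomic.h>
#include <stddef.h>

typedef struct envelope {
    struct envelope *next;
    int src, tag;                   /* match information */
} envelope_t;

typedef struct {
    _Atomic(envelope_t *) head;     /* written by all local senders */
} recv_queue_t;

/* sender side: one atomic operation, no per-peer FIFO needed */
void rq_push(recv_queue_t *q, envelope_t *e)
{
    envelope_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        e->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->head, &old, e,
                 memory_order_release, memory_order_relaxed));
}

/* receiver side: grab everything that has arrived in one shot; the list
 * comes back newest-first, so walk or reverse it to match in order */
envelope_t *rq_drain(recv_queue_t *q)
{
    return atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
}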


I will leave it at that, just some input.

Patrick


Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Patrick Geoffray
Eugene,

All my remarks are related to the receive side. I think the send side
optimizations are fine, but don't take my word for it.

Eugene Loh wrote:
> To recap:
> 1) The work is already done.

How do you do "directed polling" with ANY_TAG? How do you ensure you
check all incoming queues from time to time to avoid triggering flow
control (especially if the queues are small for scaling)? What about the
one-sided operations that Brian mentioned, where there is no
corresponding receive to tell you which queue to poll?

If you want to handle all the constraints, a single-queue model is much
less work in the end, IMHO.

> 2) The single-queue model addresses only one of the RFC's issues.

The single-queue model addresses not only the latency overhead when
scaling, but also the exploding memory footprint. In many ways, these
problems are the same ones that plagued the RDMA QP model, and the only
solution there was shared receive queues.

From experience, the linear overhead of polling N queues very quickly
becomes greater than all the optimizations you can do on the send side.

> 3) I'm a fan of the single-queue model, but it's just a separate discussion.

No problem. You are the one doing the real work here, the rest is
armchair quarterbacking :-)

Patrick


Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Patrick Geoffray
Hi Eugene,

Eugene Loh wrote:
>> replace the fifo’s with a single link list per process in shared 
>> memory, with senders to this process adding match envelopes 
>> atomically, with each process reading its own link list (multiple 


> *) Doesn't strike me as a "simple" change.

Actually, it's much simpler than trying to optimize/scale the N^2
implementation, IMHO.

> *) Not sure this addresses all-to-all well.  E.g., let's say you post a 
> receive for a particular source.  Do you then wade through a long FIFO 
> to look for your match?

The tradeoff is between demultiplexing by the sender, which costs time
and space, or by the receiver, which costs an atomic increment. ANY_TAG
forces you to demultiplex on the receive side anyway. Regarding
all-to-all, it won't be more expensive if the receives are pre-posted,
and they should be.

> What the RFC talks about is not the last SM development we'll ever 
> need.  It's only supposed to be one step forward from where we are 
> today.  The "single queue per receiver" approach has many advantages, 
> but I think it's a different topic.

But is this intermediate step worth it, or should we (well, you :-) ) go
directly for the single-queue model?

Patrick


Re: [OMPI devel] 1.3 PML default choice

2009-01-13 Thread Patrick Geoffray

Jeff Squyres wrote:
Gaah!  I specifically asked Patrick and George about this and they said 
that the README text was fine.  Grr...


When I looked at the time, I vaguely remember that _both_ PMLs were 
initialized but CM was eventually used because it was the last one. It 
looked broken, but it worked in the end (the MTL was used via the CM 
PML). I don't know if that behavior has changed since.


Patrick


Re: [OMPI devel] shared-memory allocations

2008-12-13 Thread Patrick Geoffray

Richard Graham wrote:
Yes - it is polling volatile memory, so has to load from memory on every 
read.


Actually, it will poll in cache, and only load from memory when the 
cache coherency protocol invalidates the cache line. The volatile 
semantics only prevent compiler optimizations.
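
A tiny illustration of the point: the reader spins on a flag, volatile 
only keeps the compiler from caching the load in a register, and the 
hardware keeps the traffic in the local cache until the writer's store 
invalidates the line.

#include <stdint.h>

void wait_for_flag(volatile uint32_t *flag, uint32_t expected)
{
    while (*flag != expected)
        ;   /* re-reads the cached line; no memory traffic until the
             * writer invalidates it */
}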


It does not matter much on NUMA systems where the pages live (closer to 
the reader or to the writer), as long as the choice is consistent and 
the pages are equally distributed among all sockets. Cache prefetching 
is slightly more efficient on the local socket, so closer to the reader 
may be a bit better.


Patrick


Re: [OMPI devel] More README questions

2008-11-15 Thread Patrick Geoffray

Jeff Squyres wrote:
- There's a big chunk of text about MX that I have no idea if it's still 
up-to-date / correct or not.


Looks good to me.

Patrick


[OMPI devel] mallopt and registration cache

2008-10-31 Thread Patrick Geoffray

Gentlemen,

I have been looking at a data corruption issue with the MX BTL or MTL 
in the 1.3 branch when trying to use the MX registration cache. The 
related ticket is #1525, opened by Tim.


In 1.3, mallopt() is used to never trim memory, replacing the malloc 
overload previously done by ptmalloc2. MX provides its own malloc hooks, 
but they can't work when the lib is dlopen()ed, so MX has to rely on 
OMPI to make the registration cache safe. Apparently, mallopt() is only 
called in the initialization of the mpool component. However, the MX BTL 
and MTL do not use the mpool. There is a mallopt memory module in opal, 
but it assumes that the mpool is used.
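
For reference, the mallopt() calls in question amount to something like 
this (a sketch of the kind of calls done at mpool init, not the actual 
code):

#include <malloc.h>

static void disable_memory_release(void)
{
    mallopt(M_TRIM_THRESHOLD, -1);  /* never trim the top of the heap      */
    mallopt(M_MMAP_MAX, 0);         /* never use mmap() for large allocs,
                                       so free() never unmaps pinned pages */
}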


What is the best way to fix this issue?
* move the mallopt calls out of the mpool init.
* use a fake mpool in the MX btl and mtl.
* duplicate the mallopt calls directly in the MX btl and mtl.

I got lost looking at the mpool code, so I may be completely wrong here.

Patrick


Re: [OMPI devel] RFC: make mpi_leave_pinned=1 the default

2008-07-06 Thread Patrick Geoffray

Jeff Squyres wrote:
WHAT: make mpi_leave_pinned=1 by default when a BTL is used that would 
benefit from it (when possible; 0 when not, obviously)



Comments?


The probable reason the registration cache (aka leave_pinned) is 
disabled by default is that it may be unsafe. Even if you use mallopt to 
never return memory to the OS, how do you guarantee that:

* malloc always enforces the mallopt *hints*;
* pinned memory can safely be fork()ed (system(), for example);
* pinned memory can safely be munmap()ed (direct I/O or file mapping, 
for example).


If you can't, one solution may be to write a simple MPI code that 
corrupts MVAPICH and make some noise about it.
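
Something along these lines, for example (buffer size and loop count are 
arbitrary, run on 2 ranks; whether malloc actually munmap()s the buffer 
depends on the allocator thresholds):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LEN (8 * 1024 * 1024)   /* large enough to be mmap()ed by malloc */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int iter = 0; iter < 2; iter++) {
        char *buf = malloc(LEN);
        if (rank == 0) {
            memset(buf, 'a' + iter, LEN);
            MPI_Send(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            memset(buf, 0, LEN);
            MPI_Recv(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (buf[0] != 'a' + iter)
                printf("corruption on iteration %d\n", iter);
        }
        /* free() may give the pages back to the kernel; the same virtual
         * range can come back with different physical pages next time,
         * which an unsafe registration cache will not notice */
        free(buf);
    }

    MPI_Finalize();
    return 0;
}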


My 2 cents.

Patrick


Re: [OMPI devel] Notes from mem hooks call today

2008-05-29 Thread Patrick Geoffray

Hi Roland,

Roland Dreier wrote:

Stick in a separate library then?

I don't think we want the complexity in the kernel -- I personally would
argue against merging it upstream; and given that the userspace solution
is actually faster, it becomes pretty hard to justify.


Memory registration has always been expensive, so it's not in the 
critical path (it is not used for small messages, and a system call 
overhead is nothing for large messages in MPI). Sure, you can have the 
kernel notify user space through mapped flags, but it's a bit ugly IMHO.


There are cases where the basic registration already uses the same 
infrastructure as a regcache. For example, on Solaris, Mac OS X and 
Linux/PowerPC, you really want to register segments as large as possible 
to limit the IOMMU overhead. You also don't want to register the same 
page multiple times with overlapping registrations, because the IOMMU 
space is limited. In short, you already have a registration cache in the 
driver.


However, if user space is expected to call register/deregister often, 
then I agree that the cache had better be in user space.


The big picture is that it's not really important where the regcache 
lives, as long as it's out of MPI.


Patrick


Re: [OMPI devel] Memory hooks stuff

2008-05-24 Thread Patrick Geoffray

Hi Jeff,

Jeff Squyres wrote:
the topic of the memory hooks came up again.  Brian was wondering if  
we should [finally] revisit this topic -- there's a few things that  
could be done to make life "better".  Two things jump to mind:


- using mallopt on Linux


What about using the (probably) upcoming MMU notifiers and avoiding ugly 
hacks in user space?



- doing *something* on Solaris


Implementing the same kind of notifiers in Solaris?

Patrick


Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-22 Thread Patrick Geoffray

Brian W. Barrett wrote:
With MX, it's one initialization call (mx_init), and it's not clear from 
the errors it can return that you can differentiate between the two cases.


If you run mx_init() on a machine without the MX driver loaded, or with 
no NIC detected by the driver, you get a specific error code (MX_NO_DEV) 
and the default error handler prints something like:


MX:asterix:mx_init:querying driver:error 5(errno=2):No MX device entry 
in /dev.


You can overload the default error handler to not see the message.

Patrick


Re: [OMPI devel] patch for building gm btl

2008-01-03 Thread Patrick Geoffray

Paul,

Paul H. Hargrove wrote:
discuss what tests we will run, but it will probably be a very minimal 
set.  Once we both have MTT setup and running GM tests, we should 
compare configs to avoid overlap (and thus increase coverage).


That would be great. I have only one 32-node 2G cluster that I can use 
full-time for MTT testing of GM, MX, OpenMPI, MPICH{1,2}, HP-MPI, and 
many more. One thing I quickly learned with MTT is that there are only 
24 hours in a day :-)


Patrick


Re: [OMPI devel] patch for building gm btl

2008-01-03 Thread Patrick Geoffray

Hi Paul,

Paul H. Hargrove wrote:
The fact that this has gone unfixed for 2 months suggests to me that 
nobody is building the GM BTL.  So, how would I go about checking ...



a) ...if there exists any periodic build of the GM BTL via MTT?


We are deploying MTT on all our clusters. Right now, we use our own MTT 
server, but we will report a subset of the tests to the OpenMPI server 
once everything is working.



c) ...which GM library versions such builds, if any, compile against


There are no GM tests currently in our still-evolving MTT setup. Once we 
have a working setup, we will run a single Pallas test on 32 nodes with 
GM-2.1.28 and two 2G NICs per node (single and dual port). There is no 
active development on GM, just kernel updates, so the GM version does 
not matter much.


Patrick


Re: [OMPI devel] SDP support for OPEN-MPI

2008-01-01 Thread Patrick Geoffray

Lenny Verkhovsky wrote:

We would like to add SDP support for OPENMPI.



SDP can be used to accelerate job start ( oob over sdp ) and IPoIB
performance.


I fail to see the reason to pollute the TCP btl with IB-specific SDP stuff.

For the oob, this is arguable, but doesn't SDP allow for *transparent* 
socket replacement at runtime? In that case, why not use that mechanism 
and keep the code clean?


Patrick


Re: [OMPI devel] Dynamically Turning On and Off Memory Manager of Open MPI at Runtime??

2007-12-10 Thread Patrick Geoffray

Hi Peter,

Peter Wong wrote:

Open MPI defines its own malloc (by default), so malloc of glibc
is not called.

But, without calling malloc of glibc, the allocator of libhugetlbfs
to back text and dynamic data by large pages, e.g., 16MB pages on
POWER systems, is not used.


You could modify the ptmalloc2 embedded in OpenMPI to allocate huge 
pages directly. It would be a nice feature.
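
A rough sketch of what that could look like: backing an arena with a 
file on a hugetlbfs mount instead of anonymous memory (the /mnt/huge 
mount point and the 16MB page size are only examples):

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (16UL * 1024 * 1024)

static void *huge_alloc(size_t len)
{
    char path[] = "/mnt/huge/ompi_heap_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);                   /* pages are released when unmapped */

    /* hugetlbfs mappings must cover whole huge pages */
    len = (len + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    return (p == MAP_FAILED) ? NULL : p;
}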


Patrick


Re: [OMPI devel] collective problems

2007-11-08 Thread Patrick Geoffray

Hi Gleb,

Gleb Natapov wrote:

In the case of TCP, kernel is kind enough to progress message for you,
but only if there was enough space in a kernel internal buffers. If there
was no place there, TCP BTL will also buffer messages in userspace and
will, eventually, have the same problem.


Occasional buffering to hide a flow-control issue is fine, assuming that 
there is a mechanism to flush the buffer (see below). However, you 
cannot buffer everything, and it is just as fine to expose the back 
pressure when the buffer space is exhausted, to show the application 
that there is a sustained problem. In this case, it is reasonable to 
block the application (i.e. the MPI request) while you cannot buffer the 
outgoing data.


The real problem is the progression of the already-buffered outgoing 
data, not the buffering itself.


Here, the proposal is to allow the BTL to buffer, but requires the PML 
to handle progress. That's broken, IMHO.



To progress such outstanding messages additional thread is needed in
userspace. Is this what MX does?


MX uses a user-level thread, but it's mainly for progressing the 
higher-level protocol on the receive side. On the send side, for the 
low-level protocol, it is easier to ask the driver to either wake you up 
when the sending resource is available again (blocking on a CQ for IB) 
or take care of the sending itself.



My overall problem with this proposal is a race to the bottom, driven by 
the least capable BTL. The PML already imposes a pipelining protocol for 
large messages (with a few knobs, but still) when most protocols in the 
other BTLs already have their own. Now it's flow-control progression 
(not MPI progression).


Can each BTL implement what is needed for a particular back-end instead 
of bloating the upper layer?



Patrick


Re: [OMPI devel] collective problems

2007-11-07 Thread Patrick Geoffray

Jeff Squyres wrote:

This is not a problem in the current code base.

Remember that this is all in the context of Galen's proposal for  
btl_send() to be able to return NOT_ON_WIRE -- meaning that the send  
was successful, but it has not yet been sent (e.g., openib BTL  
buffered it because it ran out of credits).


Sorry if I am missing something obvious, but why does the PML have to be 
aware of the flow-control situation of the BTL? If the BTL cannot send 
something right away, for any reason, it should be the responsibility of 
the BTL to buffer it and to make progress on it later.
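
In other words, something like this inside the BTL (names are invented, 
single-threaded sketch): queue the fragment when there are no credits, 
and flush the queue from the BTL's own progress function.

#include <stddef.h>

typedef struct frag {
    struct frag *next;
} frag_t;

typedef struct {
    int     credits;        /* send credits left on the wire        */
    frag_t *pending_head;   /* fragments waiting for credits (FIFO) */
    frag_t *pending_tail;
} btl_endpoint_t;

/* transport-specific send, stubbed out for this sketch */
static int wire_send(btl_endpoint_t *ep, frag_t *f)
{
    (void)ep; (void)f;
    return 0;
}

int btl_send(btl_endpoint_t *ep, frag_t *f)
{
    if (ep->credits > 0) {
        ep->credits--;
        return wire_send(ep, f);
    }
    /* no credits: buffer inside the BTL; success as far as the PML knows */
    f->next = NULL;
    if (ep->pending_tail != NULL)
        ep->pending_tail->next = f;
    else
        ep->pending_head = f;
    ep->pending_tail = f;
    return 0;
}

void btl_progress(btl_endpoint_t *ep)
{
    /* called from the progress loop: drain the backlog as credits return */
    while (ep->credits > 0 && ep->pending_head != NULL) {
        frag_t *f = ep->pending_head;
        ep->pending_head = f->next;
        if (ep->pending_head == NULL)
            ep->pending_tail = NULL;
        ep->credits--;
        wire_send(ep, f);
    }
}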


Patrick


Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]

2007-10-23 Thread Patrick Geoffray

Hi Bogdan,

Bogdan Costescu wrote:
I made some progress: if I configure with "--without-memory-manager" 
(along with all other options that I mentioned before), then it works. 
This was inspired by the fact that the segmentation fault occured in 
ptmalloc2. I have previously tried to remove the MX support without 
any effect; with ptmalloc2 out of the picture I have had test runs 
over MX and TCP without problems.


We have had portability problems using ptmalloc2 in MPICH-GM, especially 
relative to threads. In MX, we chose to use dlmalloc instead. It is not 
as optimized and its thread safety has a coarser grain, but it is much 
more portable.


Disabling the memory manager in OpenMPI is not a bad thing for MX, as 
its own dlmalloc-based registration cache will operate transparently 
with MX_RCACHE=1 (default).


Patrick


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r15041

2007-06-13 Thread Patrick Geoffray

Jeff Squyres wrote:
Let's take a step back and see exactly what we *want*.  Then we can  
talk about how to have an interface for it.


I must be missing something, but why is the bandwidth/latency passed in 
by the user (by whatever means)? Would it be easier to get these values 
automagically, by probing the hardware or by having the BTL make an 
educated guess? You can figure out the link rate of an eth device at 
runtime, for example. You would still want a way to force any value, but 
the default should be invisible, no?
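
For example, on Linux the link speed of an Ethernet device can be read 
from sysfs (the device name is just an example):

#include <stdio.h>

/* returns the link speed in Mbit/s, or -1 if unknown */
static long eth_link_speed_mbps(const char *ifname)
{
    char path[128];
    long speed = -1;

    snprintf(path, sizeof(path), "/sys/class/net/%s/speed", ifname);
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%ld", &speed) != 1)
            speed = -1;
        fclose(f);
    }
    return speed;   /* e.g. eth_link_speed_mbps("eth0") -> 1000 */
}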


Patrick


Re: [OMPI devel] Best bw/lat performance for microbenchmark/debug utility

2006-06-29 Thread Patrick Geoffray

Jeff Squyres (jsquyres) wrote:


Josh Aune wrote:

I am writing up some interconnect/network debugging software that is
centered around ompi.  What is the best set of functions to 



I was assuming that you would be testing latency/bandwidth, but Patrick
is correct in stating that there are many more things to test than just
those two metrics.


There are a lot of metrics, but most of them require a deep 
understanding of MPI semantics and implementation details to make sense 
of. The art of micro-benchmarking is choosing the metrics and explaining 
why they matter. It's obvious for latency/bandwidth, a bit less so for 
unexpected-message and host overhead, and definitely hard for overlap 
and progress. And that's just for point-to-point.


To avoid reinventing the wheel, I would suggest that Josh develop a 
micro-benchmark test suite to compute a detailed set of LogP-derived 
parameters, i.e., for all message sizes:

* Send overhead (o.s) and recv overhead (o.r). These overheads will 
likely be either constant or linear over various message-size ranges; it 
would be great to compute the ranges automatically. Memory registration 
cost is accounted for here, so it would be useful to measure with and 
without the registration cache as well.
* Latency (L).
* Send gap (g.s) and recv gap (g.r). For large messages, they will 
likely be identical and represent the link bandwidth. For smaller 
messages, the send gap is the gap of a fan-out pattern (1->N) and the 
recv gap is the gap of a flat gather (N->1). It's important not to let 
the send or recv overhead hide the send or recv gap; using several 
processes could divide the send/recv overhead.
* Unexpected overhead (o.u). Overhead added to (o.r) when the message is 
not immediately matched.
* Overlap availability (a), i.e. the percentage of communication time 
that you can overlap with real host computation.


From these parameters, you can derive pretty much all characteristics 
of an interconnect without contention.
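
As a small illustration of measuring one of these parameters, here is a 
sketch (message size and iteration count are arbitrary, run on 2 ranks) 
that takes the send overhead o.s as the time rank 0 spends inside 
MPI_Isend:

#include <mpi.h>
#include <stdio.h>

#define NITER 1000
#define LEN   1024

int main(int argc, char **argv)
{
    char buf[LEN] = { 0 };
    MPI_Request req;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double t = 0.0;
        for (int i = 0; i < NITER; i++) {
            double t0 = MPI_Wtime();
            MPI_Isend(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
            t += MPI_Wtime() - t0;          /* time spent inside the call */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        printf("o.s(%d bytes) ~ %g us\n", LEN, 1e6 * t / NITER);
    } else if (rank == 1) {
        for (int i = 0; i < NITER; i++)
            MPI_Recv(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}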


Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com


Re: [OMPI devel] Best bw/lat performance for microbenchmark/debug utility

2006-06-28 Thread Patrick Geoffray

Josh Aune wrote:

I am writing up some interconnect/network debugging software that is
centered around ompi.  What is the best set of functions to use to get
the best bandwidth and latency numbers for openmpi and why?  I've been


You mean MPI functions or internal ompi functions? For MPI functions, it 
depends on what you are looking for. Send/recv is fine, but it does not 
show the overlap capability; you would need to do something smarter with 
Isend/Irecv/Wait for that (Sandia has a nice benchmark that they should 
release soon). You may also want to measure the penalty for unexpected 
messages, the host CPU overhead and the ability to progress.


All of these metrics are measured by existing benchmarks; do you want to 
write one that covers everything, or something like IMB?


Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com