Re: [OMPI devel] RFC: convert send to ssend
Ralph Castain wrote: Not quite that simple, Patrick. Think of things like MPI_Sendrecv, where the "send" call is below that of the user's code. You have a point, Ralph. Although, that would be 8 more lines to add to the user MPI code to define an MPI_Sendrecv macro :-) Seriously, this particular proposal is not the most flaming example of OpenMPI doing too much or going too far. I personally thought that the discussion about affinity was much more revealing in itself, like the part about in effect replacing the OS scheduler. Patrick
Re: [OMPI devel] RFC: convert send to ssend
George Bosilca wrote: I know the approach "because we can". We develop an MPI library, and we should keep it that way. Our main focus should not diverge to provide I would join George in the minority on this one. "Because we can" is a slippery slope, there is value in keeping things simple, having fewer knobs and bells and whistles. On this particular whistle, the user could add one line to his MPI code to define send to ssend and be done with it. If he does not have the code in the first place, there is nothing he can do about it anyway. So, it's just a matter of convenience for a lazy user. Patrick
Re: [OMPI devel] Heads up on new feature to 1.3.4
Jeff, Jeff Squyres wrote: ignored it whenever presenting competitive data. The 1,000,000th time I saw this, I gave up arguing that our competitors were not being fair and simply changed our defaults to always leave memory pinned for OpenFabrics-based networks. Instead, you should have told them that caching memory registration is unsafe and asked them why they don't care if their customers don't get the right answer. And then you would follow up by asking if they actually have a way to check that there is no data corruption. It's not really FUD, it's tit for tat :-) 2. Even if you tag someone in public for not being fair, they always say the same thing, "Oh sorry, my mistake" (regardless of whether they actually forgot or did it intentionally). I told several competitors *many times* that they had to use leave_pinned, but in all public comparison numbers, they never did. Hence, they always looked better. Looked better on what, micro-benchmarks ? The same micro-benchmarks that have already been manipulated to death, like OSU using a stream-based bandwidth test to hide the start-up overhead ? If the option improves real applications at large, then it should be on by default and there is no debate (users should never have to know about knobs). If it is only for micro-benchmarks, stand your ground and do the right thing. It does not do the community any good if MPI implementations are tuned for a broken micro-benchmark penis contest. If you want to play that game, at least make your own micro-benchmarks. Believe me, I know what it is to hear technical atrocities from these marketing idiots. There is nothing you can do, they are paid to talk and you are not. In the end, HPC gets what HPC deserves, people should do their homework. For applications at large, performance gains due to core-binding are suspect. Memory-binding may have more spine, but the OS should already be able to do a good job with NUMA allocation and page migration. 
- The Linux scheduler does not/cannot optimize well for many HPC apps; binding definitely helps in many scenarios (not just benchmarks). Then fix the Linux scheduler. Only the OS scheduler can do a meaningful resource allocation, because it sees everything and you don't. Patrick
Re: [OMPI devel] SM init failures
Jeff Squyres wrote: Why not? The "owning" process can do the touch; then it'll be affinity'ed properly. Right? Yes, that's what I meant by forcing allocation. From the thread, it looked like nobody touched the pages of the mapped file. If it's already done, no need to write in the whole file. Patrick
Re: [OMPI devel] SM init failures
George Bosilca wrote: performance hit on the startup time. And second, we will have to find a pretty smart way to do this or we will completely break the memory affinity stuff. I didn't look at the code, but I sure hope that the SM init code does touch each page to force allocation, otherwise there is no memory affinity stuff at all... Patrick
Re: [OMPI devel] 1.3.1 fails with GM
Hi Christian, Christian Siebert wrote: I just gave the new release 1.3.1 a go. While Ethernet and InfiniBand seem to work properly, I noticed that Myrinet/GM compiles fine but gives a segmentation violation in the first attempt to communicate (MPI_Send in a simple "hello world" application). Is GM not supported anymore or is it just too old so that nobody tested it? GM itself is supported and maintenance releases are tested (no more development releases), but Open-MPI/GM is not tested at the moment. GM does not run on Myri-10G NICs, so we have to use a smaller pool of machines with Myrinet 2000 NICs in them. Human usage and MTT runs for Open-MPI/MX have priority and MTT for Open-MPI/GM has not run for a while :-( We will try to resume MTT testing with Open-MPI/GM when we have the resources. In the meantime, we'll look into the segfault. Patrick
Re: [OMPI devel] RFC: sm Latency
Eugene Loh wrote: Possibly, you meant to ask how one does directed polling with a wildcard source MPI_ANY_SOURCE. If that was your question, the answer is we punt. We report failure to the ULP, which reverts to the standard code path. Sorry, I meant ANY_SOURCE. If you poll only the queue that corresponds to a posted receive, you only optimize micro-benchmarks, until they start using ANY_SOURCE. So, is recvi() a one-time shot ? Ie, do you poll the right queue only once and if it fails then you fall back on polling all queues ? If yes, then it's unobtrusive but I don't think it would help much. If you poll the right queue many times, then you have to decide when to fall back on polling all queues, and it's not trivial. How do you ensure you check all incoming queues from time to time to prevent flow control (especially if the queues are small for scaling) ? There are a variety of choices here. Further, I'm afraid we ultimately have to expose some of those choices to the user (MCA parameters or something). In the vast majority of cases, users don't know how to turn the knobs. The problem is that with local np going up, queue sizes will go down fast (square root), and you will have to poll all queues more often. Using more memory for queues just pushes the scalability wall a little bit further. congestion. What if then the user code posts a rather specific request (receive a message with a particular tag on a particular communicator from a particular source) and with high urgency (blocking request... "I ain't going anywhere until you give me what I'm asking for"). A good servant would drop whatever else s/he is doing to oblige the boss. If you poll only one queue, then stuff can pile up on another and a sender is now blocked. At best, you have a synchronization point. At worst, a deadlock. So, let's say there's a standard MPI_Recv. Let's say there's also some congestion starting to build. What should the MPI implementation do? 
The MPI implementation cannot trust the user/app to indicate where the messages will come from. So, if you have N incoming queues, you need to poll them all eventually. If you do, polling time increases linearly. If you try to limit the polling space with whatever heuristic (like the queue corresponding to the current blocking receive), then you take the risk of not consuming another queue fast enough. And usually, the heuristics quickly fall apart (ANY_SOURCE, multiple asynchronous receives, etc). Really, only single-queue solves that. Yes, and you could toss the receive-side optimizations as well. So, one could say, "Our np=2 latency remains 2x slower than Scali's, but at least we no longer have that hideous scaling with large np." Maybe that's where we want to end up. I think all optimizations except recvi() are fine and worth using. I am just saying that the recvi() optimization is dubious as it is, and the single-queue is potentially a larger low-hanging fruit on the recv side: it could still be fast (spinlock or atomic to manage shared receive queue) to have lower np=2 latency, and it would scale well with large np. No tuning needed, no special cases, smaller memory footprint. I will leave it at that, just some inputs. Patrick
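The scaling argument in this message can be illustrated with a toy model. This is only a sketch: the function names are hypothetical and nothing here is Open MPI code. It counts how many queues the receiver has to inspect before it finds a message when there is one FIFO per sender, compared with one shared receive queue.

```python
from collections import deque


def poll_per_sender_queues(queues):
    """Scan one FIFO per sender; return (message, number_of_queues_checked)."""
    checked = 0
    for q in queues:
        checked += 1
        if q:
            return q.popleft(), checked
    return None, checked


def poll_single_queue(queue):
    """Check one shared receive queue; the cost does not depend on sender count."""
    if queue:
        return queue.popleft(), 1
    return None, 1


def worst_case_checks(num_senders):
    """Only the last sender has a message pending: count checks in each design."""
    per_sender = [deque() for _ in range(num_senders)]
    per_sender[-1].append("msg")
    shared = deque(["msg"])
    _, n_checks = poll_per_sender_queues(per_sender)
    _, one_check = poll_single_queue(shared)
    return n_checks, one_check
```

In this model the per-sender design inspects as many queues as there are senders in the worst case, while the shared queue always costs one check. The model ignores the cost of the atomic or spinlock that senders would need on the shared queue.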
Re: [OMPI devel] RFC: sm Latency
Eugene, All my remarks are related to the receive side. I think the send side optimizations are fine, but don't take my word for it. Eugene Loh wrote: > To recap: > 1) The work is already done. How do you do "directed polling" with ANY_TAG ? How do you ensure you check all incoming queues from time to time to prevent flow control (especially if the queues are small for scaling) ? What about the one-sided that Brian mentioned where there is no corresponding receive to tell you which queue to poll ? If you want to handle all the constraints, a single-queue model is much less work in the end, IMHO. > 2) The single-queue model addresses only one of the RFC's issues. The single-queue model addresses not only the latency overhead when scaling, but also the exploding memory footprint. In many ways, these problems are the same ones that plagued the RDMA QP model, and the only solution was using shared receive queues. From experience, the linear overhead of polling N queues very quickly becomes greater than all the optimizations you can do on the send side. > 3) I'm a fan of the single-queue model, but it's just a separate discussion. No problem. You are the one doing the real work here, the rest is armchair quarterbacking :-) Patrick
Re: [OMPI devel] RFC: sm Latency
Hi Eugene, Eugene Loh wrote: >> replace the fifo’s with a single link list per process in shared >> memory, with senders to this process adding match envelopes >> atomically, with each process reading its own link list (multiple > *) Doesn't strike me as a "simple" change. Actually, it's much simpler than trying to optimize/scale the N^2 implementation, IMHO. > *) Not sure this addresses all-to-all well. E.g., let's say you post a > receive for a particular source. Do you then wade through a long FIFO > to look for your match? The tradeoff is between demultiplexing by the sender, which costs time and space, or by the receiver, which costs an atomic inc. ANY_TAG forces you to demultiplex on the receive side anyway. Regarding all-to-all, it won't be more expensive if the receives are pre-posted, and they should be. > What the RFC talks about is not the last SM development we'll ever > need. It's only supposed to be one step forward from where we are > today. The "single queue per receiver" approach has many advantages, > but I think it's a different topic. But is this intermediate step worth it or should we (well, you :-) ) go directly for the single queue model ? Patrick
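Receive-side demultiplexing on a single per-process list, as discussed in this message, could look roughly like the sketch below. It is a model only: the class, the wildcard values and the tuple layout are hypothetical, and a plain deque append stands in for the atomic insertion into shared memory that a real implementation would need.

```python
from collections import deque

# Hypothetical wildcard values for this sketch (not the MPI constants).
ANY_SOURCE = -1
ANY_TAG = -1


class ReceiveQueue:
    """One list of match envelopes per receiving process."""

    def __init__(self):
        self._envelopes = deque()

    def push(self, source, tag, payload):
        # Stand-in for a sender atomically appending its envelope.
        self._envelopes.append((source, tag, payload))

    def match(self, source=ANY_SOURCE, tag=ANY_TAG):
        """Return and remove the oldest envelope matching (source, tag)."""
        for i, (src, t, payload) in enumerate(self._envelopes):
            if source in (ANY_SOURCE, src) and tag in (ANY_TAG, t):
                del self._envelopes[i]
                return src, t, payload
        return None
```

Because all envelopes land in arrival order in one list, a wildcard receive needs no special case: it simply takes the oldest entry. A receive for a specific source may have to skip entries, which is the cost raised in the quoted question about all-to-all.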
Re: [OMPI devel] 1.3 PML default choice
Jeff Squyres wrote: Gaah! I specifically asked Patrick and George about this and they said that the README text was fine. Grr... When I looked at that time, I vaguely remember that _both_ PMLs were initialized but CM was eventually used because it was the last one. It looked broken, but it worked in the end (MTL was used with CM PML). I don't know if that behavior changed since. Patrick
Re: [OMPI devel] shared-memory allocations
Richard Graham wrote: Yes - it is polling volatile memory, so has to load from memory on every read. Actually, it will poll in cache, and only load from memory when the cache coherency protocol invalidates the cache line. Volatile semantics only prevent compiler optimizations. It does not matter much where the pages are (closer to reader or receiver) on NUMAs, as long as they are equally distributed among all sockets (ie the choice is consistent). Cache prefetching is slightly more efficient on local socket, so closer to reader may be a bit better. Patrick
Re: [OMPI devel] More README questions
Jeff Squyres wrote: - There's a big chunk of text about MX that I have no idea if it's still up-to-date / correct or not. Looks good to me. Patrick
[OMPI devel] mallopt and registration cache
Gentlemen, I have been looking at a data corruption with the MX btl or mtl with the 1.3 branch when trying to use MX registration cache. The related ticket is #1525, opened by Tim. In 1.3, mallopt() is used to never trim memory, in replacement of the malloc overload by ptmalloc2. MX provides its own malloc hooks, but they can't work when the lib is dlopen()ed, so MX has to rely on OMPI to make the registration cache safe. Apparently, mallopt() is only called in the initialization of the mpool component. However, MX btl or mtl do not use the mpool. There is a mallopt memory module in opal, but it assumes that the mpool is used. What is the best way to fix this issue ? * move the mallopt calls out of the mpool init. * use a fake mpool in the MX btl and mtl. * duplicate the mallopt calls directly in the MX btl and mtl. I got lost looking at the mpool code, so I may be completely wrong here. Patrick
Re: [OMPI devel] RFC: make mpi_leave_pinned=1 the default
Jeff Squyres wrote: WHAT: make mpi_leave_pinned=1 by default when a BTL is used that would benefit from it (when possible; 0 when not, obviously) Comments? The probable reason registration cache (aka leave_pinned) is disabled by default is that it may be unsafe. Even if you use mallopt to never return memory to the OS, how do you guarantee that: * malloc always enforces the mallopt *hints*. * pinned memory can safely be fork()ed (system() for example). * pinned memory can safely be munmap()ed (Direct I/O or file mapping for example). If you can't, one solution may be to write a simple MPI code that corrupts MVAPICH and make some noise about it. My 2 cents. Patrick
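To make the safety concern concrete, here is a minimal model of a registration cache. All names are hypothetical and the "handles" are just strings; no real pinning happens. The point is that a cached entry stays valid only as long as something calls invalidate() whenever the underlying range is returned to the OS (free with trimming, munmap, and so on). If that hook is missed and the same virtual address is later mapped again, the cache would hand back a handle for pages that are no longer behind that address.

```python
class RegistrationCache:
    """Toy cache mapping (addr, length) to a registration handle."""

    def __init__(self):
        self._regions = {}       # (addr, length) -> handle
        self.registrations = 0   # number of "expensive" registrations done

    def register(self, addr, length):
        """Return a cached handle, registering only on a cache miss."""
        key = (addr, length)
        if key not in self._regions:
            self.registrations += 1
            self._regions[key] = f"handle-{self.registrations}"
        return self._regions[key]

    def invalidate(self, addr, length):
        """Drop every cached region overlapping [addr, addr + length).

        In a real system this would be driven by a free/munmap hook
        (an assumption of this sketch, not a description of Open MPI).
        """
        end = addr + length
        stale = [k for k in self._regions
                 if k[0] < end and addr < k[0] + k[1]]
        for k in stale:
            del self._regions[k]
        return len(stale)
```

The three bullet points in the message are all ways in which the invalidate() call can fail to happen, or in which the pinned pages can change underneath the cache without any user-space event.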
Re: [OMPI devel] Notes from mem hooks call today
Hi Roland, Roland Dreier wrote: Stick in a separate library then? I don't think we want the complexity in the kernel -- I personally would argue against merging it upstream; and given that the userspace solution is actually faster, it becomes pretty hard to justify. Memory registration has always been expensive, so it's not in the critical path (not used for small messages and a system call overhead is nothing for large messages in MPI). Sure, you can have the kernel notify the user space through mapped flags, but it's a bit ugly IMHO. There are cases where the basic registration already uses the same infrastructure as a regcache. For example, on Solaris, MacOSX and Linux PowerPC, you really want to register segments as large as possible to limit the IOMMU overhead. You also don't want to register the same page multiple times with overlapping registrations, because the IOMMU space is limited. In short, you already have a registration cache in the driver. However, if the user space is expected to call register/deregister often, then I agree that the cache had better be in user space. The big picture is that it's not really important where the regcache lives, as long as it's out of MPI. Patrick
Re: [OMPI devel] Memory hooks stuff
Hi Jeff, Jeff Squyres wrote: the topic of the memory hooks came up again. Brian was wondering if we should [finally] revisit this topic -- there's a few things that could be done to make life "better". Two things jump to mind: - using mallopt on Linux What about using the (probably) upcoming mmu notifiers and avoid ugly hacks in user space ? - doing *something* on Solaris Implementing the same kind of notifiers in Solaris ? Patrick
Re: [OMPI devel] RFC: Linuxes shipping libibverbs
Brian W. Barrett wrote: With MX, it's one initialization call (mx_init), and it's not clear from the errors it can return that you can differentiate between the two cases. If you run mx_init() on a machine without the MX driver loaded or no NIC detected by the driver, you get a specific error code (MX_NO_DEV) and the default error handler prints something like: MX:asterix:mx_init:querying driver:error 5(errno=2):No MX device entry in /dev. You can overload the default error handler to not see the message. Patrick
Re: [OMPI devel] patch for building gm btl
Paul, Paul H. Hargrove wrote: discuss what tests we will run, but it will probably be a very minimal set. Once we both have MTT setup and running GM tests, we should compare configs to avoid overlap (and thus increase coverage). That would be great. I have only one 32-node 2G cluster I can use full-time for MTT testing for GM, MX, OpenMPI, MPICH{1,2}, HP-MPI, and many more. One thing I quickly learned with MTT is that there are only 24 hours in a day :-) Patrick
Re: [OMPI devel] patch for building gm btl
Hi Paul, Paul H. Hargrove wrote: The fact that this has gone unfixed for 2 months suggests to me that nobody is building the GM BTL. So, how would I go about checking ... a) ...if there exists any periodic build of the GM BTL via MTT? We are deploying MTT on all our clusters. Right now, we use our own MTT server, but we will report a subset of the tests to the OpenMPI server once everything is working. c) ...which GM library versions such builds, if any, compile against There are no GM tests currently under our still-evolving MTT setup. Once we have a working setup, we will run a single Pallas test on 32 nodes with GM-2.1.28, two 2G NICs per node (single and dual port). There is no active development on GM, just kernel updates, so the GM version does not matter much. Patrick
Re: [OMPI devel] SDP support for OPEN-MPI
Lenny Verkhovsky wrote: We would like to add SDP support for OPENMPI. SDP can be used to accelerate job start ( oob over sdp ) and IPoIB performance. I fail to see the reason to pollute the TCP btl with IB-specific SDP stuff. For the oob, this is arguable, but doesn't SDP allow for *transparent* socket replacement at runtime ? In this case, why not use this mechanism and keep the code clean ? Patrick
Re: [OMPI devel] Dynamically Turning On and Off Memory Manager of Open MPI at Runtime??
Hi Peter, Peter Wong wrote: Open MPI defines its own malloc (by default), so malloc of glibc is not called. But, without calling malloc of glibc, the allocator of libhugetlbfs to back text and dynamic data by large pages, e.g., 16MB pages on POWER systems, is not used. You could modify ptmalloc2 in OpenMPI to allocate Huge Pages directly. It would be a nice feature. Patrick
Re: [OMPI devel] collective problems
Hi Gleb, Gleb Natapov wrote: In the case of TCP, kernel is kind enough to progress message for you, but only if there was enough space in a kernel internal buffers. If there was no place there, TCP BTL will also buffer messages in userspace and will, eventually, have the same problem. Occasionally buffering to hide a flow-control issue is fine, assuming that there is a mechanism to flush the buffer (below). However, you cannot buffer everything and it is just as fine to expose the back pressure when the buffer space is exhausted, to show the application that there is a sustained problem. In this case, it is reasonable to block the application (ie the MPI request) while you cannot buffer the outgoing data. The progression of already buffered outgoing data is the real problem, not the buffering itself. Here, the proposal is to allow the BTL to buffer, but requires the PML to handle progress. That's broken, IMHO. To progress such outstanding messages additional thread is needed in userspace. Is this what MX does? MX uses a user-level thread but it's mainly for progressing the higher-level protocol on the receive side. On the send side for the low-level protocol, it is easier to ask your driver to either wake you up when the sending resource is available again (blocking on a CQ for IB) or take care of the sending itself. My overall problem with this proposal is a race to the bottom, based on the lowest BTL, functionality-wise. The PML already imposes a pipelining for large messages (with a few knobs, but still) when most protocols in other BTLs already have their own. Now it's flow-control progression (not MPI progression). Can each BTL implement what is needed for a particular back-end instead of bloating the upper layer ? Patrick
Re: [OMPI devel] collective problems
Jeff Squyres wrote: This is not a problem in the current code base. Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send was successful, but it has not yet been sent (e.g., openib BTL buffered it because it ran out of credits). Sorry if I am missing something obvious, but why does the PML have to be aware of the flow control situation of the BTL ? If the BTL cannot send something right away for any reason, it should be the responsibility of the BTL to buffer it and to progress on it later. Patrick
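The division of responsibility argued for in this thread can be sketched as a toy transport that owns both its buffering and its progression. Everything below is hypothetical (class name, credit scheme, the max_pending limit); it is not the openib BTL. The upper layer only ever sees "accepted" or "not accepted", never a NOT_ON_WIRE-style status.

```python
from collections import deque


class CreditBtl:
    """Toy transport with send credits and a bounded internal buffer."""

    def __init__(self, credits, max_pending):
        self.credits = credits
        self.max_pending = max_pending
        self.pending = deque()   # fragments accepted but not yet sent
        self.wire = []           # fragments actually sent, in order

    def send(self, fragment):
        """Accept a fragment if there is buffer space; False is back pressure."""
        self.progress()
        if len(self.pending) >= self.max_pending:
            return False
        self.pending.append(fragment)
        self.progress()
        return True

    def progress(self):
        """Drain buffered fragments in order while credits are available."""
        while self.pending and self.credits > 0:
            self.wire.append(self.pending.popleft())
            self.credits -= 1

    def credits_returned(self, n):
        """Called when the peer returns credits; the transport progresses itself."""
        self.credits += n
        self.progress()
```

A False return from send() corresponds to exposing back pressure once the buffer is exhausted, at which point the caller can block the request. In a real transport credits_returned() would be triggered by the driver or a completion event rather than by the upper layer.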
Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
Hi Bogdan, Bogdan Costescu wrote: I made some progress: if I configure with "--without-memory-manager" (along with all other options that I mentioned before), then it works. This was inspired by the fact that the segmentation fault occurred in ptmalloc2. I have previously tried to remove the MX support without any effect; with ptmalloc2 out of the picture I have had test runs over MX and TCP without problems. We have had portability problems using ptmalloc2 in MPICH-GM, especially with respect to threads. In MX, we chose to use dlmalloc instead. It is not as optimized and its thread-safety has a coarser grain, but it is much more portable. Disabling the memory manager in OpenMPI is not a bad thing for MX, as its own dlmalloc-based registration cache will operate transparently with MX_RCACHE=1 (default). Patrick
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r15041
Jeff Squyres wrote: Let's take a step back and see exactly what we *want*. Then we can talk about how to have an interface for it. I must be missing something but why is the bandwidth/latency passed by the user (by whatever means) ? Wouldn't it be easier to automagically get these values by probing the hardware or have the BTL make an educated guess ? You can figure out at runtime the link rate of an eth device for example. You would want to have a (possibly complicated) way to force any value, but the default should be invisible, no ? Patrick
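As an illustration of probing instead of asking the user, the sketch below reads the link rate that Linux exposes under /sys/class/net/<iface>/speed (in Mb/s) and only falls back to a user-supplied value when one is explicitly given. The function names and the override parameter are hypothetical; the sysfs file is Linux-specific and can be unreadable or report a non-positive value when the link is down or the driver does not support it, which the sketch maps to None.

```python
from pathlib import Path


def link_speed_mbps(iface, sysfs_root="/sys/class/net"):
    """Return the link rate of `iface` in Mb/s, or None if it is unknown."""
    try:
        text = (Path(sysfs_root) / iface / "speed").read_text().strip()
        speed = int(text)
    except (OSError, ValueError):
        return None
    return speed if speed > 0 else None


def bandwidth_hint(iface, user_value=None, sysfs_root="/sys/class/net"):
    """An explicit user value wins; otherwise use the probed link rate."""
    if user_value is not None:
        return user_value
    return link_speed_mbps(iface, sysfs_root)
```

The sysfs_root parameter exists only so the function can be pointed at a test directory; by default nothing needs to be configured.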
Re: [OMPI devel] Best bw/lat performance for microbenchmark/debug utility
Jeff Squyres (jsquyres) wrote: -Original Message- From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Patrick Geoffray Sent: Wednesday, June 28, 2006 1:23 PM To: Open MPI Developers Subject: Re: [OMPI devel] Best bw/lat performance for microbenchmark/debug utility Josh Aune wrote: I am writing up some interconnect/network debugging software that is centered around ompi. What is the best set of functions to I was assuming that you would be testing latency/bandwidth, but Patrick is correct in stating that there are many more things to test than just those two metrics. There are a lot of metrics, but most of them require deep understanding of the MPI semantics and implementation details to make sense. The art of micro-benchmark is to choose the metrics and explain why they matter. It's obvious for latency/bandwidth, a bit less for unexpected and host overhead, definitely hard for overlap and progress. And that's just for point-to-point. To avoid reinventing the wheel, I would suggest to Josh to develop a micro-benchmark test suite to compute a very detailed set of LogP-derived parameters, ie for all message sizes: * send overhead (o.s) and recv overhead (o.r). These overheads will likely be either constant or linear for various message size ranges, it would be great to automatically compute the ranges. Memory registration cost is accounted here, so it would be useful to measure with and without registration cache also. * Latency (L). * Send gap (g.s) and recv gap (g.r). For large messages, they will likely be identical and represent the link bandwidth. For smaller messages, the send gap is the gap of a fan-out pattern (1->N) and the recv gap is the gap of a flat gather (N->1). It's important to not have the send or recv overhead hiding the send or recv gap, several processes could be used to divide the send/recv overhead. * unexpected overhead (o.u). Overhead added to (o.r) when the message is not immediately matched. 
* overlap availability (a) that is the percentage of communication time that you can overlap with real host computation. From these parameters, you can derive pretty much all characteristics of an interconnect without contention. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com
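As a rough illustration of how such parameters can be combined, the sketch below applies the usual LogP-style formulas for one message size. The class and function names are hypothetical, the units are arbitrary (microseconds, say), and the pipelined formula is a simplifying assumption of this sketch (the slowest of the overhead and gap stages sets the message interval), not something stated in the message above.

```python
from dataclasses import dataclass


@dataclass
class LogPParams:
    o_s: float        # send overhead
    o_r: float        # recv overhead
    L: float          # latency
    g_s: float        # send gap
    g_r: float        # recv gap
    o_u: float = 0.0  # extra overhead when the message is unexpected


def one_way_time(p, expected=True):
    """Time for one message: o.s + L + o.r, plus o.u if it was unexpected."""
    return p.o_s + p.L + p.o_r + (0.0 if expected else p.o_u)


def stream_time(p, n):
    """Time for n back-to-back expected messages between two processes.

    Assumes the slowest stage (overhead or gap on either side) sets the
    interval between consecutive messages.
    """
    interval = max(p.o_s, p.g_s, p.o_r, p.g_r)
    return one_way_time(p) + (n - 1) * interval
```

For example, with o.s = o.r = 1, L = 2 and gaps of 0.5, a single expected message costs 4 time units and a stream of 10 costs 13, because the host overhead, not the gap, limits the message rate in that case.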
Re: [OMPI devel] Best bw/lat performance for microbenchmark/debug utility
Josh Aune wrote: I am writing up some interconnect/network debugging software that is centered around ompi. What is the best set of functions to use to get the best bandwidth and latency numbers for openmpi and why? I've been You mean MPI functions or internal ompi functions ? For MPI functions, it depends on what you are looking for. Send/recv is fine but it does not show the overlap capability. You would need to do something smarter with Isend/Irecv/Wait for that (Sandia has a nice bench that they should release soon). You may also want to measure the penalty for unexpected messages, the host CPU overhead and the ability to progress. All of these metrics are measured by existing benchmarks, do you want to write one that covers everything or something like IMB ? Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com