Re: [OMPI devel] RFC: sm Latency
Ron Brightwell wrote: If you poll only the queue that corresponds to a posted receive, you only optimize micro-benchmarks, until they start using ANY_SOURCE. Note that the HPCC RandomAccess benchmark only uses MPI_ANY_SOURCE (and MPI_ANY_TAG). But HPCC RandomAccess also uses only non-blocking receives. So, it's somewhat outside the scope of the original ideas here (bypassing the PML receive-request data structure). It's possibly not even a poster child for the single-queue idea either. The single queue probably shines best when you have to poll all connections for a few messages. In contrast, RandomAccess (I think) loads all connections up randomly (pseudo-evenly).
Re: [OMPI devel] RFC: sm Latency
Patrick Geoffray wrote: Eugene Loh wrote: Possibly, you meant to ask how one does directed polling with a wildcard source MPI_ANY_SOURCE. If that was your question, the answer is we punt. We report failure to the ULP, which reverts to the standard code path. Sorry, I meant ANY_SOURCE. If you poll only the queue that corresponds to a posted receive, you only optimize micro-benchmarks, until they start using ANY_SOURCE. Right. So, is recvi() a one-time shot? I.e., do you poll the right queue only once and, if that fails, fall back on polling all queues? You poll it "some". The BTL is granted some leeway in what "immediately" means. If yes, then it's unobtrusive, but I don't think it would help much. Well, check the RFC. The data shows huge improvements in HPCC latency. If you poll the right queue many times, then you have to decide when to fall back on polling all queues, and that's not trivial. It's not 100% satisfactory, but clearly OMPI (and every other MPI implementation, and just about any major piece of HPC software) has to guess among all sorts of trade-offs. Many of those trade-offs are user tunable -- hence those pages and pages of compiler options (pick your favorite compiler), build flags, MCA parameters, etc. How do you ensure you check all incoming queues from time to time to prevent flow-control stalls (especially if the queues are small for scaling)? There are a variety of choices here. Further, I'm afraid we ultimately have to expose some of those choices to the user (MCA parameters or something). In the vast majority of cases, users don't know how to turn the knobs. Totally agree. Exposing these choices to the users is ugly, and expecting users to make such choices is ridiculous. Though, for what it's worth: % ompi_info -a | wc -l 1037 % I actually agree with you a lot. I do think that my RFC represents one step forward. I'll see how quickly I can prototype and characterize a single-queue solution so we can judge the alternatives more diligently.
Re: [OMPI devel] RFC: sm Latency
> > Possibly, you meant to ask how one does directed polling with a wildcard > > source MPI_ANY_SOURCE. If that was your question, the answer is we > > punt. We report failure to the ULP, which reverts to the standard code > > path. > > Sorry, I meant ANY_SOURCE. If you poll only the queue that corresponds to > a posted receive, you only optimize micro-benchmarks, until they start > using ANY_SOURCE. > [...] Note that the HPCC RandomAccess benchmark only uses MPI_ANY_SOURCE (and MPI_ANY_TAG). -Ron
Re: [OMPI devel] RFC: sm Latency
Eugene Loh wrote: Possibly, you meant to ask how one does directed polling with a wildcard source MPI_ANY_SOURCE. If that was your question, the answer is we punt. We report failure to the ULP, which reverts to the standard code path. Sorry, I meant ANY_SOURCE. If you poll only the queue that corresponds to a posted receive, you only optimize micro-benchmarks, until they start using ANY_SOURCE. So, is recvi() a one-time shot? I.e., do you poll the right queue only once and, if that fails, fall back on polling all queues? If yes, then it's unobtrusive, but I don't think it would help much. If you poll the right queue many times, then you have to decide when to fall back on polling all queues, and that's not trivial. How do you ensure you check all incoming queues from time to time to prevent flow-control stalls (especially if the queues are small for scaling)? There are a variety of choices here. Further, I'm afraid we ultimately have to expose some of those choices to the user (MCA parameters or something). In the vast majority of cases, users don't know how to turn the knobs. The problem is that with local np going up, queue sizes will go down fast (square root), and you will have to poll all queues more often. Using more memory for queues just pushes the scalability wall a little bit further. What if the user code then posts a rather specific request (receive a message with a particular tag on a particular communicator from a particular source) and with high urgency (blocking request... "I ain't going anywhere until you give me what I'm asking for")? A good servant would drop whatever else s/he is doing to oblige the boss. If you poll only one queue, then stuff can pile up on another and a sender is now blocked. At best, you have a synchronization point. At worst, a deadlock. So, let's say there's a standard MPI_Recv. Let's say there's also some congestion starting to build. What should the MPI implementation do?
The MPI implementation cannot trust the user/app to indicate where the messages will come from. So, if you have N incoming queues, you need to poll them all eventually. If you do, polling time increases linearly. If you try to limit the polling space with some heuristic (like the queue corresponding to the current blocking receive), then you take the risk of not consuming another queue fast enough. And usually, the heuristics quickly fall apart (ANY_SOURCE, multiple asynchronous receives, etc.). Really, only a single queue solves that. Yes, and you could toss the receive-side optimizations as well. So, one could say, "Our np=2 latency remains 2x slower than Scali's, but at least we no longer have that hideous scaling with large np." Maybe that's where we want to end up. I think all optimizations except recvi() are fine and worth using. I am just saying that the recvi() optimization is dubious as it is, and the single queue is potentially the bigger low-hanging fruit on the recv side: it could still be fast (a spinlock or atomic to manage the shared receive queue), giving low np=2 latency, and it would scale well with large np. No tuning needed, no special cases, smaller memory footprint. I will leave it at that, just some input. Patrick
Re: [OMPI devel] Fortran 90 Interface
Can you send the information listed here: http://www.open-mpi.org/community/help/ On Jan 21, 2009, at 5:25 PM, David Robertson wrote: Hello, I'm having a problem with MPI_COMM_WORLD in Fortran 90. I have tried with OpenMPI versions 1.2.6, 1.2.8 and 1.3. All versions are compiled with the PGI 8.0-2 suite. I've run the program in a debugger; with "USE mpi", MPI_COMM_WORLD returns 'Cannot find name "MPI_COMM_WORLD"'. If I use "include mpif.h", results are a little better: MPI_COMM_WORLD returns 0 (the initial value assigned by mpif-common.h). The MPI functions don't seem to be affected by the fact that MPI_COMM_WORLD is unset or equal to 0. For example, the following works just fine: CALL mpi_init (MyError) CALL mpi_comm_rank (MPI_COMM_WORLD, MyRank, MyError) CALL mpi_comm_size (MPI_COMM_WORLD, Nnodes, MyError) even though, in the debugger, MPI_COMM_WORLD is unset or zero every step of the way. However, when I try to use MPI_COMM_WORLD in a non-MPI-standard function (NetCDF-4 in this case): status=nf90_create_par(TRIM(ncname), & & OR(nf90_clobber, nf90_netcdf4), & & MPI_COMM_WORLD, info, ncid) I get the following error: [daggoo:07640] *** An error occurred in MPI_Comm_dup [daggoo:07640] *** on communicator MPI_COMM_WORLD [daggoo:07640] *** MPI_ERR_COMM: invalid communicator [daggoo:07640] *** MPI_ERRORS_ARE_FATAL (goodbye) I have tried the exact same code compiled and run with MPICH2 (also PGI 8.0-2) and the problem does not occur. If I have forgotten any details needed to debug this issue, please let me know. Thanks, David Robertson ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems
[OMPI devel] Fortran 90 Interface
Hello, I'm having a problem with MPI_COMM_WORLD in Fortran 90. I have tried with OpenMPI versions 1.2.6, 1.2.8 and 1.3. All versions are compiled with the PGI 8.0-2 suite. I've run the program in a debugger; with "USE mpi", MPI_COMM_WORLD returns 'Cannot find name "MPI_COMM_WORLD"'. If I use "include mpif.h", results are a little better: MPI_COMM_WORLD returns 0 (the initial value assigned by mpif-common.h). The MPI functions don't seem to be affected by the fact that MPI_COMM_WORLD is unset or equal to 0. For example, the following works just fine: CALL mpi_init (MyError) CALL mpi_comm_rank (MPI_COMM_WORLD, MyRank, MyError) CALL mpi_comm_size (MPI_COMM_WORLD, Nnodes, MyError) even though, in the debugger, MPI_COMM_WORLD is unset or zero every step of the way. However, when I try to use MPI_COMM_WORLD in a non-MPI-standard function (NetCDF-4 in this case): status=nf90_create_par(TRIM(ncname), & & OR(nf90_clobber, nf90_netcdf4), & & MPI_COMM_WORLD, info, ncid) I get the following error: [daggoo:07640] *** An error occurred in MPI_Comm_dup [daggoo:07640] *** on communicator MPI_COMM_WORLD [daggoo:07640] *** MPI_ERR_COMM: invalid communicator [daggoo:07640] *** MPI_ERRORS_ARE_FATAL (goodbye) I have tried the exact same code compiled and run with MPICH2 (also PGI 8.0-2) and the problem does not occur. If I have forgotten any details needed to debug this issue, please let me know. Thanks, David Robertson
Re: [OMPI devel] VT problems on Debian
Can't speak officially for the VT folks, but it looks like the following bits in ompi/contrib/vt/vt/acinclude.m4 need to list SPARC and Alpha (maybe ARM?) alongside MIPS as gettimeofday() platforms. Alternatively (perhaps preferred), one should turn this around to explicitly list the platforms that *do* have cycle-counter support (ppc64, ppc32, ia64, x86 IIRC) rather than listing those that don't. -Paul

case $PLATFORM in
  linux)
    AC_DEFINE([TIMER_GETTIMEOFDAY], [1], [Use `gettimeofday' function])
    AC_DEFINE([TIMER_CLOCK_GETTIME], [2], [Use `clock_gettime' function])
    case $host_cpu in
      mips*)
        AC_DEFINE([TIMER], [TIMER_GETTIMEOFDAY], [Use timer (see below)])
        AC_MSG_NOTICE([selected timer: TIMER_GETTIMEOFDAY])
        ;;
      *)
        AC_DEFINE([TIMER_CYCLE_COUNTER], [3], [Cycle counter (e.g. TSC)])
        AC_DEFINE([TIMER], [TIMER_CYCLE_COUNTER], [Use timer (see below)])
        AC_MSG_NOTICE([selected timer: TIMER_CYCLE_COUNTER])
        ;;
    esac
    ;;

Jeff Squyres wrote: The Debian OMPI maintainers raised a few failures on some of their architectures to my attention -- it looks like there's some wonkyness on Debian on SPARC and Alpha -- scroll to the bottom of these two pages: http://buildd.debian.org/fetch.cgi?&pkg=openmpi&ver=1.3-1&arch=sparc&stamp=1232513504&file=log http://buildd.debian.org/fetch.cgi?&pkg=openmpi&ver=1.3-1&arch=alpha&stamp=1232510796&file=log They both seem to incur the same error: gcc -DHAVE_CONFIG_H -I. -I../../../../../../../ompi/contrib/vt/vt/vtlib -I..
-I../../../../../../../ompi/contrib/vt/vt/tools/opari/lib -I../../../../../../../ompi/contrib/vt/vt/extlib/otf/otflib -I../extlib/otf/otflib -I../../../../../../../ompi/contrib/vt/vt -D_GNU_SOURCE -DBINDIR=\"/usr/bin\" -DDATADIR=\"/usr/share/vampirtrace\" -DRFG -DVT_MEMHOOK -DVT_IOWRAP -Wall -g -O2 -MT vt_pform_linux.o -MD -MP -MF .deps/vt_pform_linux.Tpo -c -o vt_pform_linux.o ../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c ../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c: In function 'vt_pform_wtime': ../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c:179: error: impossible constraint in 'asm' make[6]: *** [vt_pform_linux.o] Error 1 make[6]: Leaving directory `/build/buildd/openmpi-1.3/build/shared/ompi/contrib/vt/vt/vtlib' VT guys -- any ideas? -- Paul H. Hargrove phhargr...@lbl.gov Future Technologies Group Tel: +1-510-495-2352 HPC Research Department Fax: +1-510-486-6900 Lawrence Berkeley National Laboratory
Re: [OMPI devel] RFC: Use of ompi_proc_t flags field
Appropriate mapper components will be used, along with a file describing which nodes are in which CU etc. So it won't be so much a matter of discovery as pre-knowledge. On Jan 21, 2009, at 12:02 PM, Jeff Squyres wrote: Sounds reasonable. How do you plan to discover this information? On Jan 21, 2009, at 9:58 AM, Ralph Castain wrote: What: Extend the current use of the ompi_proc_t flags field (without changing the field itself) Why: Provide more atomistic sense of locality to support new collective/BTL components Where: Add macros to define and check the various flag fields in ompi/proc.h. Revise the orte_ess.proc_is_local API to return a uint8_t instead of bool. When: For OMPI v1.4 Timeout: COB Fri, Feb 6, 2009 The current ompi_proc_t structure has a uint8_t flags field in it. Only one bit of this field is currently used to flag that a proc is "local". In the current context, "local" is constrained to mean "local to this node". New collectives and BTL components under development by LANL (in partnership with others) require a greater degree of granularity on the term "local". For our work, we need to know if the proc is on the same socket, PC board, node, switch, and CU (computing unit). We therefore propose to define some of the unused bits to flag these "local" conditions. This will not extend the field's size, nor impact any other current use of the field. Our intent is to add #define's to designate which bits stand for which local condition. To make it easier to use, we will add a set of macros that test the specific bit - e.g., OMPI_PROC_ON_LOCAL_SOCKET. These can be used in the code base to clearly indicate which sense of locality is being considered. We would also modify the orte_ess modules so that each returns a uint8_t (to match the ompi_proc_t field) that contains a complete description of the locality of this proc. Obviously, not all environments will be capable of providing such detailed info. 
Thus, getting a "false" from a test for "on_local_socket" may simply indicate a lack of knowledge. This is acceptable for our purposes as the algorithm will simply perform sub-optimally, but will still work. Please feel free to comment and/or request more information. Ralph -- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: Use of ompi_proc_t flags field
Sounds reasonable. How do you plan to discover this information? On Jan 21, 2009, at 9:58 AM, Ralph Castain wrote: What: Extend the current use of the ompi_proc_t flags field (without changing the field itself) Why: Provide more atomistic sense of locality to support new collective/BTL components Where: Add macros to define and check the various flag fields in ompi/proc.h. Revise the orte_ess.proc_is_local API to return a uint8_t instead of bool. When: For OMPI v1.4 Timeout: COB Fri, Feb 6, 2009 The current ompi_proc_t structure has a uint8_t flags field in it. Only one bit of this field is currently used to flag that a proc is "local". In the current context, "local" is constrained to mean "local to this node". New collectives and BTL components under development by LANL (in partnership with others) require a greater degree of granularity on the term "local". For our work, we need to know if the proc is on the same socket, PC board, node, switch, and CU (computing unit). We therefore propose to define some of the unused bits to flag these "local" conditions. This will not extend the field's size, nor impact any other current use of the field. Our intent is to add #define's to designate which bits stand for which local condition. To make it easier to use, we will add a set of macros that test the specific bit - e.g., OMPI_PROC_ON_LOCAL_SOCKET. These can be used in the code base to clearly indicate which sense of locality is being considered. We would also modify the orte_ess modules so that each returns a uint8_t (to match the ompi_proc_t field) that contains a complete description of the locality of this proc. Obviously, not all environments will be capable of providing such detailed info. Thus, getting a "false" from a test for "on_local_socket" may simply indicate a lack of knowledge. This is acceptable for our purposes as the algorithm will simply perform sub-optimally, but will still work. Please feel free to comment and/or request more information. 
Ralph -- Jeff Squyres Cisco Systems
[OMPI devel] VT problems on Debian
The Debian OMPI maintainers raised a few failures on some of their architectures to my attention -- it looks like there's some wonkyness on Debian on SPARC and Alpha -- scroll to the bottom of these two pages: http://buildd.debian.org/fetch.cgi?&pkg=openmpi&ver=1.3-1&arch=sparc&stamp=1232513504&file=log http://buildd.debian.org/fetch.cgi?&pkg=openmpi&ver=1.3-1&arch=alpha&stamp=1232510796&file=log They both seem to incur the same error: gcc -DHAVE_CONFIG_H -I. -I../../../../../../../ompi/contrib/vt/vt/vtlib -I.. -I../../../../../../../ompi/contrib/vt/vt/tools/opari/lib -I../../../../../../../ompi/contrib/vt/vt/extlib/otf/otflib -I../extlib/otf/otflib -I../../../../../../../ompi/contrib/vt/vt -D_GNU_SOURCE -DBINDIR=\"/usr/bin\" -DDATADIR=\"/usr/share/vampirtrace\" -DRFG -DVT_MEMHOOK -DVT_IOWRAP -Wall -g -O2 -MT vt_pform_linux.o -MD -MP -MF .deps/vt_pform_linux.Tpo -c -o vt_pform_linux.o ../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c ../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c: In function 'vt_pform_wtime': ../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c:179: error: impossible constraint in 'asm' make[6]: *** [vt_pform_linux.o] Error 1 make[6]: Leaving directory `/build/buildd/openmpi-1.3/build/shared/ompi/contrib/vt/vt/vtlib' VT guys -- any ideas? -- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: sm Latency
Patrick Geoffray wrote: Eugene Loh wrote: To recap: 1) The work is already done. How do you do "directed polling" with ANY_TAG? Not sure I understand the question. So, maybe we start by being explicit about what we mean by "directed polling". Currently, the sm BTL has connection-based FIFOs. That is, for each on-node sender/receiver (directed) pair, there is a FIFO. For a receiver to receive messages, it needs to check its in-bound FIFOs. It can check all in-bound FIFOs all the time to discover messages. By "directed polling", I mean that if the user posts a receive from a specified source, we poll only the FIFO on which that message is expected. With that in mind, let's go back to your question. If a user posts a receive with a specified source but a wildcard tag, we go to the specified FIFO. We check the item at the FIFO's tail and see whether it is the one we're looking for. The "ANY_TAG" comes into play only here, in the matching. It's unrelated to "directed polling", which has to do only with the source process. Possibly, you meant to ask how one does directed polling with a wildcard source MPI_ANY_SOURCE. If that was your question, the answer is we punt. We report failure to the ULP, which reverts to the standard code path. One alternative is, of course, the single receive queue. I agree that that alternative has many merits. To recap, however, the proposed optimizations are already "in the bag" (implemented in a workspace) and provide some optimizations that are orthogonal to the "directed polling" (and single-receive-queue) approach. I think there are also some uncertainties about the single-receive-queue approach, but I guess I'll just have to prototype that alternative to explore those uncertainties. How do you ensure you check all incoming queues from time to time to prevent flow-control stalls (especially if the queues are small for scaling)? There are a variety of choices here.
Further, I'm afraid we ultimately have to expose some of those choices to the user (MCA parameters or something). Let's say some congestion is starting to build on some internal OMPI resource. Arguably, we should do something to start relieving that congestion. What if the user code then posts a rather specific request (receive a message with a particular tag on a particular communicator from a particular source) and with high urgency (blocking request... "I ain't going anywhere until you give me what I'm asking for")? A good servant would drop whatever else s/he is doing to oblige the boss. So, let's say there's a standard MPI_Recv. Let's say there's also some congestion starting to build. What should the MPI implementation do? Alternatives include: A) If the receive can be completed "immediately", then do so and return control to the user as soon as possible. B) If the receive cannot be completed "immediately", fill your wait time with general housekeeping like relieving congested resources. C) Figure out what's on the critical path and do it. At least A should be available to the user. Probably also B, and the RFC proposal allows for that by rolling over to the traditional code path when the request cannot be satisfied "immediately". (That said, there are different definitions of "immediately" and different ways of implementing all this.) The definitions I've used for "immediately" include: *) We know which FIFO to check. *) The message is the next item on that FIFO. *) The message is being delivered entirely in one chunk. I am also going to add a time-out. One could also mix in a little bit of general polling. Unfortunately, there is no end to the artful tuning one could do. What about the one-sided case that Brian mentioned, where there is no corresponding receive to tell you which queue to poll? I appreciate Jeff's explanation, but I still don't understand this 100%. The receive side looks to see if it can handle the request "immediately".
It checks whether the next item on the specified FIFO is "the one". If it is, it completes the request. If not, it returns control to the ULP, which rolls over to the traditional code path. I don't 100% know how to handle the concern you/Brian raise, but I have the PML passing the flag MCA_PML_OB1_HDR_TYPE_MATCH into the BTL, saying "this is the kind of message to look for". Does this address the concern? The intent is that if the BTL encounters something it doesn't know how to handle, it reverts to the traditional receive code path. If you want to handle all the constraints, a single-queue model is much less work in the end, IMHO. Again, important speedups appear to be achievable if one bypasses the PML receive-request data structure. So, we're talking about optimizations that are orthogonal to the single-queue issue. 2) The single-queue model addresses only one of the RFC's issues. The single-queue model addresses not only the latency overhead when scaling, but also the explosion in memory footprint.
[OMPI devel] RFC: Use of ompi_proc_t flags field
What: Extend the current use of the ompi_proc_t flags field (without changing the field itself) Why: Provide more atomistic sense of locality to support new collective/BTL components Where: Add macros to define and check the various flag fields in ompi/proc.h. Revise the orte_ess.proc_is_local API to return a uint8_t instead of bool. When: For OMPI v1.4 Timeout: COB Fri, Feb 6, 2009 The current ompi_proc_t structure has a uint8_t flags field in it. Only one bit of this field is currently used to flag that a proc is "local". In the current context, "local" is constrained to mean "local to this node". New collectives and BTL components under development by LANL (in partnership with others) require a greater degree of granularity on the term "local". For our work, we need to know if the proc is on the same socket, PC board, node, switch, and CU (computing unit). We therefore propose to define some of the unused bits to flag these "local" conditions. This will not extend the field's size, nor impact any other current use of the field. Our intent is to add #define's to designate which bits stand for which local condition. To make it easier to use, we will add a set of macros that test the specific bit - e.g., OMPI_PROC_ON_LOCAL_SOCKET. These can be used in the code base to clearly indicate which sense of locality is being considered. We would also modify the orte_ess modules so that each returns a uint8_t (to match the ompi_proc_t field) that contains a complete description of the locality of this proc. Obviously, not all environments will be capable of providing such detailed info. Thus, getting a "false" from a test for "on_local_socket" may simply indicate a lack of knowledge. This is acceptable for our purposes as the algorithm will simply perform sub-optimally, but will still work. Please feel free to comment and/or request more information. Ralph
Re: [OMPI devel] RFC: sm Latency
Brian is referring to the "rdma" one-sided component (OMPI osd framework) that directly invokes the BTL functions (vs. using the PML send/receive functions). The osd matching is quite different from pt2pt matching. His concern is that that model continue to work -- e.g., if the rdma osd component sends a message through a BTL, the other side should not try to interpret and match it as a pt2pt message. Hence, the BTL would need to learn some new things; e.g., that it can match some (pml) messages but not all (rdma/osd), or perhaps it would need to learn about rdma/osd matching as well, or ...(something else)... IIRC, rdma/osd is the only other non-PML component that sends directly through the BTLs today. But that may change; I know that there are some who are working on various optimizations that may use the BTLs underneath (I don't want to cite them on a public list; this is unpublished research work at this point). On Jan 21, 2009, at 1:22 AM, Eugene Loh wrote: Brian Barrett wrote: I unfortunately don't have time to look in depth at the patch. But my concern is that currently (today, not at some made-up time in the future, maybe), we use the BTLs for more than just MPI point-to-point. The rdma one-sided component (which was added for 1.3 and hopefully will be the default for 1.4) sends messages directly over the btls. It would be interesting to know how that is handled. I'm not sure I understand what you're saying. Does it help to point out that existing BTL routines don't change? The existing sendi is just a function that, if available, can be used, where appropriate, to send "immediately". Similarly for the proposed recvi. No existing BTL functionality is removed. Just new, optional functions are added for whoever wants to (and can) use them. -- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: sm Latency
Richard Graham wrote: On 1/20/09 8:53 PM, "Jeff Squyres" wrote: Eugene: you mentioned that there are other possibilities to having the BTL understand match headers, such as a callback into the PML. Have you tried this approach to see what the performance cost would be, perchance? How is this different from the way matching is done today ? I think it would be very similar to how matching is done today. Again, however, trying to keep data structures to a minimum to shave latency off wherever we can.
Re: [OMPI devel] RFC: sm Latency
Richard Graham wrote: Re: [OMPI devel] RFC: sm Latency On 1/20/09 2:08 PM, "Eugene Loh" wrote: Richard Graham wrote: Re: [OMPI devel] RFC: sm Latency First, the performance improvements look really nice. A few questions: - How much of an abstraction violation does this introduce? Doesn't need to be much of an abstraction violation at all if, by that, we mean teaching the BTL about the match header. Just need to make some choices, and I flagged that one for better visibility. I really don't see how teaching the btl about matching will help much (it will save a subroutine call). As I understand the proposal, you aim to selectively pull items out of the fifo's -- this will break the fifo's, as they assume contiguous entries. Logic to manage holes will need to be added. No. It's still a FIFO. You look at the tail of the FIFO. If you can handle what you see there, you pop that item off and handle it. If you can't, you punt and return control to the ULP, who handles things the traditional (and heavier-weight) way. If the item of interest isn't at the tail, you won't see it. This looks like the btl needs to start "knowing" about MPI-level semantics. That's one option. There are other options. Such as? PML callback. Jeff's question about how much performance (if any) one loses with a callback is a good one. If I were less lazy (and had infinite time), I would have tested that before sending out the RFC. As it was, I wanted to see how much pushback there would be on the "abstraction violation" issue. Enough, it turns out, to try the experiment. I'll try to test it out and report back. If you replace the fifo's with a single linked list per process in shared memory, with senders to this process adding match envelopes atomically, and with each process reading its own linked list (multiple writers and a single reader in the non-threaded situation), there will be only one place to poll, regardless of the number of procs involved in the run.
*) Doesn't strike me as a "simple" change. Let me be clear that I can see many benefits to this approach and don't think it's prohibitively hard. So, I'm not trying to shoot this approach down entirely. I do have the proposed approach implemented, though, and it seems like a smaller change in behavior from what we have today, and many of the optimizations are unrelated to polling (and hence to the "single queue" proposal).
Re: [OMPI devel] RFC: sm Latency
Brian Barrett wrote: I unfortunately don't have time to look in depth at the patch. But my concern is that currently (today, not at some made-up time in the future, maybe), we use the BTLs for more than just MPI point-to-point. The rdma one-sided component (which was added for 1.3 and hopefully will be the default for 1.4) sends messages directly over the btls. It would be interesting to know how that is handled. I'm not sure I understand what you're saying. Does it help to point out that existing BTL routines don't change? The existing sendi is just a function that, if available, can be used, where appropriate, to send "immediately". Similarly for the proposed recvi. No existing BTL functionality is removed. Just new, optional functions are added for whoever wants to (and can) use them.