Re: [OMPI devel] RFC: sm Latency

2009-01-21 Thread Eugene Loh




Ron Brightwell wrote:

  
If you poll only the queue that corresponds to a posted receive, you only optimize micro-benchmarks, until they start using ANY_SOURCE.

  
  Note that the HPCC RandomAccess benchmark only uses MPI_ANY_SOURCE (and
MPI_ANY_TAG).
  

But HPCC RandomAccess also just uses non-blocking receives.  So, it's
kind of outside the scope of the original ideas here (bypassing the PML
receive-request data structure).

It's possibly not even a poster child for the single-queue idea
either.  Single queue probably shines best when you have to poll all
connections for a few messages.  In contrast, RandomAccess (I think)
loads all connections up randomly (pseudo-evenly).




Re: [OMPI devel] RFC: sm Latency

2009-01-21 Thread Eugene Loh

Patrick Geoffray wrote:


Eugene Loh wrote:

Possibly, you meant to ask how one does directed polling with a 
wildcard source MPI_ANY_SOURCE.  If that was your question, the 
answer is we punt.  We report failure to the ULP, which reverts to 
the standard code path.


Sorry, I meant ANY_SOURCE. If you poll only the queue that corresponds 
to a posted receive, you only optimize micro-benchmarks, until they 
start using ANY_SOURCE.


Right.

So, is recvi() a one-shot deal?  I.e., do you poll the right queue 
only once and, if that fails, fall back on polling all queues?


You poll it "some".  The BTL is granted some leeway in what 
"immediately" means.



If so, it's unobtrusive, but I don't think it would help much.


Well, check the RFC.  The data shows huge improvements in HPCC latency.

If you poll the right queue many times, then you have to decide when 
to fall back on polling all queues, and it's not trivial.


It's not 100% satisfactory, but clearly OMPI (and every other MPI 
implementation and just about any major piece of HPC software) is trying 
to guess among all sorts of trade-offs.  Many of those trade-offs are 
user tunable -- hence, those pages and pages of compiler options (pick your 
favorite compiler), build flags, MCA parameters, etc.


How do you ensure you check all incoming queues from time to time to 
avoid flow-control stalls (especially if the queues are small for scaling)?


There are a variety of choices here.  Further, I'm afraid we 
ultimately have to expose some of those choices to the user (MCA 
parameters or something).


In the vast majority of cases, users don't know how to turn the knobs.


Totally agree.  Exposing these choices to the users is ugly and 
expecting users to make such choices is ridiculous.  Though, for what 
it's worth:


% ompi_info -a | wc -l
1037
%

I actually agree with you a lot.  I do think that my RFC represents one 
step forward.  I'll see how quickly I can prototype and characterize a 
single-queue solution so we can judge alternatives more diligently.


Re: [OMPI devel] RFC: sm Latency

2009-01-21 Thread Ron Brightwell
> > Possibly, you meant to ask how one does directed polling with a wildcard 
> > source MPI_ANY_SOURCE.  If that was your question, the answer is we 
> > punt.  We report failure to the ULP, which reverts to the standard code 
> > path.
> 
> Sorry, I meant ANY_SOURCE. If you poll only the queue that corresponds to 
> a posted receive, you only optimize micro-benchmarks, until they start 
> using ANY_SOURCE.
> [...]

Note that the HPCC RandomAccess benchmark only uses MPI_ANY_SOURCE (and
MPI_ANY_TAG).

-Ron




Re: [OMPI devel] RFC: sm Latency

2009-01-21 Thread Patrick Geoffray

Eugene Loh wrote:
Possibly, you meant to ask how one does directed polling with a wildcard 
source MPI_ANY_SOURCE.  If that was your question, the answer is we 
punt.  We report failure to the ULP, which reverts to the standard code 
path.


Sorry, I meant ANY_SOURCE. If you poll only the queue that corresponds to 
a posted receive, you only optimize micro-benchmarks, until they start 
using ANY_SOURCE. So, is recvi() a one-shot deal?  I.e., do you poll 
the right queue only once and, if that fails, fall back on polling all 
queues?  If so, it's unobtrusive, but I don't think it would help much. 
If you poll the right queue many times, then you have to 
decide when to fall back on polling all queues, and it's not trivial.



How do you ensure you check all incoming queues from time to time to avoid 
flow-control stalls (especially if the queues are small for scaling)?
There are a variety of choices here.  Further, I'm afraid we ultimately 
have to expose some of those choices to the user (MCA parameters or 
something).


In the vast majority of cases, users don't know how to turn the knobs. 
The problem is that with local np going up, queue sizes will go down 
fast (square root), and you will have to poll all queues more often. 
Using more memory for queues just pushes the scalability wall a little 
bit further.


Let's say some congestion is starting to build on some internal OMPI 
resource.  Arguably, we should do something to start relieving that 
congestion.  What if the user code then posts a rather specific request 
(receive a message with a particular tag on a particular communicator 
from a particular source) with high urgency (a blocking request... "I 
ain't going anywhere until you give me what I'm asking for")?  A good 
servant would drop whatever else s/he is doing to oblige the boss.


If you poll only one queue, then stuff can pile up on another and a 
sender is now blocked. At best, you have a synchronization point. At 
worst, a deadlock.


So, let's say there's a standard MPI_Recv.  Let's say there's also some 
congestion starting to build.  What should the MPI implementation do?  


The MPI implementation cannot trust the user/app to indicate where the 
messages will come from. So, if you have N incoming queues, you need to 
poll them all eventually. If you do, polling time increases linearly. If 
you try to limit the polling space with whatever heuristic (like the 
queue corresponding to the current blocking receive), then you take the 
risk of not consuming another queue fast enough. And usually, the 
heuristics quickly fall apart (ANY_SOURCE, multiple asynchronous 
receives, etc.).


Really, only single-queue solves that.

Yes, and you could toss the receive-side optimizations as well.  So, one 
could say, "Our np=2 latency remains 2x slower than Scali's, but at 
least we no longer have that hideous scaling with large np."  Maybe 
that's where we want to end up.


I think all the optimizations except recvi() are fine and worth using. I am 
just saying that the recvi() optimization is dubious as it is, and that the 
single queue is potentially a bigger piece of low-hanging fruit on the 
recv side: it could still be fast (a spinlock or atomic operation to 
manage the shared receive queue), so np=2 latency stays low, and it would 
scale well with large np. No tuning needed, no special cases, smaller 
memory footprint.
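
To make that concrete, here is a minimal C sketch of one shared receive 
queue per process, with many on-node writers and a single reader.  All 
names (shm_envelope_t, shm_recv_queue_t, queue_push, queue_drain) are 
hypothetical illustrations, not OMPI code, and a real shared-memory 
version would use offsets rather than raw pointers:

/* Hypothetical sketch of one shared receive queue per process:
 * multiple on-node writers, a single reader.  Not OMPI code. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct shm_envelope {
    int                  src;      /* sending rank              */
    int                  tag;      /* MPI tag                   */
    int                  context;  /* communicator context id   */
    size_t               len;      /* payload length            */
    struct shm_envelope *next;     /* link in the shared list   */
} shm_envelope_t;

typedef struct {
    _Atomic(shm_envelope_t *) head;  /* pushed by senders, drained by owner */
} shm_recv_queue_t;

/* Sender side: lock-free push (any number of concurrent writers). */
static void queue_push(shm_recv_queue_t *q, shm_envelope_t *env)
{
    shm_envelope_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        env->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&q->head, &old, env,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Receiver side: the single reader detaches the whole list at once and
 * then matches at leisure -- one place to poll, no matter how many
 * on-node peers there are.  (This drains in LIFO order; restoring
 * per-sender FIFO order is bookkeeping that is elided here.) */
static shm_envelope_t *queue_drain(shm_recv_queue_t *q)
{
    return atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
}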


I will leave it at that; just some input.

Patrick


Re: [OMPI devel] Fortran 90 Interface

2009-01-21 Thread Jeff Squyres

Can you send the information listed here:

http://www.open-mpi.org/community/help/


On Jan 21, 2009, at 5:25 PM, David Robertson wrote:


Hello,

I'm having a problem with MPI_COMM_WORLD in Fortran 90. I have tried  
OpenMPI versions 1.2.6, 1.2.8, and 1.3, all compiled with the PGI  
8.0-2 suite. I've run the program in a debugger: with "USE mpi",  
MPI_COMM_WORLD returns 'Cannot find name "MPI_COMM_WORLD"'. If I use  
"include mpif.h", the results are a little better: MPI_COMM_WORLD  
returns 0 (the initial value assigned by mpif-common.h). The MPI  
functions don't seem to be affected by the fact that MPI_COMM_WORLD  
is unset or equal to 0. For example, the following works just fine:


CALL mpi_init (MyError)
CALL mpi_comm_rank (MPI_COMM_WORLD, MyRank, MyError)
CALL mpi_comm_size (MPI_COMM_WORLD, Nnodes, MyError)

even though, in the debugger, MPI_COMM_WORLD is unset or zero every  
step of the way. However, when I try to use MPI_COMM_WORLD in a  
non-MPI-standard function (NetCDF-4 in this case):



status=nf90_create_par(TRIM(ncname),                   &
     &                 OR(nf90_clobber, nf90_netcdf4), &
     &                 MPI_COMM_WORLD, info, ncid)

 I get the following error:

[daggoo:07640] *** An error occurred in MPI_Comm_dup
[daggoo:07640] *** on communicator MPI_COMM_WORLD
[daggoo:07640] *** MPI_ERR_COMM: invalid communicator
[daggoo:07640] *** MPI_ERRORS_ARE_FATAL (goodbye)

I have tried the exact same code compiled and run with MPICH2 (also PGI
8.0-2) and the problem does not occur.

If I have forgotten any details needed to debug this issue, please let
me know.

Thanks,
David Robertson



--
Jeff Squyres
Cisco Systems



[OMPI devel] Fortran 90 Interface

2009-01-21 Thread David Robertson

Hello,

I'm having a problem with MPI_COMM_WORLD in Fortran 90. I have tried 
OpenMPI versions 1.2.6, 1.2.8, and 1.3, all compiled with the PGI 
8.0-2 suite. I've run the program in a debugger: with "USE mpi", 
MPI_COMM_WORLD returns 'Cannot find name "MPI_COMM_WORLD"'. If I use 
"include mpif.h", the results are a little better: MPI_COMM_WORLD 
returns 0 (the initial value assigned by mpif-common.h). The MPI 
functions don't seem to be affected by the fact that MPI_COMM_WORLD is 
unset or equal to 0. For example, the following works just fine:


 CALL mpi_init (MyError)
 CALL mpi_comm_rank (MPI_COMM_WORLD, MyRank, MyError)
 CALL mpi_comm_size (MPI_COMM_WORLD, Nnodes, MyError)

even though, in the debugger, MPI_COMM_WORLD is unset or zero every step 
of the way. However, when I try to use MPI_COMM_WORLD in a 
non-MPI-standard function (NetCDF-4 in this case):


status=nf90_create_par(TRIM(ncname),                   &
     &                 OR(nf90_clobber, nf90_netcdf4), &
     &                 MPI_COMM_WORLD, info, ncid)

  I get the following error:

[daggoo:07640] *** An error occurred in MPI_Comm_dup
[daggoo:07640] *** on communicator MPI_COMM_WORLD
[daggoo:07640] *** MPI_ERR_COMM: invalid communicator
[daggoo:07640] *** MPI_ERRORS_ARE_FATAL (goodbye)

I have tried the exact same code compiled and run with MPICH2 (also PGI
8.0-2) and the problem does not occur.

If I have forgotten any details needed to debug this issue, please let
me know.

Thanks,
David Robertson


Re: [OMPI devel] VT problems on Debian

2009-01-21 Thread Paul H. Hargrove
Can't speak officially for the VT folks, but it looks like the following 
bits in ompi/vt/vt/acinclude.m4 need to list SPARC and Alpha (maybe 
ARM?) alongside MIPS as gettimeofday() platforms.  Alternatively 
(perhaps preferably) one should turn this around to explicitly list the 
platforms that *do* have cycle counter support (ppc64, ppc32, ia64, x86 
IIRC) rather than listing those that don't.


-Paul

        case $PLATFORM in
            linux)
                AC_DEFINE([TIMER_GETTIMEOFDAY], [1], [Use `gettimeofday' function])
                AC_DEFINE([TIMER_CLOCK_GETTIME], [2], [Use `clock_gettime' function])

                case $host_cpu in
                    mips*)
                        AC_DEFINE([TIMER], [TIMER_GETTIMEOFDAY], [Use timer (see below)])
                        AC_MSG_NOTICE([selected timer: TIMER_GETTIMEOFDAY])
                        ;;
                    *)
                        AC_DEFINE([TIMER_CYCLE_COUNTER], [3], [Cycle counter (e.g. TSC)])
                        AC_DEFINE([TIMER], [TIMER_CYCLE_COUNTER], [Use timer (see below)])
                        AC_MSG_NOTICE([selected timer: TIMER_CYCLE_COUNTER])
                        ;;
                esac
                ;;


Jeff Squyres wrote:
The Debian OMPI maintainers raised a few failures on some of their 
architectures to my attention -- it looks like there's some wonkiness 
on Debian on SPARC and Alpha -- scroll to the bottom of these two pages:


http://buildd.debian.org/fetch.cgi?&pkg=openmpi&ver=1.3-1&arch=sparc&stamp=1232513504&file=log 

http://buildd.debian.org/fetch.cgi?&pkg=openmpi&ver=1.3-1&arch=alpha&stamp=1232510796&file=log 



They both seem to incur the same error:

gcc -DHAVE_CONFIG_H -I. 
-I../../../../../../../ompi/contrib/vt/vt/vtlib -I.. 
-I../../../../../../../ompi/contrib/vt/vt/tools/opari/lib 
-I../../../../../../../ompi/contrib/vt/vt/extlib/otf/otflib 
-I../extlib/otf/otflib -I../../../../../../../ompi/contrib/vt/vt  
-D_GNU_SOURCE -DBINDIR=\"/usr/bin\" 
-DDATADIR=\"/usr/share/vampirtrace\" -DRFG  -DVT_MEMHOOK -DVT_IOWRAP  
-Wall -g -O2 -MT vt_pform_linux.o -MD -MP -MF .deps/vt_pform_linux.Tpo 
-c -o vt_pform_linux.o 
../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c
../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c: In 
function 'vt_pform_wtime':
../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c:179: 
error: impossible constraint in 'asm'

make[6]: *** [vt_pform_linux.o] Error 1
make[6]: Leaving directory 
`/build/buildd/openmpi-1.3/build/shared/ompi/contrib/vt/vt/vtlib'


VT guys -- any ideas?




--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory 



Re: [OMPI devel] RFC: Use of ompi_proc_t flags field

2009-01-21 Thread Ralph Castain
Appropriate mapper components will be used, along with a file  
describing which nodes are in which CU etc. So it won't be so much a  
matter of discovery as pre-knowledge.



On Jan 21, 2009, at 12:02 PM, Jeff Squyres wrote:


Sounds reasonable.  How do you plan to discover this information?

On Jan 21, 2009, at 9:58 AM, Ralph Castain wrote:

What: Extend the current use of the ompi_proc_t flags field  
(without changing the field itself)


Why: Provide a more atomistic sense of locality to support new  
collective/BTL components


Where: Add macros to define and check the various flag fields in  
ompi/proc.h. Revise the orte_ess.proc_is_local API to return a  
uint8_t instead of bool.


When: For OMPI v1.4

Timeout: COB Fri, Feb 6, 2009



The current ompi_proc_t structure has a uint8_t flags field in it.  
Only one bit of this field is currently used to flag that a proc is  
"local". In the current context, "local" is constrained to mean  
"local to this node".


New collectives and BTL components under development by LANL (in  
partnership with others) require a greater degree of granularity on  
the term "local". For our work, we need to know if the proc is on  
the same socket, PC board, node, switch, and CU (computing unit).  
We therefore propose to define some of the unused bits to flag  
these "local" conditions. This will not extend the field's size,  
nor impact any other current use of the field.


Our intent is to add #define's to designate which bits stand for  
which local condition. To make it easier to use, we will add a set  
of macros that test the specific bit - e.g.,  
OMPI_PROC_ON_LOCAL_SOCKET. These can be used in the code base to  
clearly indicate which sense of locality is being considered.


We would also modify the orte_ess modules so that each returns a  
uint8_t (to match the ompi_proc_t field) that contains a complete  
description of the locality of this proc. Obviously, not all  
environments will be capable of providing such detailed info. Thus,  
getting a "false" from a test for "on_local_socket" may simply  
indicate a lack of knowledge. This is acceptable for our purposes  
as the algorithm will simply perform sub-optimally, but will still  
work.


Please feel free to comment and/or request more information.
Ralph




--
Jeff Squyres
Cisco Systems





Re: [OMPI devel] RFC: Use of ompi_proc_t flags field

2009-01-21 Thread Jeff Squyres

Sounds reasonable.  How do you plan to discover this information?

On Jan 21, 2009, at 9:58 AM, Ralph Castain wrote:

What: Extend the current use of the ompi_proc_t flags field (without  
changing the field itself)


Why: Provide a more atomistic sense of locality to support new  
collective/BTL components


Where: Add macros to define and check the various flag fields in  
ompi/proc.h. Revise the orte_ess.proc_is_local API to return a  
uint8_t instead of bool.


When: For OMPI v1.4

Timeout: COB Fri, Feb 6, 2009



The current ompi_proc_t structure has a uint8_t flags field in it.  
Only one bit of this field is currently used to flag that a proc is  
"local". In the current context, "local" is constrained to mean  
"local to this node".


New collectives and BTL components under development by LANL (in  
partnership with others) require a greater degree of granularity on  
the term "local". For our work, we need to know if the proc is on  
the same socket, PC board, node, switch, and CU (computing unit). We  
therefore propose to define some of the unused bits to flag these  
"local" conditions. This will not extend the field's size, nor  
impact any other current use of the field.


Our intent is to add #define's to designate which bits stand for  
which local condition. To make it easier to use, we will add a set  
of macros that test the specific bit - e.g.,  
OMPI_PROC_ON_LOCAL_SOCKET. These can be used in the code base to  
clearly indicate which sense of locality is being considered.


We would also modify the orte_ess modules so that each returns a  
uint8_t (to match the ompi_proc_t field) that contains a complete  
description of the locality of this proc. Obviously, not all  
environments will be capable of providing such detailed info. Thus,  
getting a "false" from a test for "on_local_socket" may simply  
indicate a lack of knowledge. This is acceptable for our purposes as  
the algorithm will simply perform sub-optimally, but will still work.


Please feel free to comment and/or request more information.
Ralph




--
Jeff Squyres
Cisco Systems



[OMPI devel] VT problems on Debian

2009-01-21 Thread Jeff Squyres
The Debian OMPI maintainers raised a few failures on some of their  
architectures to my attention -- it looks like there's some wonkiness  
on Debian on SPARC and Alpha -- scroll to the bottom of these two pages:


http://buildd.debian.org/fetch.cgi?&pkg=openmpi&ver=1.3-1&arch=sparc&stamp=1232513504&file=log
http://buildd.debian.org/fetch.cgi?&pkg=openmpi&ver=1.3-1&arch=alpha&stamp=1232510796&file=log

They both seem to incur the same error:

gcc -DHAVE_CONFIG_H -I. -I../../../../../../../ompi/contrib/vt/vt/vtlib -I..
-I../../../../../../../ompi/contrib/vt/vt/tools/opari/lib
-I../../../../../../../ompi/contrib/vt/vt/extlib/otf/otflib
-I../extlib/otf/otflib -I../../../../../../../ompi/contrib/vt/vt
-D_GNU_SOURCE -DBINDIR=\"/usr/bin\" -DDATADIR=\"/usr/share/vampirtrace\"
-DRFG -DVT_MEMHOOK -DVT_IOWRAP -Wall -g -O2 -MT vt_pform_linux.o -MD -MP
-MF .deps/vt_pform_linux.Tpo -c -o vt_pform_linux.o
../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c
../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c: In
function 'vt_pform_wtime':
../../../../../../../ompi/contrib/vt/vt/vtlib/vt_pform_linux.c:179:
error: impossible constraint in 'asm'

make[6]: *** [vt_pform_linux.o] Error 1
make[6]: Leaving directory
`/build/buildd/openmpi-1.3/build/shared/ompi/contrib/vt/vt/vtlib'


VT guys -- any ideas?

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: sm Latency

2009-01-21 Thread Eugene Loh




Patrick Geoffray wrote:

  Eugene Loh wrote:
  
  
To recap:
1) The work is already done.

  
  How do you do "directed polling" with ANY_TAG ?

Not sure I understand the question.  So, maybe we start by being
explicit about what we mean by "directed polling".

Currently, the sm BTL has connection-based FIFOs.  That is, for each
on-node sender/receiver (directed) pair, there is a FIFO.  For a
receiver to receive messages, it needs to check its in-bound FIFOs.  It
can check all in-bound FIFOs all the time to discover messages.  By
"directed polling", I mean that if the user posts a receive from a
specified source, we poll only the FIFO on which that message is
expected.

With that in mind, let's go back to your question.  If a user posts a
receive with a specified source but a wildcard tag, we go to the
specified FIFO.  We check the item on the FIFO's tail.  We check if
this item is the one we're looking for.  The "ANY_TAG" comes into play
only here, on the matching.  It's unrelated to "directed polling",
which has to do only with the source process.

Possibly, you meant to ask how one does directed polling with a
wildcard source MPI_ANY_SOURCE.  If that was your question, the answer
is we punt.  We report failure to the ULP, which reverts to the
standard code path.
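
Purely as an illustration of what that source-to-FIFO decision and the
ANY_SOURCE punt amount to (the names below are made up, not the actual
sm BTL data structures, and treating the local peer index as the FIFO
index is a simplification):

/* Hypothetical sketch of choosing a FIFO for "directed polling".
 * There is one in-bound FIFO per on-node sender; a receive posted with
 * a specific source tells us exactly which FIFO to watch, while
 * MPI_ANY_SOURCE forces a punt to the standard code path. */
#include <mpi.h>

#define DIRECTED_POLL_PUNT  (-1)   /* caller reverts to the usual path */

static inline int choose_fifo(int posted_src, int n_local_peers)
{
    if (posted_src == MPI_ANY_SOURCE) {
        return DIRECTED_POLL_PUNT;        /* no way to direct the poll */
    }
    if (posted_src < 0 || posted_src >= n_local_peers) {
        return DIRECTED_POLL_PUNT;        /* off-node or unknown peer  */
    }
    return posted_src;                    /* poll only this one FIFO   */
}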

One alternative is, of course, the single receiver queue.  I agree that
that alternative has many merits.  To recap, however, the proposed
optimizations are already "in the bag" (implemented in a workspace) and
address some optimizations that are orthogonal to the "directed
polling" (and single receiver queue) approach.  I think there are also
some uncertainties about the single recv queue approach, but I guess
I'll just have to prototype that alternative to explore those
uncertainties.

How do you ensure you check all incoming queues from time to time to avoid flow-control stalls (especially if the queues are small for scaling)?

There are a variety of choices here.  Further, I'm afraid we ultimately
have to expose some of those choices to the user (MCA parameters or
something).

Let's say some congestion is starting to build on some internal OMPI
resource.  Arguably, we should do something to start relieving that
congestion.  What if the user code then posts a rather specific request
(receive a message with a particular tag on a particular communicator
from a particular source) with high urgency (a blocking request... "I
ain't going anywhere until you give me what I'm asking for")?  A good
servant would drop whatever else s/he is doing to oblige the boss.

So, let's say there's a standard MPI_Recv.  Let's say there's also some
congestion starting to build.  What should the MPI implementation do? 
Alternatives include:
A) If the receive can be completed "immediately", then do so and return
control to the user as soon as possible.
B) If the receive cannot be completed "immediately", fill your wait
time with general housekeeping like relieving congested resources.
C) Figure out what's on the critical path and do it.

At least A should be available for the user.  Probably also B, and the
RFC proposal allows for that by rolling over to the traditional code
path when the request cannot be satisfied "immediately".  (That said,
there are different definitions of "immediately" and different ways of
implementing all this.)

The definitions I've used for "immediately" include:
*) We know which FIFO to check.
*) The message is the next item on that FIFO.
*) The message is being delivered entirely in one chunk.

I am also going to add a time-out.

One could also mix a little bit of general polling in. 
Unfortunately, there is no end to all the artful tuning one could do.
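
For illustration only, here is a sketch of how those "immediately"
conditions plus a bounded poll count might look.  frag_hdr_t,
fifo_peek_tail(), and fifo_pop_tail() are assumed helpers, not the real
BTL interface; on any failure the caller falls back to the traditional
path:

/* Hypothetical recvi()-style attempt: complete the receive "immediately"
 * or report failure so the ULP falls back to the traditional path. */
#include <mpi.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    int         src, tag, context;
    size_t      len;          /* bytes carried by this fragment          */
    bool        single_chunk; /* whole message delivered in one fragment */
    const void *payload;
} frag_hdr_t;

extern frag_hdr_t *fifo_peek_tail(int fifo_idx);   /* assumed helper */
extern void        fifo_pop_tail(int fifo_idx);    /* assumed helper */

static bool try_recv_immediate(int fifo_idx, int want_tag, int want_ctx,
                               void *buf, size_t buflen, int max_polls)
{
    for (int i = 0; i < max_polls; ++i) {          /* crude "time-out"    */
        frag_hdr_t *h = fifo_peek_tail(fifo_idx);  /* we know which FIFO  */
        if (h == NULL) {
            continue;                              /* nothing posted yet  */
        }
        /* The next item must be "the one" (the FIFO is per-source). */
        if (h->context != want_ctx ||
            (want_tag != MPI_ANY_TAG && h->tag != want_tag)) {
            return false;                          /* wrong message: punt */
        }
        /* ...and delivered entirely in one chunk. */
        if (!h->single_chunk || h->len > buflen) {
            return false;                          /* multi-chunk: punt   */
        }
        memcpy(buf, h->payload, h->len);           /* complete it now     */
        fifo_pop_tail(fifo_idx);
        return true;
    }
    return false;                                  /* give up: punt       */
}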

  What about the one-sided that Brian mentioned where there is no corresponding receive to tell you which queue to poll ?
  

I appreciate Jeff's explanation, but I still don't understand this
100%.  The receive side looks to see if it can handle the request
"immediately".  It checks to see if the next item on the specified FIFO
is "the one".  If it is, it completes the request.  If not, it returns
control to the ULP, who rolls over to the traditional code path.

I don't 100% know how to handle the concern you/Brian raise, but I have
the PML passing the flag MCA_PML_OB1_HDR_TYPE_MATCH into the BTL,
saying "this is the kind of message to look for".  Does this address
the concern?  The intent is that if it encounters something it doesn't
know how to handle, it reverts to the traditional receive code path.
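
As a sketch of that intent (the enum values below are illustrative
stand-ins, not the real ob1/BTL constants):

/* Hypothetical header-type gate for the immediate-receive path: only
 * the kind of fragment the PML asked for (a short, fully matchable
 * send) is consumed here; rendezvous, one-sided rdma/osd, and control
 * traffic are left on the FIFO for the traditional progress path. */
typedef enum {
    HDR_TYPE_MATCH,   /* short eager send, matchable right here      */
    HDR_TYPE_RNDV,    /* rendezvous: needs the full PML machinery    */
    HDR_TYPE_OSD,     /* one-sided component traffic                 */
    HDR_TYPE_CTRL     /* acks and other control fragments            */
} hdr_type_t;

static int immediate_path_may_consume(hdr_type_t seen, hdr_type_t expected)
{
    /* Consume only an exact match of the expected, matchable type. */
    return seen == expected && seen == HDR_TYPE_MATCH;
}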

  If you want to handle all the constraints, a single-queue model is much less work in the end, IMHO.
  

Again, important speedups appear to be achievable if one bypasses the
PML receive-request data structure.  So, we're talking about
optimizations that are orthogonal to the single-queue issue.

  
2) The single-queue model addresses only one of the RFC's issues.

  
  The single-queue model addresses not only the latency overhead when
scaling, but also the explo

[OMPI devel] RFC: Use of ompi_proc_t flags field

2009-01-21 Thread Ralph Castain
What: Extend the current use of the ompi_proc_t flags field (without  
changing the field itself)


Why: Provide a more atomistic sense of locality to support new  
collective/BTL components


Where: Add macros to define and check the various flag fields in ompi/ 
proc.h. Revise the orte_ess.proc_is_local API to return a uint8_t  
instead of bool.


When: For OMPI v1.4

Timeout: COB Fri, Feb 6, 2009



The current ompi_proc_t structure has a uint8_t flags field in it.  
Only one bit of this field is currently used to flag that a proc is  
"local". In the current context, "local" is constrained to mean "local  
to this node".


New collectives and BTL components under development by LANL (in  
partnership with others) require a greater degree of granularity on  
the term "local". For our work, we need to know if the proc is on the  
same socket, PC board, node, switch, and CU (computing unit). We  
therefore propose to define some of the unused bits to flag these  
"local" conditions. This will not extend the field's size, nor impact  
any other current use of the field.


Our intent is to add #define's to designate which bits stand for which  
local condition. To make it easier to use, we will add a set of macros  
that test the specific bit - e.g., OMPI_PROC_ON_LOCAL_SOCKET. These  
can be used in the code base to clearly indicate which sense of  
locality is being considered.


We would also modify the orte_ess modules so that each returns a  
uint8_t (to match the ompi_proc_t field) that contains a complete  
description of the locality of this proc. Obviously, not all  
environments will be capable of providing such detailed info. Thus,  
getting a "false" from a test for "on_local_socket" may simply  
indicate a lack of knowledge. This is acceptable for our purposes as  
the algorithm will simply perform sub-optimally, but will still work.
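
To make the proposal a bit more concrete, here is an illustrative sketch
of what the bit definitions and test macros might look like; the bit
assignments and all names except OMPI_PROC_ON_LOCAL_SOCKET are guesses
for illustration, not committed code:

/* Illustrative sketch only: possible layout of locality bits in the
 * existing uint8_t flags field, plus test macros.  A clear bit may
 * simply mean "unknown", so false only means "cannot rely on it". */
#include <stdint.h>

#define OMPI_PROC_FLAG_LOCAL_NODE    0x01u  /* existing "local to this node" bit */
#define OMPI_PROC_FLAG_LOCAL_SOCKET  0x02u  /* same socket         */
#define OMPI_PROC_FLAG_LOCAL_BOARD   0x04u  /* same PC board       */
#define OMPI_PROC_FLAG_LOCAL_SWITCH  0x08u  /* same switch         */
#define OMPI_PROC_FLAG_LOCAL_CU      0x10u  /* same computing unit */

#define OMPI_PROC_ON_LOCAL_SOCKET(f) (((f) & OMPI_PROC_FLAG_LOCAL_SOCKET) != 0)
#define OMPI_PROC_ON_LOCAL_NODE(f)   (((f) & OMPI_PROC_FLAG_LOCAL_NODE)   != 0)
#define OMPI_PROC_ON_LOCAL_CU(f)     (((f) & OMPI_PROC_FLAG_LOCAL_CU)     != 0)

/* The orte_ess module would return the whole bit field at once instead
 * of a single bool (hypothetical helper, not the real API). */
static inline uint8_t example_locality(int same_socket, int same_node, int same_cu)
{
    uint8_t flags = 0;
    if (same_socket) flags |= OMPI_PROC_FLAG_LOCAL_SOCKET | OMPI_PROC_FLAG_LOCAL_NODE;
    if (same_node)   flags |= OMPI_PROC_FLAG_LOCAL_NODE;
    if (same_cu)     flags |= OMPI_PROC_FLAG_LOCAL_CU;
    return flags;
}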


Please feel free to comment and/or request more information.
Ralph



Re: [OMPI devel] RFC: sm Latency

2009-01-21 Thread Jeff Squyres
Brian is referring to the "rdma" one-sided component (OMPI osd  
framework) that directly invokes the BTL functions (vs. using the PML  
send/receive functions).  The osd matching is quite different from  
pt2pt matching.


His concern is that that model continues to work -- e.g., if the rdma  
osd component sends a message through a BTL that the other side not  
try to interpret and match it as a pt2pt message.  Hence, the BTL  
would need to learn some new things; e.g., that it can match some  
(pml) messages but not all (rdma/osd), or perhaps it would need to  
learn about rdma/osd matching as well, or ...(something else)...


IIRC, rdma/osd is the only other non-PML component that sends directly  
through the BTLs today.  But that may change; I know that there are  
some who are working on various optimizations that may use the BTLs  
underneath (I don't want to cite them on a public list; this is  
unpublished research work at this point).



On Jan 21, 2009, at 1:22 AM, Eugene Loh wrote:


Brian Barrett wrote:

I unfortunately don't have time to look in depth at the patch.  But  
my  concern is that currently (today, not at some made up time in  
the  future, maybe), we use the BTLs for more than just MPI point- 
to- point.  The rdma one-sided component (which was added for 1.3  
and  hopefully will be the default for 1.4) sends messages directly  
over  the btls.  It would be interesting to know how that is handled.


I'm not sure I understand what you're saying.

Does it help to point out that existing BTL routines don't change?   
The existing sendi is just a function that, if available, can be  
used, where appropriate, to send "immediately".  Similarly for the  
proposed recvi.  No existing BTL functionality is removed.  Just  
new, optional functions added for whoever wants to (and can) use them.




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: sm Latency

2009-01-21 Thread Eugene Loh




Richard Graham wrote:

  On 1/20/09 8:53 PM, "Jeff Squyres"  wrote:
  
  
Eugene: you mentioned that there are other possibilities to having the
BTL understand match headers, such as a callback into the PML.  Have
you tried this approach to see what the performance cost would be,
perchance?

  
  How is this different from the way matching is done today ?
  

I think it would be very similar to how matching is done today.  Again,
however, trying to keep data structures to a minimum to shave latency
off wherever we can.




Re: [OMPI devel] RFC: sm Latency

2009-01-21 Thread Eugene Loh




Richard Graham wrote:

  On 1/20/09 2:08 PM, "Eugene Loh"  wrote:
  
  Richard Graham wrote: 

 First, the performance improvements look really nice.
A few questions:
  - How much of an abstraction violation does this introduce?
  
Doesn't need to be much of an abstraction
violation at all if, by that, we mean teaching the BTL about the match
header.  Just need to make some choices and I flagged that one for
better visibility.

>> I really don't see how teaching the btl about matching will help
>> much (it will save a subroutine call).  As I understand the proposal,
>> you aim to selectively pull items out of the fifo's -- this will
>> break the fifo's, as they assume contiguous entries.  Logic to
>> manage holes will need to be added.


No.  It's still a FIFO.  You look at the tail of the FIFO.  If you can
handle what you see there, you pop that item off and handle it.  If you
can't, you punt and return control to the ULP, who handles things the
traditional (and heavier-weight) method.  If the item of interest isn't
at the tail, you won't see it.

  This looks like the btl needs to start
"knowing" about MPI-level semantics.

That's one option.  There are other options.

>> Such as ?


PML callback.  Jeff's question about how much performance (if any) one
loses with a callback is a good one.  If I were less lazy (and had
infinite time), I would have tested that before sending out the RFC. 
As it was, I wanted to see how much pushback there would be on the
"abstraction violation" issue.  Enough, it turns out, to try the
experiment.  I'll try to test it out and report back.

  
  If you replace the fifo's with a single linked
list per process in shared memory, with senders to this process adding
match envelopes atomically and each process reading its own linked list
(multiple writers and a single reader in the non-threaded situation),
there will be only one place to poll, regardless of the number of procs
involved in the run.

*) Doesn't strike me as a "simple" change.


Let me be clear that I can see many benefits to this approach and don't
think it's prohibitively hard.  So, I'm not trying to shoot this
approach down entirely.  I do have the proposed approach implemented,
though, and it seems like a smaller change in behavior from what we
have today, and many of the optimizations are unrelated to polling (and
hence to the "single queue" proposal).




Re: [OMPI devel] RFC: sm Latency

2009-01-21 Thread Eugene Loh

Brian Barrett wrote:

I unfortunately don't have time to look in depth at the patch.  But 
my  concern is that currently (today, not at some made up time in the  
future, maybe), we use the BTLs for more than just MPI point-to- 
point.  The rdma one-sided component (which was added for 1.3 and  
hopefully will be the default for 1.4) sends messages directly over  
the btls.  It would be interesting to know how that is handled.


I'm not sure I understand what you're saying.

Does it help to point out that existing BTL routines don't change?  The 
existing sendi is just a function that, if available, can be used, where 
appropriate, to send "immediately".  Similarly for the proposed recvi.  
No existing BTL functionality is removed.  Just new, optional functions 
added for whoever wants to (and can) use them.