Re: [OMPI devel] [RFC] Hierarchical Topology

2010-11-16 Thread Sylvain Jeaugey

On Mon, 15 Nov 2010, Ralph Castain wrote:


Guess I am a little confused. Every MPI process already has full knowledge
of what node all other processes are located on - this has been true for
quite a long time.

Ok, I didn't see that.


Once my work is complete, mpirun will have full knowledge of each node's
hardware resources. Terry will then use that in mpirun's mappers. The
resulting launch message will contain a full mapping of procs to cores -
i.e., every daemon will know the core placement of every process in the job.
That info will be passed down to each MPI proc. Thus, upon launch, every MPI
process will know not only the node for each process, but also the hardware
resources of that node, and the bindings of every process in the job to that
hardware.

All right.

Some things bug me, however:
 1. What if the placement has been done by a wrapper script or by the
resource manager? I.e., how do you know where the MPI procs are located?
 2. How scalable is it? I would think there is an allgather with 1 process
per node; am I right?
 3. How is that information represented? As a graph?


So the only thing missing is the switch topology of the cluster (the
inter-node topology). We modified carto a while back to support input of
switch topology information, though I'm not sure how many people ever used
that capability - not much value in it so far. We just set it up so that
people could describe the topology, and then let carto compute hop distance.

Ok. I didn't know we also had some work on switches in carto.


HTH

This helps!

So, I'm now wondering whether the two efforts, which seem similar, are really 
redundant. We thought about this before starting hitopo, and since a graph 
didn't fit our needs, we started working towards computing an address. 
Perhaps hitopo addresses could be computed using hwloc's graph.


I understand that for sm optimization, hwloc is richer. The only thing 
that bugs me is how much time it takes to figure out what locality I 
have between process A and B. The great thing about hitopo is that a single 
comparison can give you a property of two processes (e.g., that they are on 
the same socket).
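
To illustrate what I mean by a single comparison, here is a minimal sketch;
the address layout and the names below (hitopo_addr_t, hitopo_same_domain)
are hypothetical placeholders, not the actual hitopo API:

/* Hypothetical hitopo-style hierarchical address: one integer per level,
 * ordered from the outermost level (e.g. switch) down to the hyperthread.
 * Names and layout are illustrative only. */
#include <stdbool.h>
#include <string.h>

#define HITOPO_NLEVELS 7
enum { LVL_SWITCH = 0, LVL_NODE = 2, LVL_SOCKET = 4, LVL_CORE = 5 };

typedef struct {
    int level[HITOPO_NLEVELS];
} hitopo_addr_t;

/* Two processes share a locality domain iff their addresses are equal
 * down to (and including) that level: a single prefix comparison. */
static bool hitopo_same_domain(const hitopo_addr_t *a,
                               const hitopo_addr_t *b, int depth)
{
    return 0 == memcmp(a->level, b->level, (depth + 1) * sizeof(int));
}

/* e.g. hitopo_same_domain(&addr_a, &addr_b, LVL_SOCKET) tells you
 * whether procs A and B sit on the same socket. */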


Anyway, I just wanted to present hitopo in case someone needs it. And 
I think hitopo's preferred domain remains collectives, where you do not 
really need distances, but rather groups that share a certain locality.


Sylvain


On Mon, Nov 15, 2010 at 9:00 AM, Sylvain Jeaugey wrote:


I already mentioned it when answering Terry's e-mail, but to be sure I'm clear:
don't confuse the full node topology with the MPI job topology. It _is_ different.

Also, in hitopo every process does not get the whole topology, only its own,
which should not cause storms.


On Mon, 15 Nov 2010, Ralph Castain wrote:

 I think the two efforts (the paffinity one and this one) do overlap somewhat.

I've been writing the local topology discovery code for Jeff, Terry, and
Josh - it uses hwloc (or any other method - it's a framework) to discover
what hardware resources are available on each node in the job so that the
info can be used in mapping the procs.

As part of that work, we are passing down to the MPI processes the local
hardware topology. This is done because of prior complaints when we had
each MPI process discover that info for itself - it creates a bit of a
"storm" on the node on large SMPs.

Note that what I've written (still to be completed before coming over)
doesn't tell the proc what cores/HTs it is bound to - that's the part
Terry et al. are adding. Nor were we discovering the switch topology of
the cluster.

So there is a little overlap that could be resolved. And a concern on my
part: we have previously introduced capabilities that had every MPI process
read local system files to get the node topology, and gotten user complaints
about it. We probably shouldn't go back to that practice.

Ralph


On Mon, Nov 15, 2010 at 8:15 AM, Terry Dontje wrote:


  A few comments:


1.  Have you guys considered using hwloc for level 4-7 detection?
2.  Is L2 related to L2 cache?  If not, then is there some other term you
could use?
3.  What do you see if the process is bound to multiple cores/hyperthreads?
4.  What do you see if the process is not bound to any level 4-7 items?
5.  What about L1 and L2 cache locality as some levels? (hwloc exposes
these, but they are also at different depths depending on the platform.)

Note that I am working with Jeff Squyres and Josh Hursey on some new paffinity
code that uses hwloc.  Though the paffinity code may not have a direct
relationship to hitopo, the use of hwloc and standardization of what you call
levels 4-7 might help avoid some user confusion.

--td


On 11/15/2010 06:56 AM, Sylvain Jeaugey wrote:

As a follow-up to the Stuttgart developers' meeting, here is an RFC for our
topology detection framework.

WHAT: Add a framework for hardware topology detection to be used by any
other part of Open MPI to help optimization.

WHY: Collective operations or shared memory algorithms, among others, may
have optimizations depending

Re: [OMPI devel] [RFC] Hierarchical Topology

2010-11-16 Thread Ralph Castain
On Tue, Nov 16, 2010 at 1:23 AM, Sylvain Jeaugey wrote:

> On Mon, 15 Nov 2010, Ralph Castain wrote:
>
>  Guess I am a little confused. Every MPI process already has full knowledge
>> of what node all other processes are located on - this has been true for
>> quite a long time.
>>
> Ok, I didn't see that.



It's in the ess. There are two relevant APIs there:

1. proc_get_locality tells you the relative locality of the specified proc.
It returns a bit mask that you can test with the defined values in
opal/mca/paffinity/paffinity.h - e.g., OPAL_PROC_ON_SOCKET.

2. proc_get_nodename returns the name of the node where that proc is
located.

Both of these APIs are called by various parts of OMPI - e.g., to initialize
the OMPI proc structs and setup shared memory.
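
As a rough illustration, testing a peer's locality might look like the sketch
below; the field names and signatures are from memory of the current headers
and may not match them precisely:

/* Sketch only: signatures approximate orte/mca/ess/ess.h and
 * opal/mca/paffinity/paffinity.h as I remember them. */
#include "orte/mca/ess/ess.h"
#include "opal/mca/paffinity/paffinity.h"
#include "ompi/proc/proc.h"

static void check_peer_locality(ompi_proc_t *peer)
{
    uint8_t locality = orte_ess.proc_get_locality(&peer->proc_name);
    char *node = orte_ess.proc_get_nodename(&peer->proc_name);

    if (locality & OPAL_PROC_ON_SOCKET) {
        /* peer shares a socket with us */
    } else if (locality & OPAL_PROC_ON_NODE) {
        /* peer is at least on the same node (e.g. reachable via sm) */
    }
    (void)node;   /* the peer's hostname, e.g. for debug output */
}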


>
>  Once my work is complete, mpirun will have full knowledge of each node's
>> hardware resources. Terry will then use that in mpirun's mappers. The
>> resulting launch message will contain a full mapping of procs to cores -
>> i.e., every daemon will know the core placement of every process in the
>> job.
>> That info will be passed down to each MPI proc. Thus, upon launch, every
>> MPI
>> process will know not only the node for each process, but also the
>> hardware
>> resources of that node, and the bindings of every process in the job to
>> that
>> hardware.
>>
> All right.
>
> Some things bug me, however:
>  1. What if the placement has been done by a wrapper script or by the
> resource manager? I.e., how do you know where the MPI procs are located?
>  2. How scalable is it? I would think there is an allgather with 1 process
> per node; am I right?
>  3. How is that information represented? As a graph?


There are two scenarios to consider. When we launch by daemons, each daemon
already uses a collective operation to send back the local node topology
info - all we are doing is adding some deeper levels to the existing
operation as hwloc provides more info than our current sysinfo framework
components. We are then changing the ordering of the operations during
launch - in this mode (i.e., mapping based on topology), we launch daemons
on all nodes in the allocation, and then do the mapping. So once the daemon
collective returns the topology info, we map the procs, construct the launch
msg, and then use the grpcomm collective operation to send that msg to all
daemons. All we are doing is adding the topology and detailed mapping
(bindings, in particular) to that launch msg.

When we launch directly (e.g., launching the apps by srun instead of using
mpirun), the apps use the hierarchical grpcomm during orte_init to perform
their initial modex. This is a collective operation that uses the same basic
algos currently included in the MPI collective layer (i.e., all local ranks > 0
send to the local_rank=0 proc, that proc engages in a collective with all
other local_rank=0 procs, and then distributes the results locally). As part
of the exchanged info, we already include the nodename. My intent was to (a)
have the local_rank=0 procs do the local node topology discovery and include
that info in the modex, and (b) have each proc include its affinity mask in
the info. So at the end of the modex, everyone has the full info.

Bottom line here is that we are not adding any communications to the
existing system. We are simply adding the topology info to the existing
startup mechanisms. Thus, we can accomplish the exchange of topology info
within the current communications.

The data is currently represented in a simple array. You call the orte ess
APIs to extract it, as per above. If it was helpful, we can always construct
a graph or some other representation from the data.



>
>  So the only thing missing is the switch topology of the cluster (the
>> inter-node topology). We modified carto a while back to support input of
>> switch topology information, though I'm not sure how many people ever used
>> that capability - not much value in it so far. We just set it up so that
>> people could describe the topology, and then let carto compute hop
>> distance.
>>
> Ok. I didn't know we also had some work on switches in carto.
>
>  HTH
>>
> This helps !
>
> So, I'm now wondering whether the two efforts, which seem similar, are really
> redundant. We thought about this before starting hitopo, and since a graph
> didn't fit our needs, we started working towards computing an address. Perhaps
> hitopo addresses could be computed using hwloc's graph.
>


It would seem that hitopo duplicates some existing functionality that you
may not have realized exists. Some of the new functionality appears
redundant, but I personally would be concerned that hitopo introduces
additional communications instead of piggybacking on the existing operations
such as modex and the launch msg. Some of that may be caused by wanting to
include interface info via tapping into the BTLs, which would require doing
it from the MPI layer. However, that info could still be shared in the
existing modex (thus avoiding

[OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-16 Thread Sébastien Boisvert
Dear awesome community,


Over the last months, I closely followed the evolution of bug 2043,
entitled 'sm BTL hang with GCC 4.4.x'.

https://svn.open-mpi.org/trac/ompi/ticket/2043

The reason is that I am developing MPI-based software, and I use
Open MPI as it is the only implementation I am aware of that sends
messages eagerly (a powerful feature, that is).

http://denovoassembler.sourceforge.net/

I believe that this very pesky bug remains in Open MPI 1.4.3, and
enclosed with this communication are scientific proofs of my claim, or at
least I think they are ;).


Each byte transfer layer has a default limit below which it sends a
message eagerly. With shared memory (sm), the value is 4096 bytes. At least
it is according to ompi_info.


To verify this limit, I implemented a very simple test. The source code
is test4096.cpp, which basically just sends a single message of 4096
bytes from one rank to another (rank 1 to 0).

The test was conclusive: the limit is 4096 bytes (see
mpirun-np-2-Simple.txt).
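
A minimal sketch of such a check, assuming the sm eager limit of 4096 bytes
(this is not the attached test4096.cpp, just an illustration of the idea that
an eager send completes before the matching receive is even posted):

/* Minimal eager-send check -- not the attached test4096.cpp, just a sketch.
 * If the 4096-byte MPI_Send returns almost immediately even though the
 * receiver delays its MPI_Recv, the message was sent eagerly. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    char buf[4096] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (1 == rank) {
        double t0 = MPI_Wtime();
        MPI_Send(buf, sizeof(buf), MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        printf("MPI_Send returned after %.2f s\n", MPI_Wtime() - t0);
    } else if (0 == rank) {
        sleep(2);   /* post the matching receive late on purpose */
        MPI_Recv(buf, sizeof(buf), MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}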



Then, I implemented a simple program (103 lines) that makes Open MPI
1.4.3 hang. The code is in make-it-hang.cpp. At each iteration, each
rank sends a message to a randomly-selected destination. A rank polls for
new messages with MPI_Iprobe. Each rank prints the current time every
second for 30 seconds. Using this simple code, I ran 4 test cases,
each with a different outcome (use the Makefile if you want to reproduce
the bug).
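
The communication pattern boils down to roughly the following (again only a
sketch of the described loop, not the attached make-it-hang.cpp):

/* Sketch of the described pattern: each rank repeatedly sends a fixed-size
 * message to a random peer while draining incoming messages via MPI_Iprobe. */
#include <mpi.h>
#include <stdlib.h>

#define MSG_SIZE 4096

int main(int argc, char **argv)
{
    int rank, size, flag;
    char sendbuf[MSG_SIZE] = {0}, recvbuf[MSG_SIZE];
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    srand(rank + 1);

    double end = MPI_Wtime() + 30.0;              /* run for ~30 seconds */
    while (MPI_Wtime() < end) {
        int dest = rand() % size;                 /* random destination */
        MPI_Isend(sendbuf, MSG_SIZE, MPI_BYTE, dest, 0, MPI_COMM_WORLD, &req);

        /* drain whatever has arrived so far */
        MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &status);
        while (flag) {
            MPI_Recv(recvbuf, MSG_SIZE, MPI_BYTE, status.MPI_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &status);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    /* note: messages still in flight here are simply abandoned; this is a
     * reproducer sketch, not a well-terminated benchmark */
    MPI_Finalize();
    return 0;
}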

Before I describe these cases, I will describe the testing hardware. 

I use a computer with 32 x86_64 cores (see cat-proc-cpuinfo.txt.gz). 
The computer has 128 GB of physical memory (see
cat-proc-meminfo.txt.gz).
It runs Fedora Core 11 with Linux 2.6.30.10-105.2.23.fc11.x86_64 (see
dmesg.txt.gz & uname.txt).
Default kernel parameters are utilized at runtime (see
sudo-sysctl-a.txt.gz).

The C++ compiler is g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2) (see
g++--version.txt).


I compiled Open MPI 1.4.3 myself (see config.out.gz, make.out.gz,
make-install.out.gz).
Finally, I use Open MPI 1.4.3 with defaults (see ompi_info.txt.gz).




Now I can describe the cases.


Case 1: 30 MPI ranks, message size is 4096 bytes

File: mpirun-np-30-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.




Case 2: 30 MPI ranks, message size is 1 byte

File: mpirun-np-30-Program-1.txt.gz
Outcome: It runs just fine.




Case 3: 2 MPI ranks, message size is 4096 bytes

File: mpirun-np-2-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.




Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is
disabled

File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
Outcome: It runs just fine.





A backtrace of the processes in Case 1 is in gdb-bt.txt.gz.




Thank you!

#include <mpi.h>
#include <iostream>
using namespace std;

int main(int argc,char**argv){
	int rank;
	int size;
	MPI_Init(&argc,&argv);
	MPI_Comm_rank(MPI_COMM_WORLD,&rank);
	MPI_Comm_size(MPI_COMM_WORLD,&size);
	cout<<"Rank "

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24047

2010-11-16 Thread Tim Mattox
I see a bug in this code fragment:

+#define MEMMOVE(d, s, l)  \
+do {  \
+if( (((d) < (s)) && (((d) + (l)) > (s))) ||   \
+(((s) < (d)) && (((s) + (l)) > (s))) ) {  \
+memmove( (d), (s), (l) ); \
+} else {  \
+MEMCPY( (d), (s), (l) );  \
+} \
+} while (0)

Shouldn't this line
+(((s) < (d)) && (((s) + (l)) > (s))) ) {  \

be like this instead?
+(((s) < (d)) && (((s) + (l)) > (d))) ) {  \
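
For what it's worth, a small standalone check of the corrected condition
behaves as expected (my own harness, with MEMCPY simplified to plain memcpy,
not the trunk version):

/* Standalone check of the corrected overlap test. */
#include <stdio.h>
#include <string.h>

#define MEMCPY(d, s, l) memcpy((d), (s), (l))

#define MEMMOVE(d, s, l)                              \
do {                                                  \
    if( (((d) < (s)) && (((d) + (l)) > (s))) ||       \
        (((s) < (d)) && (((s) + (l)) > (d))) ) {      \
        memmove( (d), (s), (l) );                     \
    } else {                                          \
        MEMCPY( (d), (s), (l) );                      \
    }                                                 \
} while (0)

int main(void)
{
    char buf[32] = "abcdefghijklmnopqrstuvwxyz01234";

    /* non-overlapping, s < d: the corrected test picks MEMCPY, whereas the
     * original line (s + l > s) would always have picked memmove here */
    MEMMOVE(buf + 16, buf, 8);

    /* overlapping, s < d and s + l > d: the corrected test picks memmove,
     * so the copy stays correct despite the overlap */
    MEMMOVE(buf + 4, buf, 8);

    printf("%s\n", buf);
    return 0;
}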

On Fri, Nov 12, 2010 at 6:22 PM,   wrote:
> Author: bosilca
> Date: 2010-11-12 18:22:35 EST (Fri, 12 Nov 2010)
> New Revision: 24047
> URL: https://svn.open-mpi.org/trac/ompi/changeset/24047
>
> Log:
> Add a second version of the datatype copy function using memmove instead of 
> memcpy.
> As memmove is slower than memcpy, I added the required logic to only use it 
> when
> really necessary.
>
> No modification from developers point of view, you should always call
> opal_datatype_copy_content_same_ddt.
>
>
> Added:
>   trunk/opal/datatype/opal_datatype_copy.h
> Text files modified:
>   trunk/opal/datatype/Makefile.am            |     3
>   trunk/opal/datatype/opal_datatype_copy.c   |   253 +++
>   trunk/opal/datatype/opal_datatype_memcpy.h |    28
>   3 files changed, 48 insertions(+), 236 deletions(-)
>
> Modified: trunk/opal/datatype/Makefile.am
[snip]

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 timat...@open-mpi.org || tmat...@gmail.com
    I'm a bright... http://www.the-brights.net/



Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r24047

2010-11-16 Thread George Bosilca
Thanks! Good catch.

  George

On Nov 16, 2010, at 18:26, Tim Mattox  wrote:

> I see a bug in this code fragment:
> 
> +#define MEMMOVE(d, s, l)  \
> +do {  \
> +if( (((d) < (s)) && (((d) + (l)) > (s))) ||   \
> +(((s) < (d)) && (((s) + (l)) > (s))) ) {  \
> +memmove( (d), (s), (l) ); \
> +} else {  \
> +MEMCPY( (d), (s), (l) );  \
> +} \
> +} while (0)
> 
> Shouldn't this line
> +(((s) < (d)) && (((s) + (l)) > (s))) ) {  \
> 
> be like this instead?
> +(((s) < (d)) && (((s) + (l)) > (d))) ) {  \
> 
> On Fri, Nov 12, 2010 at 6:22 PM,   wrote:
>> Author: bosilca
>> Date: 2010-11-12 18:22:35 EST (Fri, 12 Nov 2010)
>> New Revision: 24047
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/24047
>> 
>> Log:
>> Add a second version of the datatype copy function using memmove instead of 
>> memcpy.
>> As memmove is slower than memcpy, I added the required logic to only use it 
>> when
>> really necessary.
>> 
>> No modification from developers point of view, you should always call
>> opal_datatype_copy_content_same_ddt.
>> 
>> 
>> Added:
>>   trunk/opal/datatype/opal_datatype_copy.h
>> Text files modified:
>>   trunk/opal/datatype/Makefile.am            |     3
>>   trunk/opal/datatype/opal_datatype_copy.c   |   253 +++
>>   trunk/opal/datatype/opal_datatype_memcpy.h |    28
>>   3 files changed, 48 insertions(+), 236 deletions(-)
>> 
>> Modified: trunk/opal/datatype/Makefile.am
> [snip]
> 
> -- 
> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
>  timat...@open-mpi.org || tmat...@gmail.com
> I'm a bright... http://www.the-brights.net/
> 