[OMPI users] Dual Gigabit ethernet support

2006-10-24 Thread Tony Ladd
Lisandro

I use my own network-testing program; I wrote it some time ago because
Netpipe only tested one-way rates at that point. I haven't tried IMB, but I
looked at the source and it is very similar to what I do: 1) set up buffers
with data, 2) start the clock, 3) call MPI_xxx N times, 4) stop the clock,
5) calculate the rate. IMB tests more things than I do; I just focused on the
calls I use (send, recv, allreduce). I have done a lot of testing of hardware
and software.
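
Roughly, the timing loop amounts to something like this (an illustrative
sketch only, not my actual tester; the vector length, repeat count, and the
choice of MPI_Allreduce as the example call are arbitrary):

program timing_loop_sketch
  use mpi
  implicit none
  integer, parameter :: n = 16384      ! vector length (illustrative)
  integer, parameter :: reps = 1000    ! number of calls to time (illustrative)
  integer :: ierr, rank, i
  double precision :: t0, t1, mbytes_per_sec
  double precision :: sendbuf(n), recvbuf(n)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  sendbuf = 1.0d0                      ! 1) set up buffers with data

  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()                     ! 2) start clock
  do i = 1, reps                       ! 3) call MPI_xxx N times
     call MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, MPI_COMM_WORLD, ierr)
  end do
  t1 = MPI_Wtime()                     ! 4) stop clock

  ! 5) calculate rate: vector size in bytes over average time per call
  mbytes_per_sec = 8.0d0 * n * reps / ((t1 - t0) * 1.0d6)
  if (rank == 0) print '(a,f10.2,a)', ' rate = ', mbytes_per_sec, ' MB/s'

  call MPI_Finalize(ierr)
end program timing_loop_sketch
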
I will have some web pages posted soon and will put a note here when I do.
In the meantime, a couple of things:
A) I have found that the switch is the biggest discriminant if you want to
run HPC over Gigabit Ethernet. Most GigE switches choke when all the ports
are in use at once. That is the usual HPC traffic pattern, but not that of a
typical network, which is what these switches are geared towards. The one
exception I have found is the Extreme Networks x450a-48t. In some test
patterns I found it to be 500 times faster (not a typo) than its
predecessor, the s400-48t. I have tested several GigE switches (Extreme,
Force10, HP, Asante) and the x450 is the only one that copes with high
traffic loads in all port configurations. It's expensive for a GigE switch
(~$6500) but worth it, in my opinion, if you want to do HPC. It's still much
cheaper than Infiniband.
B) You have to test the switch in different port configurations; a random
ring of SendRecv calls is good for this. I don't think IMB has it in its
test suite, but it's easy to program. Alternatively, you can change the
order of nodes in the machinefile to force unfavorable port assignments. A
stride of 12 is a good test, since many GigE switches use 12-port ASICs and
this forces all the traffic onto the backplane. On the Summit 400 this more
or less stops it working: rates drop to a few KBytes/sec on each wire. The
x450 has no problem with the same test. You need to know how your nodes are
wired to the switch to do this test. A sketch of such a strided SendRecv
test follows.
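
The core of that strided SendRecv test looks roughly like this (an
illustrative sketch only; the message length and repeat count are arbitrary,
and stride 12 is chosen for the 12-port ASIC reason given above):

program stride_sendrecv_sketch
  use mpi
  implicit none
  integer, parameter :: n = 131072     ! doubles per message (illustrative)
  integer, parameter :: step = 12      ! stride; 12 pushes traffic across ASIC boundaries
  integer, parameter :: reps = 100
  integer :: ierr, rank, nprocs, dest, src, i
  double precision :: t0, t1
  double precision :: sendbuf(n), recvbuf(n)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! pair each rank with partners a fixed stride away around the ring
  dest = mod(rank + step, nprocs)
  src  = mod(rank - step + nprocs, nprocs)
  sendbuf = dble(rank)

  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  do i = 1, reps
     call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, dest, 0, &
                       recvbuf, n, MPI_DOUBLE_PRECISION, src,  0, &
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  end do
  t1 = MPI_Wtime()

  ! per-link rate in MB/s (8 bytes per double, one message each way per call)
  if (rank == 0) print '(a,f10.2,a)', ' per-link rate = ', &
       8.0d0 * n * reps / ((t1 - t0) * 1.0d6), ' MB/s'

  call MPI_Finalize(ierr)
end program stride_sendrecv_sketch
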
C) GAMMA is an extraordinary accomplishment in my view; in a number of tests
with codes like DLPOLY, GROMACS, and VASP it can be 2-3 times the speed of
TCP-based runs on 64 CPUs. In many instances I get comparable (and
occasionally better) scaling than on the university HPC system, which has an
Infiniband interconnect. Note that I am not saying GigE is comparable to IB,
but that a typical HPC setup, with nodes scattered all over a fat-tree
topology (including oversubscription of the links and switches), is enough
of a handicap that an optimized GigE setup can compete, at least up to 48
nodes (96 CPUs in our case). I have worked with Giuseppe Ciaccio for the
past 9 months eradicating some obscure bugs in GAMMA: I find them; he fixes
them. We have GAMMA running on 48 nodes quite reliably, but there are still
many issues to address. GAMMA is very much a research tool; there are a
number of features(?) which would hinder its use in an HPC environment.
Basically, Giuseppe needs help with development. Any volunteers?

Tony
---
Tony Ladd
Professor, Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu 




Re: [OMPI users] Dual Gigabit ethernet support

2006-10-24 Thread Durga Choudhury

Very interesting, indeed! Message passing over raw Ethernet on cheap COTS
PCs is exactly what people like me, with very shallow pockets, need right
now. Great work! What would make this effort *really* cool is a one-to-one
mapping of APIs from the MPI domain to the GAMMA domain so that, for
example, existing MPI code could be ported with a trivial amount of work.
Professor Ladd, how did you do this porting, e.g. for VASP? How much of an
effort was it? (Or did the VASP guys already have a version running over
GAMMA?)

Thanks
Durga



--
Devil wanted omnipresence;
He therefore created communists.


[OMPI users] Dual Gigabit Ethernet Support

2006-10-24 Thread Tony Ladd
Durga

I guess we have strayed a bit from the original post. My personal opinion is
that a good number of codes, not just the trivially parallelizable ones, can
run in HPC-like mode over Gigabit Ethernet. The hardware components are one
key: PCI-X, a NIC with low hardware latency (the Intel PRO 1000 is 6.6
microseconds vs. about 14 for the Broadcom 5721), and a non-blocking (that's
the key word) switch. Then you need a good driver and a good MPI software
layer. At present MPICH is ahead of LAM/OpenMPI/MVAPICH in its
implementation of optimized collectives, at least as it seems to me (let me
say that quickly, before I get flamed). MPICH got a bad rap performance-wise
because its TCP driver was mediocre (compared with LAM and OpenMPI), but
MPICH + GAMMA is very fast. MPIGAMMA even beats our Infiniband cluster
running OpenMPI on MPI_Allreduce; the test was with 64 CPUs: 32 nodes on the
GAMMA cluster (dual-core P4s) and 16 nodes on the Infiniband cluster (dual
dual-core Opterons). The IB cluster worked out at 24 MBytes/sec (vector size
divided by time) and GigE + MPIGAMMA at 39 MBytes/sec. On the other hand, if
I use my own optimized Allreduce (a simplified version of the one in MPICH)
on the IB cluster, it gets 108 MBytes/sec. So the tricky thing is that all
the components need to be in place to get good application performance.
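
For what it's worth, the usual trick behind such an optimized Allreduce, at
least for power-of-two process counts, is recursive doubling. A rough sketch
of the idea (my illustration only, not the MPICH implementation or my
production code):

! Recursive-doubling allreduce (sum of doubles); assumes nprocs is a power of two.
subroutine rd_allreduce(work, n, comm, ierr)
  use mpi
  implicit none
  integer, intent(in)  :: n, comm
  integer, intent(out) :: ierr
  double precision, intent(inout) :: work(n)
  double precision :: tmp(n)
  integer :: rank, nprocs, mask, partner

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)

  mask = 1
  do while (mask < nprocs)
     partner = ieor(rank, mask)        ! exchange with the rank differing in one bit
     call MPI_Sendrecv(work, n, MPI_DOUBLE_PRECISION, partner, 0, &
                       tmp,  n, MPI_DOUBLE_PRECISION, partner, 0, &
                       comm, MPI_STATUS_IGNORE, ierr)
     work = work + tmp                 ! combine partial sums
     mask = mask * 2
  end do
end subroutine rd_allreduce

For long vectors a reduce-scatter plus allgather variant is usually better
on bandwidth, which is presumably where the bigger gains come from.
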

GAMMA is not so easy to set up; I had considerable help from Giuseppe. It
has libraries to compile and the kernel needs to be recompiled. Once I got
that automated, I could build and install a new version of GAMMA in about 5
minutes. The MPIGAMMA build is just like MPICH, and MPIGAMMA works almost
exactly the same, so any application that compiles under MPICH should
compile under MPIGAMMA just by changing the path. I have run half a dozen
apps with GAMMA: Netpipe, Netbench (my network tester, a simplified version
of IMB), Susp3D (my own code, a CFD-like application), and DLPOLY all
compile out of the box. GROMACS compiles but has a couple of "bugs" that
crash it on execution. One is an archaic test for MPICH that prevents a
clean exit; it must have been a bugfix for an earlier version of MPICH. The
other seems to be an fclose of an unassigned file pointer. It works OK in
LAM, but my guess is that it is, strictly speaking, illegal. A student was
supposed to check on that. VASP also compiles out of the box if you can
compile it with MPICH. But there is a problem with MPIGAMMA and the
MPI_Alltoall function right now: it works, but it suffers from hangups and
long delays, so GAMMA is not good for VASP at this moment. You see
substantial performance improvements sometimes, but other times it is
dreadfully slow. I can reproduce the problem with an Alltoall test code
(sketched below) and Giuseppe is going to try to debug it.
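
The test itself is nothing fancy; roughly the following (an illustrative
reconstruction, not the actual test code; sizes are arbitrary). A hang or a
wildly variable time per call is what reproduces the problem:

program alltoall_sketch
  use mpi
  implicit none
  integer, parameter :: n = 4096       ! doubles sent to each rank (illustrative)
  integer, parameter :: reps = 100
  integer :: ierr, rank, nprocs, i
  double precision :: t0, t1
  double precision, allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  allocate(sendbuf(n*nprocs), recvbuf(n*nprocs))
  sendbuf = dble(rank)

  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  do i = 1, reps
     call MPI_Alltoall(sendbuf, n, MPI_DOUBLE_PRECISION, &
                       recvbuf, n, MPI_DOUBLE_PRECISION, &
                       MPI_COMM_WORLD, ierr)
  end do
  t1 = MPI_Wtime()

  if (rank == 0) print '(a,f10.3,a)', ' time per alltoall = ', &
       (t1 - t0) / reps, ' s'

  call MPI_Finalize(ierr)
end program alltoall_sketch
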

So GAMMA is not a panacea. In most circumstances it is stable and
predictable, and much more reproducible than MPI over TCP. But there may
still be one or two bugs, and several issues remain:
1) Since GAMMA is tightly entwined in the kernel, a crash frequently brings
the whole system down, which is a bit annoying; it can also crash other
nodes in the same GAMMA virtual machine.
2) NICs are very buggy hardware; if you look at a TCP driver, you will find
a large number of hardware-bug workarounds in it. A number of GAMMA problems
can be traced to this, and it is a lot of work to reprogram all the
workarounds.
3) GAMMA nodes have to be preconfigured at boot. You can run more than one
job on a GAMMA virtual machine, but it's a little iffy; there can be
interactions between nodes on the same VM even if they are running different
jobs, and different GAMMA VMs need different VLANs. So a multiuser
environment is still problematic.
4) Giuseppe said MPIGAMMA was a very difficult code to write, so I would
guess a port to OpenMPI would not be trivial. Also, I would want to see
optimized collectives in OpenMPI before I switched from MPICH.

As far as I know, GAMMA is the most advanced non-TCP protocol. At its core
it really works well, but it still needs a lot more testing and development.
Giuseppe is great to work with if anyone out there is interested. Go to the
MPIGAMMA website for more info:
http://www.disi.unige.it/project/gamma/mpigamma/index.html

Tony




[OMPI users] MPI_GATHER: missing f90 interfaces for mixed dimensions

2006-10-24 Thread Michael Kluskens
This is a reminder about an issue I brought up back at the end of May 2006;
the solution at the time was to disable --with-mpi-f90-size=large until 1.2.


Testing 1.3a1r12274, I see that no progress has been made on this, even
though I submitted the precise changes needed to expand "large" for
MPI_Gather to handle reasonable coding practices. I'm sure other MPI
routines are affected by this, and the solution is not difficult.


Now, I could manually re-patch 1.3 every week, but it would be better for
everyone if I were not the only Fortran MPI programmer who could use
--with-mpi-f90-size=large and have arrays in MPI_Gather that are of
different dimensions.


Michael


Details below (edited)


Look at limitations of the following:

   --with-mpi-f90-size=large
(medium + all MPI functions with 2 choice buffers, but only when both  
buffers are the same type)


Not sure what "same type" was intended to mean here, but requiring the same
dimension is not a good idea, and that is what is currently implemented.



subroutine MPI_Gather0DI4(sendbuf, sendcount, sendtype, recvbuf, recvcount, &
        recvtype, root, comm, ierr)
  include 'mpif-common.h'
  integer*4, intent(in) :: sendbuf
  integer, intent(in) :: sendcount
  integer, intent(in) :: sendtype
  integer*4, intent(out) :: recvbuf
  integer, intent(in) :: recvcount
  integer, intent(in) :: recvtype
  integer, intent(in) :: root
  integer, intent(in) :: comm
  integer, intent(out) :: ierr
end subroutine MPI_Gather0DI4

Think about it: all processes are sending data back to root, so if each
sends a single integer, where do the second, third, fourth, etc. integers
go?


The interfaces for MPI_GATHER do not include the possibility that the
sendbuf is an integer and the recvbuf is an integer array. For example, the
following interface does not exist, but it seems legal, or should be legal
(and should at the very least replace the interface above); a usage sketch
follows it:

subroutine MPI_Gather01DI4(sendbuf, sendcount, sendtype, recvbuf, recvcount, &
        recvtype, root, comm, ierr)
  include 'mpif-common.h'
  integer*4, intent(in) :: sendbuf
  integer, intent(in) :: sendcount
  integer, intent(in) :: sendtype
  integer*4, dimension(:), intent(out) :: recvbuf
  integer, intent(in) :: recvcount
  integer, intent(in) :: recvtype
  integer, intent(in) :: root
  integer, intent(in) :: comm
  integer, intent(out) :: ierr
end subroutine MPI_Gather01DI4
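
For illustration, this is the kind of perfectly ordinary call that interface
would cover (a minimal example program; the names and values are just an
illustration):

program gather_scalar_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  integer*4 :: myval                             ! scalar sendbuf on every rank
  integer*4, allocatable :: allvals(:)           ! rank-1 recvbuf on root

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  allocate(allvals(nprocs))

  myval = rank + 1
  ! scalar send buffer, array receive buffer: rejected by the current
  ! "large" F90 interfaces because the two buffers differ in dimension
  call MPI_Gather(myval, 1, MPI_INTEGER4, allvals, 1, MPI_INTEGER4, &
                  0, MPI_COMM_WORLD, ierr)

  if (rank == 0) print *, 'gathered: ', allvals
  call MPI_Finalize(ierr)
end program gather_scalar_sketch
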


Also, consider that there may be no reason to restrict sendbuf and recvbuf
to the same number of dimensions, but it is reasonable to expect sendbuf to
have the same number of dimensions as recvbuf, or fewer (both being scalars
seems unreasonable, though). This does complicate the issue from an
order-(N+1) problem to an order-(N+1)*(N+2)/2 problem (15 combinations
instead of 5 for N = 4, unless otherwise restricted), but it should be
doable, and certain functions should have the 0,0 case eliminated.

--

Below is my solution for the generating scripts for MPI_Gather for F90. It
might be acceptable and reasonable to reduce the combinations to just equal
dimensions or one dimension less (00, 01, 11, 12, 22).


Michael

-- mpi-f90-interfaces.h.sh
#---

output_120() {
    if test "$output" = "0"; then
        return 0
    fi
    procedure=$1
    rank=$2
    rank2=$3
    type=$5
    type2=$6
    proc="$1$2$3D$4"
    cat <<EOF
  call ${procedure}(sendbuf, sendcount, sendtype, recvbuf, recvcount, &
        recvtype, root, comm, ierr)
end subroutine ${proc}

EOF
}

for rank in $allranks
do
   case "$rank" in  0)  dim=''  ;  esac
   case "$rank" in  1)  dim=', dimension(:)'  ;  esac
   case "$rank" in  2)  dim=', dimension(:,:)'  ;  esac
   case "$rank" in  3)  dim=', dimension(:,:,:)'  ;  esac
   case "$rank" in  4)  dim=', dimension(:,:,:,:)'  ;  esac
   case "$rank" in  5)  dim=', dimension(:,:,:,:,:)'  ;  esac
   case "$rank" in  6)  dim=', dimension(:,:,:,:,:,:)'  ;  esac
   case "$rank" in  7)  dim=', dimension(:,:,:,:,:,:,:)'  ;  esac

   for rank2 in $allranks
   do
 case "$rank2" in  0)  dim2=''  ;  esac
 case "$rank2" in  1)  dim2=', dimension(:)'  ;  esac
 case "$rank2" in  2)  dim2=', dimension(:,:)'  ;  esac
 case "$rank2" in  3)  dim2=', dimension(:,:,:)'  ;  esac
 case "$rank2" in  4)  dim2=', dimension(:,:,:,:)'  ;  esac
 case "$rank2" in  5)  dim2=', dimension(:,:,:,:,:)'  ;  esac
 case "$rank2" in  6)  dim2=', dimension(:,:,:,:,:,:)'  ;  esac
 case "$rank2" in  7)  dim2=', dimension(:,:,:,:,:,:,:)'  ;  esac

 if [ ${rank2} != "0" ] && [ ${rank2} -ge ${rank} ]; then

   output MPI_Gather ${rank} ${rank2} CH "character${dim}"