Re: [OMPI users] Performance tuning: focus on latency

2007-07-23 Thread Jeff Squyres

On Jul 23, 2007, at 6:43 AM, Biagio Cosenza wrote:

I'm working on a parallel real-time renderer: an embarrassingly
parallel problem where latency is the main barrier to high performance.


Two observations:

1) I did a simple "ping-pong" test (the master does a Bcast + an
IRecv for each node + a Waitall) similar to the effective renderer
workload. Using a cluster of 37 nodes on Gigabit Ethernet, the latency
is usually low (about 1-5 ms), but sometimes there are peaks of about
200 ms. I suspect the cause is a packet retransmission on one of the
37 connections, which blows the overall performance of the test (of
course, the final Waitall is a synch).


2) A research team argues in a paper that MPI copes poorly with
managing latency dynamically. They also raise an interesting point
about enabling/disabling the Nagle algorithm. (I paste the relevant
paragraph below.)



So I have two questions:

1) Why does my test have these peaks? How can I address them (I am
thinking of the btl tcp parameters)?


They are probably beyond Open MPI's control -- OMPI mainly does
read() and write() down TCP sockets and relies on the kernel to do all
the low-level TCP protocol / wire transmission stuff.


You might want to try increasing your TCP buffer sizes, but I think
that the Linux kernel has some built-in limits.  Other experts might
want to chime in here...
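
(For example -- assuming your version still exposes the btl_tcp_sndbuf /
btl_tcp_rcvbuf MCA parameters -- something along the lines of

   mpirun --mca btl_tcp_sndbuf 1048576 --mca btl_tcp_rcvbuf 1048576 ...

would be the place to start experimenting.)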


2) When does OpenMPI disable the Nagle algorithm? Supposing I DON'T
need Nagle to be ON (focusing only on latency), how can I increase
performance?


It looks like we disable Nagle (i.e., set TCP_NODELAY to 1) right when
TCP BTL connections are made.  Surprisingly, it looks like we don't
have a run-time option to toggle it for power-users like you who want
to really tweak around.


If you want to play with it, please edit
ompi/mca/btl/tcp/btl_tcp_endpoint.c.  You'll see the references to
TCP_NODELAY in conjunction with setsockopt().  Set the optval to 0
instead of 1 (which re-enables Nagle).  A simple "make install" in
that directory will recompile the TCP component and re-install it
(assuming you have done a default build with OMPI components built as
standalone plugins).  Let us know what you find.
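
(For reference, the pattern in question is just the standard socket
option call; a generic sketch, not the exact Open MPI code:)

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* optval = 1: disable Nagle (what OMPI does by default).
       optval = 0: re-enable Nagle (the experiment suggested above). */
    int set_nodelay(int sd, int optval)
    {
        return setsockopt(sd, IPPROTO_TCP, TCP_NODELAY,
                          &optval, sizeof(optval));
    }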


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] sge qdel fails

2007-07-23 Thread Reuti

Hi,

Running conventional TCP/IP, all is safe AFAICS - all processes will
be killed on all involved nodes. The problem arises with OFED, with
which we also see this behavior using MVAPICH.


Unfortunately we have only a limited number of nodes with InfiniBand,
and hence time to test and develop something is highly limited, as
users running applications there take priority.


On 23.07.2007, at 21:29, Pak Lui wrote:


Hi Henk,

SLIM H.A. wrote:

Dear Pak Lui

I can delete the (sge) job with qdel -f such that it disappears from
the job list, but the application processes keep running, including
the shepherds. I have to kill them with -15.

For some reason the kill -15 does not reach mpirun. (We use such a
parameter to mpirun on our Myrinet MX nodes with MPICH; that's why I
asked.)


I believe qdel would send a SIGKILL to mpirun


Correct, it's sent to the complete process group which qrsh-starter
spawns, i.e. "kill -9 -- -processgroup_id".



instead of a SIGTERM (-15); that is why you don't see the signal
reach mpirun. Since there is no way to catch a SIGKILL, that may be
why the orted and the processes keep running.


In a Tightly Integrated parallel environment, there shouldn't be any  
need to catch such a signal. SGE will kill all started processes on  
its own - no further action necessary.



Hmm, this actually reminds me of a related problem: the qsub -notify
option does not work as intended under ORTE. The qsub -notify option
is supposed to send a SIGUSR2 to mpirun and the processes, warning of
an impending SIGKILL N seconds before it actually happens. However,
we don't catch the SIGUSR2 signal in ORTE specifically for SGE (or the
gridengine modules), therefore users would see mpirun and orted exit
before the user apps can catch the SIGUSR2 signal. I should file a
trac bug against this SGE feature we don't yet support and fix it
sometime in the future.


As SIGUSR2 is sent to the complete process group (and keep in mind:
also to the job script itself), it would just mean ignoring SIGUSR1/2
in orted (and maybe in mpirun, otherwise it also must be trapped
there). So it could be included in the action for the --no-daemonize
option given to orted when running under SGE. For now you would also
need this in the job script:


#!/bin/sh
trap '' usr2
export PATH=/home/reuti/openmpi-1.2.3/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH${LD_LIBRARY_PATH:+:}/home/reuti/openmpi-1.2.3/lib

(trap '' usr2; exec mpirun -np $NSLOTS /home/reuti/mpihello)
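
(On the orted side, ignoring the signals would amount to something
like the following -- a generic C sketch of the idea, not the actual
Open MPI code:)

    #include <signal.h>

    /* Ignore the SGE -notify warning signals so the daemon keeps
       running until the real SIGKILL arrives. */
    signal(SIGUSR1, SIG_IGN);
    signal(SIGUSR2, SIG_IGN);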


-- Reuti


So back to your problem. Although this is unintended, maybe you can
try running the job with qsub -notify in the meantime until we change
the above, since it will send a SIGUSR2 to mpirun, which should
terminate mpirun, the orteds and the user processes more gracefully
than qdel (or SIGKILL), because SIGKILL does not allow orted to kill
off the user processes, as SIGTERM or SIGUSR1/2 would.



Just to confirm, there is no configure directive specific to
gridengine when building openmpi?


Right, there aren't any configure directives currently.



Thanks

henk


-Original Message-
From: users-boun...@open-mpi.org
[mailto:users-boun...@open-mpi.org] On Behalf Of Pak Lui
Sent: 23 July 2007 15:16
To: Open MPI Users
Subject: Re: [OMPI users] sge qdel fails

Hi Henk,

The sge script should not require any extra parameter. The
qdel command should send the kill signal to mpirun and also
remove the SGE-allocated tmp directory (something like
/tmp/174.1.all.q/) which contains the OMPI session dir for
the running job, and in turn would cause orted and the user
processes to exit.

Maybe you could try qdel -f to force a delete from the
sge_qmaster, in case sge_execd does not respond to the
delete request from the sge_qmaster?

SLIM H.A. wrote:

I am using OpenMPI 1.2.3 with SGE 6.0u7 over InfiniBand (OFED 1.2),
following the recommendation in the OpenMPI FAQ

http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

The job runs but when the user wants to delete the job with

the qdel

command, this fails. Does the mpirun command

mpirun -np $NSLOTS ./exe

in the sge script require extra parameters?

Thanks for any advice

Henk

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--

- Pak Lui
pak@sun.com
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

- Pak Lui
pak@sun.com
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Building OMPI with dated tools & libs

2007-07-23 Thread Jeff Squyres
It *should* work.  We stopped developing for the Cisco (mVAPI) stack  
a while ago, but as far as we know, it still works fine.  See:


http://www.open-mpi.org/faq/?category=openfabrics#vapi-support

That being said, your approach of "it ain't broke, don't fix it" is  
certainly quite reasonable.



On Jul 23, 2007, at 4:51 PM, Jeff Pummill wrote:


Hmmm... compilation SEEMED to go OK with the following ./configure...

./configure --prefix=/nfsutil/openmpi-1.2.3 --with-mvapi=/usr/local/topspin/
CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-m64 CXXFLAGS=-m64
FFLAGS=-m64 FCFLAGS=-m64


And the following looks promising...

./ompi_info | grep mvapi
MCA btl: mvapi (MCA v1.0, API v1.0.1, Component v1.2.3)

I have a post-doc that will test some application code in the next  
day or so. Maybe the old stuff worked just fine!



Jeff F. Pummill
Senior Linux Cluster Administrator
University of Arkansas
Fayetteville, Arkansas 72701



Jeff Pummill wrote:

Good morning all, I have been very impressed so far with OpenMPI on
one of our smaller clusters running Gnu compilers and Gig-E
interconnects, so I am considering a build on our large cluster. The
potential problem is that the compilers are Intel 8.1 versions and the
Infiniband is supported by three year old Topspin (now Cisco) drivers
and libraries. Basically, this is a cluster that runs a very heavy
workload using MVAPICH, thus we have adopted the "if it ain't broke,
don't fix it" methodology... thus all of the drivers, libraries, and
compilers are approximately 3 years old. Would it be reasonable to
expect OpenMPI 1.2.3 to build and run in such an environment?

Thanks! Jeff Pummill University of Arkansas

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



[OMPI users] MPI_HOME

2007-07-23 Thread Francesco Pietra
openmpi-1.2.3 compiled on Debian Linux amd64 etch with 

./configure CC=/opt/intel/cce/9.1.042/bin/icc
CXX=/opt/intel/cce/9.1.042/bin/icpc F77=/opt/intel/fce/9.1.036/bin/ifort
FC=/opt/intel/fce/9.1.036/bin/ifort --with-libnuma=/usr/lib

ompi_info |grep libnuma

ompi_info |grep maffinity

reported OK, though an attempt to install Amber9 parallel, at

./configure -openmpi ifort_x86_64

reported:

Error, MPI_HOME must be set.

OK, for my installation and bash it should be

export MPI_HOME=/usr/local/openmpi-1.2.3

Not tried, because the above Error message also contained:

Set it to the location of the include/ and lib/ subdirectories containing
mpi.f
libmpi.a
liblam.a
liblamf77mpi.a

which was confusing to me. None of these libraries are on my system,
and I never used LAM.

Thanks for helping

francesco pietra


  




Re: [OMPI users] sge qdel fails

2007-07-23 Thread Pak Lui

Hi Henk,

SLIM H.A. wrote:

Dear Pak Lui

I can delete the (sge) job with qdel -f such that it disappears from the
job list but the application processes keep running, including the
shepherds. I have to kill them with -15

For some reason the kill -15 does not reach mpirun. (We use such a
parameter to mpirun on our myrinet mx nodes with mpich, that's why I
asked).


I believe qdel sends a SIGKILL to mpirun instead of a SIGTERM (-15);
that is why you don't see the signal reach mpirun. Since there is no
way to catch a SIGKILL, that may be why the orted and the processes
keep running.


Hmm, this actually reminds me of a related problem: the qsub -notify
option does not work as intended under ORTE. The qsub -notify option
is supposed to send a SIGUSR2 to mpirun and the processes, warning of
an impending SIGKILL N seconds before it actually happens. However,
we don't catch the SIGUSR2 signal in ORTE specifically for SGE (or the
gridengine modules), therefore users would see mpirun and orted exit
before the user apps can catch the SIGUSR2 signal. I should file a
trac bug against this SGE feature we don't yet support and fix it
sometime in the future.


So back to your problem. Although this is unintended, maybe you can
try running the job with qsub -notify in the meantime until we change
the above, since it will send a SIGUSR2 to mpirun, which should
terminate mpirun, the orteds and the user processes more gracefully
than qdel (or SIGKILL), because SIGKILL does not allow orted to kill
off the user processes, as SIGTERM or SIGUSR1/2 would.
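
(If your application wants to react to the -notify warning itself, a
plain signal handler is enough -- a minimal sketch, assuming a C MPI
application:)

    #include <mpi.h>
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t shutdown_requested = 0;

    /* qsub -notify delivers SIGUSR2 N seconds before the SIGKILL;
       just record it and let the main loop exit cleanly. */
    static void on_usr2(int sig) { shutdown_requested = 1; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        signal(SIGUSR2, on_usr2);
        while (!shutdown_requested) {
            sleep(1);   /* stand-in for one unit of real work */
        }
        /* checkpoint / clean up here */
        MPI_Finalize();
        return 0;
    }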




Just to confirm, there is no configure directive specific to gridengine
when building openmpi?


Right, there aren't any configure directives currently.



Thanks

henk


-Original Message-
From: users-boun...@open-mpi.org 
[mailto:users-boun...@open-mpi.org] On Behalf Of Pak Lui

Sent: 23 July 2007 15:16
To: Open MPI Users
Subject: Re: [OMPI users] sge qdel fails

Hi Henk,

The sge script should not require any extra parameter. The
qdel command should send the kill signal to mpirun and also
remove the SGE-allocated tmp directory (something like
/tmp/174.1.all.q/) which contains the OMPI session dir for
the running job, and in turn would cause orted and the user
processes to exit.


Maybe you could try qdel -f to force a delete from the
sge_qmaster, in case sge_execd does not respond to the
delete request from the sge_qmaster?


SLIM H.A. wrote:
I am using OpenMPI 1.2.3 with SGE 6.0u7 over InfiniBand (OFED 1.2), 
following the recommendation in the OpenMPI FAQ


http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

The job runs but when the user wants to delete the job with 
the qdel 

command, this fails. Does the mpirun command

mpirun -np $NSLOTS ./exe

in the sge script require extra parameters?

Thanks for any advice

Henk

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--

- Pak Lui
pak@sun.com
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

- Pak Lui
pak@sun.com


Re: [OMPI users] sge qdel fails

2007-07-23 Thread SLIM H.A.
Dear Pak Lui

I can delete the (sge) job with qdel -f such that it disappears from
the job list, but the application processes keep running, including
the shepherds. I have to kill them with -15.

For some reason the kill -15 does not reach mpirun. (We use such a
parameter to mpirun on our Myrinet MX nodes with MPICH; that's why I
asked.)

Just to confirm, there is no configure directive specific to gridengine
when building openmpi?

Thanks

henk

> -Original Message-
> From: users-boun...@open-mpi.org 
> [mailto:users-boun...@open-mpi.org] On Behalf Of Pak Lui
> Sent: 23 July 2007 15:16
> To: Open MPI Users
> Subject: Re: [OMPI users] sge qdel fails
> 
> Hi Henk,
> 
> The sge script should not require any extra parameter. The 
> qdel command should send the kill signal to mpirun and also 
> remove the SGE allocated tmp directory (in something like 
> /tmp/174.1.all.q/) which contains the OMPI session dir for 
> the running job, and in turns would cause orted and the user 
> processes to exit.
> 
> Maybe you could try qdel -f  to force delete from the 
> sge_qmaster, in case when sge_execd does not respond to the 
> delete request by the sge_qmaster?
> 
> SLIM H.A. wrote:
> > I am using OpenMPI 1.2.3 with SGE 6.0u7 over InfiniBand (OFED 1.2), 
> > following the recommendation in the OpenMPI FAQ
> > 
> > http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
> > 
> > The job runs but when the user wants to delete the job with 
> the qdel 
> > command, this fails. Does the mpirun command
> > 
> > mpirun -np $NSLOTS ./exe
> > 
> > in the sge script require extra parameters?
> > 
> > Thanks for any advice
> > 
> > Henk
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> 
> - Pak Lui
> pak@sun.com
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 



Re: [OMPI users] MPI_File_set_view rejecting subarray views.

2007-07-23 Thread Moreland, Kenneth
Thanks, Brian.  That did the trick.

-Ken

> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Brian Barrett
> Sent: Thursday, July 19, 2007 3:39 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_File_set_view rejecting subarray views.
> 
> On Jul 19, 2007, at 3:24 PM, Moreland, Kenneth wrote:
> 
> > I've run into a problem with the File I/O with openmpi version 1.2.3.
> > It is not possible to call MPI_File_set_view with a datatype created
> > from a subarray.  Instead of letting me set a view of this type, it
> > gives an invalid datatype error.  I have attached a simple program
> > that
> > demonstrates the problem.  In particular, the following sequence of
> > function calls should be supported, but they are not.
> >
> >   MPI_Type_create_subarray(3, sizes, subsizes, starts,
> >MPI_ORDER_FORTRAN, MPI_BYTE, &view);
> >   MPI_File_set_view(fd, 20, MPI_BYTE, view, "native", MPI_INFO_NULL);
> >
> > After poking around in the source code a bit, I discovered that the
> > I/O implementation actually supports the subarray data type, but
> > there is a check that is issuing an error before the underlying I/O
> > layer (ROMIO) has a chance to handle the request.
> 
> You need to commit the datatype after calling
> MPI_Type_create_subarray.  If you add:
> 
>    MPI_Type_commit(&view);
> 
> after the Type_create, but before File_set_view, the code will run to
> completion.
> 
> Well, the code will then complain about a Barrier after MPI_Finalize
> due to an error in how we shut down when there are files that have
> been opened but not closed (you should also add a call to
> MPI_File_close after the set_view, but I'm assuming it's not there
> because this is a test code).  This is something we need to fix, but
> also signifies a user error.
> 
> 
> Brian
> 
> --
>Brian W. Barrett
>Networking Team, CCS-1
>Los Alamos National Laboratory
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
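
(Putting Brian's fix together, the corrected sequence looks roughly
like the following -- a sketch with made-up array sizes and file name:)

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int sizes[3]    = {10, 10, 10};   /* made-up dimensions */
        int subsizes[3] = { 4,  4,  4};
        int starts[3]   = { 2,  2,  2};
        MPI_Datatype view;
        MPI_File fd;

        MPI_Init(&argc, &argv);
        MPI_Type_create_subarray(3, sizes, subsizes, starts,
                                 MPI_ORDER_FORTRAN, MPI_BYTE, &view);
        MPI_Type_commit(&view);              /* the missing step */
        MPI_File_open(MPI_COMM_WORLD, "data.bin",
                      MPI_MODE_CREATE | MPI_MODE_RDWR,
                      MPI_INFO_NULL, &fd);
        MPI_File_set_view(fd, 20, MPI_BYTE, view, "native",
                          MPI_INFO_NULL);
        MPI_File_close(&fd);   /* avoids the complaint after MPI_Finalize */
        MPI_Type_free(&view);
        MPI_Finalize();
        return 0;
    }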





Re: [OMPI users] orterun --bynode/--byslot problem

2007-07-23 Thread Ralph H Castain
Yes...it would indeed.


On 7/23/07 9:03 AM, "Kelley, Sean"  wrote:

> Would this logic be in the bproc pls component?
> Sean
> 
> 
> From: users-boun...@open-mpi.org on behalf of Ralph H Castain
> Sent: Mon 7/23/2007 9:18 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] orterun --bynode/--byslot problem
> 
> No, byslot appears to be working just fine on our bproc clusters (it is the
> default mode). As you probably know, bproc is a little strange in how we
> launch - we have to launch the procs in "waves" that correspond to the
> number of procs on a node.
> 
> In other words, the first "wave" launches a proc on all nodes that have at
> least one proc on them. The second "wave" then launches another proc on all
> nodes that have at least two procs on them, but doesn't launch anything on
> any node that only has one proc on it.
> 
> My guess here is that the system for some reason is insisting that your head
> node be involved in every wave. I confess that we have never tested (to my
> knowledge) a mapping that involves "skipping" a node somewhere in the
> allocation - we always just map from the beginning of the node list, with
> the maximum number of procs being placed on the first nodes in the list
> (since in our machines, the nodes are all the same, so who cares?). So it is
> possible that something in the code objects to skipping around nodes in the
> allocation.
> 
> I will have to look and see where that dependency might lie - will try to
> get to it this week.
> 
> BTW: that patch I sent you for head node operations will be in 1.2.4.
> 
> Ralph
> 
> 
> 
> On 7/23/07 7:04 AM, "Kelley, Sean"  wrote:
> 
>> > Hi,
>> > 
>> >  We are experiencing a problem with the process allocation on our Open
>> MPI
>> > cluster. We are using Scyld 4.1 (BPROC), the OFED 1.2 Topspin Infiniband
>> > drivers, Open MPI 1.2.3 + patch (to run processes on the head node). The
>> > hardware consists of a head node and N blades on private ethernet and
>> > infiniband networks.
>> > 
>> > The command run for these tests is a simple MPI program (called 'hn') which
>> > prints out the rank and the hostname. The hostname for the head node is
>> 'head'
>> > and the compute nodes are '.0' ... '.9'.
>> > 
>> > We are using the following hostfiles for this example:
>> > 
>> > hostfile7
>> > -1 max_slots=1
>> > 0 max_slots=3
>> > 1 max_slots=3
>> > 
>> > hostfile8
>> > -1 max_slots=2
>> > 0 max_slots=3
>> > 1 max_slots=3
>> > 
>> > hostfile9
>> > -1 max_slots=3
>> > 0 max_slots=3
>> > 1 max_slots=3
>> > 
>> > running the following commands:
>> > 
>> > orterun --hostfile hostfile7 -np 7 ./hn
>> > orterun --hostfile hostfile8 -np 8 ./hn
>> > orterun --byslot --hostfile hostfile7 -np 7 ./hn
>> > orterun --byslot --hostfile hostfile8 -np 8 ./hn
>> > 
>> > causes orterun to crash. However,
>> > 
>> > orterun --hostfile hostfile9 -np 9 ./hn
>> > orterun --byslot --hostfile hostfile9 -np 9 ./hn
>> > 
>> > works, outputting the following:
>> > 
>> > 0 head
>> > 1 head
>> > 2 head
>> > 3 .0
>> > 4 .0
>> > 5 .0
>> > 6 .0
>> > 7 .0
>> > 8 .0
>> > 
>> > However, running the following:
>> > 
>> > orterun --bynode --hostfile hostfile7 -np 7 ./hn
>> > 
>> > works, outputting the following:
>> > 
>> > 0 head
>> > 1 .0
>> > 2 .1
>> > 3 .0
>> > 4 .1
>> > 5 .0
>> > 6 .1
>> > 
>> > Is the '--byslot' crash a known problem? Does it have something to do with
>> > BPROC? Thanks in advance for any assistance!
>> > 
>> > Sean
>> > 
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] orterun --bynode/--byslot problem

2007-07-23 Thread Kelley, Sean
Would this logic be in the bproc pls component?
Sean



From: users-boun...@open-mpi.org on behalf of Ralph H Castain
Sent: Mon 7/23/2007 9:18 AM
To: Open MPI Users 
Subject: Re: [OMPI users] orterun --bynode/--byslot problem



No, byslot appears to be working just fine on our bproc clusters (it is the
default mode). As you probably know, bproc is a little strange in how we
launch - we have to launch the procs in "waves" that correspond to the
number of procs on a node.

In other words, the first "wave" launches a proc on all nodes that have at
least one proc on them. The second "wave" then launches another proc on all
nodes that have at least two procs on them, but doesn't launch anything on
any node that only has one proc on it.

My guess here is that the system for some reason is insisting that your head
node be involved in every wave. I confess that we have never tested (to my
knowledge) a mapping that involves "skipping" a node somewhere in the
allocation - we always just map from the beginning of the node list, with
the maximum number of procs being placed on the first nodes in the list
(since in our machines, the nodes are all the same, so who cares?). So it is
possible that something in the code objects to skipping around nodes in the
allocation.

I will have to look and see where that dependency might lie - will try to
get to it this week.

BTW: that patch I sent you for head node operations will be in 1.2.4.

Ralph



On 7/23/07 7:04 AM, "Kelley, Sean"  wrote:

> Hi,
> 
>  We are experiencing a problem with the process allocation on our Open MPI
> cluster. We are using Scyld 4.1 (BPROC), the OFED 1.2 Topspin Infiniband
> drivers, Open MPI 1.2.3 + patch (to run processes on the head node). The
> hardware consists of a head node and N blades on private ethernet and
> infiniband networks.
> 
> The command run for these tests is a simple MPI program (called 'hn') which
> prints out the rank and the hostname. The hostname for the head node is 'head'
> and the compute nodes are '.0' ... '.9'.
> 
> We are using the following hostfiles for this example:
> 
> hostfile7
> -1 max_slots=1
> 0 max_slots=3
> 1 max_slots=3
> 
> hostfile8
> -1 max_slots=2
> 0 max_slots=3
> 1 max_slots=3
> 
> hostfile9
> -1 max_slots=3
> 0 max_slots=3
> 1 max_slots=3
> 
> running the following commands:
> 
> orterun --hostfile hostfile7 -np 7 ./hn
> orterun --hostfile hostfile8 -np 8 ./hn
> orterun --byslot --hostfile hostfile7 -np 7 ./hn
> orterun --byslot --hostfile hostfile8 -np 8 ./hn
> 
> causes orterun to crash. However,
> 
> orterun --hostfile hostfile9 -np 9 ./hn
> orterun --byslot --hostfile hostfile9 -np 9 ./hn
> 
> works, outputting the following:
> 
> 0 head
> 1 head
> 2 head
> 3 .0
> 4 .0
> 5 .0
> 6 .0
> 7 .0
> 8 .0
> 
> However, running the following:
> 
> orterun --bynode --hostfile hostfile7 -np 7 ./hn
> 
> works, outputting the following:
> 
> 0 head
> 1 .0
> 2 .1
> 3 .0
> 4 .1
> 5 .0
> 6 .1
> 
> Is the '--byslot' crash a known problem? Does it have something to do with
> BPROC? Thanks in advance for any assistance!
> 
> Sean
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Building OMPI with dated tools & libs

2007-07-23 Thread Jeff Pummill

Good morning all,

I have been very impressed so far with OpenMPI on one of our smaller 
clusters running Gnu compilers and Gig-E interconnects, so I am 
considering a build on our large cluster. The potential problem is that 
the compilers are Intel 8.1 versions and the Infiniband is supported by 
three year old Topspin (now Cisco) drivers and libraries. Basically, 
this is a cluster that runs a very heavy workload using MVAPICH, thus we 
have adopted the "if it ain't broke, don't fix it" methodology...thus 
all of the drivers, libraries, and compilers are approximately 3 years old.


Would it be reasonable to expect OpenMPI 1.2.3 to build and run in such 
an environment?


Thanks!

Jeff Pummill
University of Arkansas


Re: [OMPI users] sge qdel fails

2007-07-23 Thread Pak Lui

Hi Henk,

The sge script should not require any extra parameter. The qdel command
should send the kill signal to mpirun and also remove the SGE-allocated
tmp directory (something like /tmp/174.1.all.q/) which contains the
OMPI session dir for the running job, and in turn would cause orted and
the user processes to exit.


Maybe you could try qdel -f to force a delete from the sge_qmaster,
in case sge_execd does not respond to the delete request from the
sge_qmaster?


SLIM H.A. wrote:

I am using OpenMPI 1.2.3 with SGE 6.0u7 over InfiniBand (OFED 1.2),
following the recommendation in the OpenMPI FAQ

http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

The job runs but when the user wants to delete the job with the qdel
command, this fails. Does the mpirun command

mpirun -np $NSLOTS ./exe

in the sge script require extra parameters?

Thanks for any advice

Henk

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

- Pak Lui
pak@sun.com


Re: [OMPI users] mpi with icc, icpc and ifort :: segfault (Jeff Squyres)

2007-07-23 Thread Andrea
> > From: Jeff Squyres 
> >
> > Can you be a bit more specific than "it dies"? Are you talking about
> > mpif90/mpif77, or your app?
>
> Sorry, stupid me. When executing mpif90 or mpif77 I have a segfault and it
> doesn't compile. I've tried both with or without input (i.e., giving it
> something to compile or just executing it expecting to see the normal "no
> files given" kind of message). The intel suite compiled openmpi without
> problems.

Hello,

I have the same problem: when I try to run any MPI command (like mpicc, mpirun,
ompi_info, ...) I receive a "Segmentation fault". I've tried both openMPI
version 1.2.3 and version 1.2.4b0, but all I get is:

$ ompi_info --all
Segmentation fault

Some info on my system:

 - GNU/Linux, 2.6.22 Kernel, Slackware 12.0
 - Genuine Intel(R) CPU, T2400  @ 1.83GHz GenuineIntel (Toshiba A-100 Laptop)

 - Intel C Compiler 9.1.047
 - Intel Fortran Compiler 9.1.041

The configure script options I've used are:

--prefix=/usr CC=icc CXX=icpc F77=ifort FC=ifort

If you need more info just tell me.

Thank you for your attention.

Andrea



Re: [OMPI users] orterun --bynode/--byslot problem

2007-07-23 Thread Ralph H Castain
No, byslot appears to be working just fine on our bproc clusters (it is the
default mode). As you probably know, bproc is a little strange in how we
launch - we have to launch the procs in "waves" that correspond to the
number of procs on a node.

In other words, the first "wave" launches a proc on all nodes that have at
least one proc on them. The second "wave" then launches another proc on all
nodes that have at least two procs on them, but doesn't launch anything on
any node that only has one proc on it.
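
(In pseudocode, the wave scheme amounts to something like this -- my
sketch of the description above, not the actual bproc pls code; the
variable and helper names are made up:)

    /* Wave w launches one proc on every node that still has more
       than w procs left to place. */
    for (int wave = 0; wave < max_procs_per_node; wave++) {
        for (int n = 0; n < num_nodes; n++) {
            if (procs_wanted_on_node[n] > wave)
                launch_one_proc_on(n);   /* hypothetical helper */
        }
    }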

My guess here is that the system for some reason is insisting that your head
node be involved in every wave. I confess that we have never tested (to my
knowledge) a mapping that involves "skipping" a node somewhere in the
allocation - we always just map from the beginning of the node list, with
the maximum number of procs being placed on the first nodes in the list
(since in our machines, the nodes are all the same, so who cares?). So it is
possible that something in the code objects to skipping around nodes in the
allocation.

I will have to look and see where that dependency might lie - will try to
get to it this week.

BTW: that patch I sent you for head node operations will be in 1.2.4.

Ralph



On 7/23/07 7:04 AM, "Kelley, Sean"  wrote:

> Hi,
>  
>  We are experiencing a problem with the process allocation on our Open MPI
> cluster. We are using Scyld 4.1 (BPROC), the OFED 1.2 Topspin Infiniband
> drivers, Open MPI 1.2.3 + patch (to run processes on the head node). The
> hardware consists of a head node and N blades on private ethernet and
> infiniband networks.
>  
> The command run for these tests is a simple MPI program (called 'hn') which
> prints out the rank and the hostname. The hostname for the head node is 'head'
> and the compute nodes are '.0' ... '.9'.
>  
> We are using the following hostfiles for this example:
>  
> hostfile7
> -1 max_slots=1
> 0 max_slots=3
> 1 max_slots=3
>  
> hostfile8
> -1 max_slots=2
> 0 max_slots=3
> 1 max_slots=3
>  
> hostfile9
> -1 max_slots=3
> 0 max_slots=3
> 1 max_slots=3
>  
> running the following commands:
>  
> orterun --hostfile hostfile7 -np 7 ./hn
> orterun --hostfile hostfile8 -np 8 ./hn
> orterun --byslot --hostfile hostfile7 -np 7 ./hn
> orterun --byslot --hostfile hostfile8 -np 8 ./hn
>  
> causes orterun to crash. However,
>  
> orterun --hostfile hostfile9 -np 9 ./hn
> orterun --byslot --hostfile hostfile9 -np 9 ./hn
>  
> works, outputting the following:
>  
> 0 head
> 1 head
> 2 head
> 3 .0
> 4 .0
> 5 .0
> 6 .0
> 7 .0
> 8 .0
>  
> However, running the following:
>  
> orterun --bynode --hostfile hostfile7 -np 7 ./hn
>  
> works, outputting the following:
>  
> 0 head
> 1 .0
> 2 .1
> 3 .0
> 4 .1
> 5 .0
> 6 .1
>  
> Is the '--byslot' crash a known problem? Does it have something to do with
> BPROC? Thanks in advance for any assistance!
>  
> Sean
>  
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Problems starting mpi program via a system call from within a mpi program

2007-07-23 Thread Per Madsen

Hi

I am in the process of moving a parallel program from our old 32-bit
(Xeon @ 2.8 GHz) Linux cluster to a new EM64T-based (Intel Xeon 5160 @
3.00GHz) Linux cluster.

The OS on the old cluster is Red Hat 9; the new cluster runs Fedora 7.

I have installed the Intel Fortran compiler version 10.0 and openmpi-1.2.3.

I configured openmpi with "--prefix=/opt/openmpi F77=ifort FC=ifort".
config.log and the output from ompi_info --all are in the attached files.



/opt/ is mounted on all nodes in the cluster.

The program causing me problems is one that solves two large
interrelated systems of equations (+200,000,000 eq.) using PCG
iteration. The program starts to iterate on the first system until a
certain degree of convergence is reached; then the master node executes
a shell script which starts the parallel solver on the second system.
Again the iteration is continued until a certain degree of convergence,
and some parameters from solving the second system are stored in
different files. After the solving of the second system, the stored
parameters are used in the solver for the first system. Both before and
after the master node makes the system call, the nodes are synchronized
via calls of MPI_BARRIER.

This setup has worked fine on the old cluster, but on the new cluster
the system call does not start the parallel solver for the second
system. The solver program is very complex, so I have made some small
Fortran programs and shell scripts that illustrate the problem.

The setup is as follows:

mpi_main starts MPI on a number of nodes and checks that the nodes are
alive. The master then executes the shell script serial.sh via a system
call, which starts a serial Fortran program (serial_subprog). After
returning from the system call, the master executes the shell script
mpi.sh. This script tries to start mpi_subprog via mpirun.
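
(Reduced to its core, the master's logic is the following -- a C sketch
of what the Fortran test programs do, using the file names above:)

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) {
            system("./serial.sh");   /* works: serial program starts */
            system("./mpi.sh");      /* fails: nested mpirun never starts */
        }
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }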

I have used mpif90 to compile the mpi programs and ifort to compile the serial 
program.

mpi_main starts as expected, and the call of serial.sh starts the
serial program as expected. However, the system call to execute mpi.sh
does not start mpi_subprog.

The Fortran programs and scripts are in the attached file test.tar.gz. 


When I run the setup via:
 
mpirun -np 4 -hostfile nodelist ./mpi_main 

I get the following:

MPI_INIT return code:0
 MPI_INIT return code:0
 MPI_COMM_RANK return code:0
 MPI_COMM_SIZE return code:0
 Process1  of2  is alive - Hostname= c01b04
   1  :   19
 MPI_COMM_RANK return code:0
 MPI_COMM_SIZE return code:0
 Process0  of2  is alive - Hostname= c01b05
   0  :   19
 MYID:1  MPI_REDUCE 1 red_chk_sum=   0  rc=   0
 MYID:0  MPI_REDUCE 1 red_chk_sum=   2  rc=   0
 MYID:1  MPI_BARRIER 1 RC=0
 MYID:0  MPI_BARRIER 1 RC=0

 Master will now execute the shell script serial.sh

This is from serial.sh

 We are now in the serial subprogram

 Master back from the shell script serial.sh
 IERR=0

 Master will now execute the shell script mpi.sh

This is from mpi.sh
/nav/denmark/navper19/mpi_test
[c01b05.ctrl.ghpc.dk:25337] OOB: Connection to HNP lost

 Master back from the shell script mpi.sh
 IERR=0

 MYID:0  MPI_BARRIER 2 RC=0
 MYID:0  MPI_REDUCE 2 red_chk_sum=  20  rc=   0
 MYID:1  MPI_BARRIER 2 RC=0
 MYID:1  MPI_REDUCE 2 red_chk_sum=   0  rc=   0

As you can see, the execution of the serial program works, while the
mpi program is not started.

I have checked that mpirun is in the PATH in the shell started by the
system call, and I have checked that the mpi.sh script works if it is
executed from the command prompt. Output from a run with mpirun options
-v -d is in the attached file test.tar.gz.

Is there anyone out there who has tried to do something similar?

Regards

Per Madsen
Senior scientist

   
 AARHUS UNIVERSITET / UNIVERSITY OF AARHUS 
Det Jordbrugsvidenskabelige Fakultet / Faculty of Agricultural Sciences
Forskningscenter Foulum / Research Centre Foulum   
Genetik og Bioteknologi / Dept. of Genetics and Biotechnology  
Blichers Allé 20, P.O. BOX 50  
DK-8830 Tjele  
   






config.log.gz
Description: config.log.gz

[OMPI users] Performance tuning: focus on latency

2007-07-23 Thread Biagio Cosenza

Hello,
I'm working on a parallel real-time renderer: an embarrassingly parallel
problem where latency is the main barrier to high performance.

Two observations:

1) I did a simple "ping-pong" test (the master does a Bcast + an IRecv for
each node + a Waitall) similar to the effective renderer workload. Using a
cluster of 37 nodes on Gigabit Ethernet, the latency is usually low (about
1-5 ms), but sometimes there are peaks of about 200 ms. I suspect the cause
is a packet retransmission on one of the 37 connections, which blows the
overall performance of the test (of course, the final Waitall is a synch).
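
(For concreteness, one round of the test pattern -- a minimal sketch,
assuming a small fixed-size payload:)

    #include <mpi.h>

    /* One "frame": the master broadcasts work, then collects one
       reply per worker and waits for all of them. */
    void ping_pong_round(int rank, int nprocs, char *buf, int len,
                         char *replies, MPI_Request *reqs)
    {
        MPI_Bcast(buf, len, MPI_BYTE, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            for (int i = 1; i < nprocs; i++)
                MPI_Irecv(replies + i * len, len, MPI_BYTE, i, 0,
                          MPI_COMM_WORLD, &reqs[i - 1]);
            MPI_Waitall(nprocs - 1, reqs, MPI_STATUSES_IGNORE);
        } else {
            MPI_Send(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }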

2) A research team argues in a paper that MPI copes poorly with managing
latency dynamically. They also raise an interesting point about
enabling/disabling the Nagle algorithm. (I paste the relevant paragraph
below.)


So I have two questions:

1) Why does my test have these peaks? How can I address them (I am
thinking of the btl tcp parameters)?

2) When does OpenMPI disable the Nagle algorithm? Supposing I DON'T need
Nagle to be ON (focusing only on latency), how can I increase performance?

Any useful suggestion will be REALLY appreciated.

Thanks in advance,
Biagio Cosenza

-
cut from "Interactive Ray Tracing on Commodity PC clusters"
Saarland University, Germany

"... Communication Method: For handling communication, most parallel
processing systems today use standardized libraries such as MPI [8] or PVM
[10]. Although these libraries provide very powerful tools for development
of distributed software, they do not meet the efficiency requirements that
we face in an interactive environment.
Therefore, we had to implement all communication from scratch with standard
UNIX TCP/IP calls. Though this requires significant efforts, it allows to
extract the maximum performance out of the network. For example, consider
the 'Nagle' optimization implemented in the TCP/IP protocol, which delays
small packets for a short time period to possibly combine them with
successive packets to generate network-friendly packet sizes. This
optimization can result in a better throughput when lots of
small packets are sent, but can also lead to considerable latencies, if a
packet gets delayed several times. Direct control of the systems
communication allows to use such optimizations selectively: For example, we
turn the Nagle optimization on for sockets in which updated scene data is
streamed to the clients, as throughput is the main issue here. On the other
hand, we turn it off for e.g. sockets used to send tiles to the clients, as
this has to be done with an absolute minimum of latency. A similar behavior
would be hard to achieve with standard communication libraries. ..."
-


[OMPI users] EuroPVM/MPI'07 -- Call for Participation

2007-07-23 Thread Derrick Kondo

Call for Participation: EuroPVM/MPI'07

  http://www.pvmmpi07.org

Please join us for the 14th European PVM/MPI Users' Group
conference, which will be held in Paris, France from
September 30 to October 3. This conference is a forum for
the discussion and presentation of recent advances and major
challenges in Message Passing programming of clusters and
other parallel machines.

The conference will feature six keynote talks from pioneers
and global leaders of message passing and parallel machines,
namely:

Tony Hey, Microsoft Research, USA
Al Geist, Oak Ridge National Laboratory, USA
Ewing Lusk, Argonne National Laboratory, USA
Satoshi Matsuoka, Tokyo Institute of Technology, Japan
Bernd Mohr, Central Institute for Applied Mathematics, Germany
George Bosilca, University of Tennessee, USA

Afterwards, there will be an open forum where attendees can
discuss recent modifications to the message passing
standards and future directions.  Also, the conference is a
unique opportunity to meet the major developers and
designers of communication libraries for HPC (such as PVM
and MPI) and the major high-speed network interface builders
to shape future research and development.

The conference program and registration information can be
found at:

  http://www.pvmmpi07.org

Register soon to take advantage of the discount rates
offered by the conference hotels.

PC Chairs of EuroPVM/MPI'07:
Thomas Herault, University of Paris Sud-XI / INRIA Futurs, France
Franck Cappello, INRIA Futurs, France


[OMPI users] sge qdel fails

2007-07-23 Thread SLIM H.A.

I am using OpenMPI 1.2.3 with SGE 6.0u7 over InfiniBand (OFED 1.2),
following the recommendation in the OpenMPI FAQ

http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

The job runs but when the user wants to delete the job with the qdel
command, this fails. Does the mpirun command

mpirun -np $NSLOTS ./exe

in the sge script require extra parameters?

Thanks for any advice

Henk