Re: [OMPI users] Performance tuning: focus on latency
On Jul 23, 2007, at 6:43 AM, Biagio Cosenza wrote:

> I'm working on a parallel real-time renderer: an embarrassingly parallel problem where latency is the threshold to high performance. Two observations:
>
> 1) I did a simple "ping-pong" test (the master does a Bcast + an IRecv for each node + a Waitall) similar to the effective renderer workload. Using a cluster of 37 nodes on Gigabit Ethernet, it seems that the latency is usually low (about 1-5 ms), but sometimes there are peaks of about 200 ms. I think the cause is a packet retransmission on one of the 37 connections, which blows the overall performance of the test (of course, the final Waitall is a synch).
>
> 2) A research team argues in a paper that MPI suffers at dynamically managing latency. They also raise an interesting problem about enabling/disabling the Nagle algorithm. (I paste the interesting paragraph below.)
>
> So I have two questions:
>
> 1) Why does my test have these peaks? How can I address them (I am thinking of the btl tcp params)?

They are probably beyond Open MPI's control -- OMPI mainly does read() and write() down TCP sockets and relies on the kernel to do all the low-level TCP protocol / wire transmission stuff. You might want to try increasing your TCP buffer sizes, but I think that the Linux kernel has some built-in limits. Other experts might want to chime in here...

> 2) When does Open MPI disable the Nagle algorithm? Suppose I DON'T need Nagle to be ON (focusing only on latency), how can I increase performance?

It looks like we enable Nagle right when TCP BTL connections are made. Surprisingly, it looks like we don't have a run-time option to turn it off for power-users like you who want to really tweak around.

If you want to play with it, please edit ompi/mca/btl/tcp/btl_tcp_endpoint.c. You'll see the references to TCP_NODELAY in conjunction with setsockopt(); a minimal sketch of that call follows this message. Set the optval to 0 instead of 1. A simple "make install" in that directory will recompile the TCP component and re-install it (assuming you have done a default build with OMPI components built as standalone plugins).

Let us know what you find.

--
Jeff Squyres
Cisco Systems
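[For reference, the setsockopt() call in question looks roughly like the standalone sketch below; this is an illustration, not the actual btl_tcp_endpoint.c code. Note the semantics: an optval of 1 turns TCP_NODELAY on (Nagle off), and an optval of 0 turns it back off (Nagle on).]

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <stdio.h>

    /* Toggle Nagle on an existing TCP socket descriptor sd.
     * optval = 1 sets TCP_NODELAY (Nagle disabled);
     * optval = 0 clears it (Nagle enabled). */
    static int set_tcp_nodelay(int sd, int optval)
    {
        if (setsockopt(sd, IPPROTO_TCP, TCP_NODELAY,
                       &optval, sizeof(optval)) < 0) {
            perror("setsockopt(TCP_NODELAY)");
            return -1;
        }
        return 0;
    }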
Re: [OMPI users] sge qdel fails
Hi,

running conventional TCP/IP all is safe AFAICS - all processes will be killed on all involved nodes. The problem arises with OFED, with which we also see this behavior using MVAPICH. Unfortunately we have only a limited number of nodes with InfiniBand, and hence time to test and develop something is highly limited, as users running applications there take priority.

On 23.07.2007, at 21:29, Pak Lui wrote:

> Hi Henk,
>
> SLIM H.A. wrote:
>> Dear Pak Lui
>> I can delete the (sge) job with qdel -f such that it disappears from the job list, but the application processes keep running, including the shepherds. I have to kill them with -15. For some reason the kill -15 does not reach mpirun. (We use such a parameter to mpirun on our myrinet mx nodes with mpich, that's why I asked.)
>
> I believe qdel would send a SIGKILL to mpirun

Correct, and it's sent to the complete process group which qrsh-starter spawns, i.e. "kill -9 -- -processgroup_id".

> instead of a SIGTERM (-15), that is why you don't see the signal reach mpirun. Since there is no way to catch a SIGKILL, that may be why the orted and the processes keep running.

In a Tightly Integrated parallel environment, there shouldn't be any need to catch such a signal. SGE will kill all started processes on its own - no further action necessary.

> Hmm, this actually reminds me of a related problem: the qsub -notify option does not work as intended under ORTE. The qsub -notify option is supposed to send a SIGUSR2 to mpirun and the processes for an impending SIGKILL N seconds before it actually happens. However, we don't catch the SIGUSR2 signal in ORTE specifically for SGE (or the gridengine modules), therefore the user would see mpirun and orted exit before the user apps can catch the SIGUSR2 signal. I should file a trac bug against this SGE feature we don't yet support and fix it sometime in the future.

As SIGUSR2 is sent to the complete process group (and keep in mind: also to the job script itself), it would just mean ignoring SIGUSR1/2 in orted (and maybe in mpirun, otherwise it must also be trapped there); a sketch of this appears after this message. So it could be included in the action for the --no-daemonize option given to orted when running under SGE. For now you would also need this in the job script:

#!/bin/sh
trap '' usr2
export PATH=/home/reuti/openmpi-1.2.3/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH${LD_LIBRARY_PATH:+:}/home/reuti/openmpi-1.2.3/lib
(trap '' usr2; exec mpirun -np $NSLOTS /home/reuti/mpihello)

-- Reuti

> So back to your problem. Although this is unintended, maybe you can try to run the job with qsub -notify in the meantime until we fix the above, since it will send a SIGUSR2 to mpirun, which should terminate mpirun, orted and the user processes in a way that is more graceful than qdel (or SIGKILL), because SIGKILL would not allow orted to kill off the user processes, as SIGTERM or SIGUSR1/2 would.
>
>> Just to confirm, there is no configure directive specific to gridengine when building openmpi?
>
> Right, there aren't any configure directives currently.
>
>> Thanks
>> henk
>>
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Pak Lui
>> Sent: 23 July 2007 15:16
>> To: Open MPI Users
>> Subject: Re: [OMPI users] sge qdel fails
>>
>> Hi Henk,
>>
>> The sge script should not require any extra parameter.
>> [...]
>
> --
> - Pak Lui
> pak@sun.com
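[For illustration, the orted-side change Reuti suggests above - ignoring SIGUSR1/2 so the daemon survives SGE's -notify warning signals - amounts to something like the following. This is a hypothetical C sketch, not the actual orted code; the function name is invented.]

    #include <signal.h>
    #include <stdio.h>

    /* Ignore SGE's -notify warning signals so the daemon is not
     * killed by the default SIGUSR1/SIGUSR2 disposition. Children
     * that want to catch SIGUSR2 must install their own handler,
     * since an ignored disposition is inherited across fork/exec. */
    int ignore_notify_signals(void)
    {
        if (signal(SIGUSR1, SIG_IGN) == SIG_ERR ||
            signal(SIGUSR2, SIG_IGN) == SIG_ERR) {
            perror("signal");
            return -1;
        }
        return 0;
    }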
Re: [OMPI users] Building OMPI with dated tools & libs
It *should* work. We stopped developing for the Cisco (mVAPI) stack a while ago, but as far as we know, it still works fine. See:

http://www.open-mpi.org/faq/?category=openfabrics#vapi-support

That being said, your approach of "if it ain't broke, don't fix it" is certainly quite reasonable.

On Jul 23, 2007, at 4:51 PM, Jeff Pummill wrote:

> Hmmm...compilation SEEMED to go OK with the following ./configure...
>
> ./configure --prefix=/nfsutil/openmpi-1.2.3 --with-mvapi=/usr/local/topspin/ CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64
>
> And the following looks promising...
>
> ./ompi_info | grep mvapi
> MCA btl: mvapi (MCA v1.0, API v1.0.1, Component v1.2.3)
>
> I have a post-doc that will test some application code in the next day or so. Maybe the old stuff worked just fine!
>
> Jeff F. Pummill
> Senior Linux Cluster Administrator
> University of Arkansas
> Fayetteville, Arkansas 72701
>
> Jeff Pummill wrote:
>> Good morning all,
>> I have been very impressed so far with OpenMPI on one of our smaller clusters running GNU compilers and Gig-E interconnects, so I am considering a build on our large cluster. The potential problem is that the compilers are Intel 8.1 versions and the InfiniBand is supported by three-year-old Topspin (now Cisco) drivers and libraries. Basically, this is a cluster that runs a very heavy workload using MVAPICH, so we have adopted the "if it ain't broke, don't fix it" methodology; thus all of the drivers, libraries, and compilers are approximately 3 years old.
>> Would it be reasonable to expect OpenMPI 1.2.3 to build and run in such an environment?
>> Thanks!
>> Jeff Pummill
>> University of Arkansas

--
Jeff Squyres
Cisco Systems
[OMPI users] MPI_HOME
openmpi-1.2.3 compiled on Debian Linux amd64 (etch) with:

./configure CC=/opt/intel/cce/9.1.042/bin/icc CXX=/opt/intel/cce/9.1.042/bin/icpc F77=/opt/intel/fce/9.1.036/bin/ifort FC=/opt/intel/fce/9.1.036/bin/ifort --with-libnuma=/usr/lib

ompi_info | grep libnuma
ompi_info | grep maffinity

reported OK. However, an attempt to install Amber9 parallel, with

./configure -openmpi ifort_x86_64

reported: "Error, MPI_HOME must be set." OK, for my installation and bash it should be

export MPI_HOME=/usr/local/openmpi-1.2.3

Not tried, because the above error message also contained: "Set it where the location of the include/ and lib/ subdirectories containing mpi.f libmpi.a liblam.a liblamf77mpi.a", which was confusing to me. None of these libraries are on my system, and I never advocated LAM.

Thanks for helping
francesco pietra
Re: [OMPI users] sge qdel fails
Hi Henk,

SLIM H.A. wrote:
> Dear Pak Lui
> I can delete the (sge) job with qdel -f such that it disappears from the job list, but the application processes keep running, including the shepherds. I have to kill them with -15. For some reason the kill -15 does not reach mpirun. (We use such a parameter to mpirun on our myrinet mx nodes with mpich, that's why I asked.)

I believe qdel would send a SIGKILL to mpirun instead of a SIGTERM (-15); that is why you don't see the signal reach mpirun. Since there is no way to catch a SIGKILL, that may be why the orted and the processes keep running.

Hmm, this actually reminds me of a related problem: the qsub -notify option does not work as intended under ORTE. The qsub -notify option is supposed to send a SIGUSR2 to mpirun and the processes for an impending SIGKILL N seconds before it actually happens. However, we don't catch the SIGUSR2 signal in ORTE specifically for SGE (or the gridengine modules), therefore the user would see mpirun and orted exit before the user apps can catch the SIGUSR2 signal. I should file a trac bug against this SGE feature we don't yet support and fix it sometime in the future.

So back to your problem. Although this is unintended, maybe you can try to run the job with qsub -notify in the meantime until we fix the above, since it will send a SIGUSR2 to mpirun, which should terminate mpirun, orted and the user processes in a way that is more graceful than qdel (or SIGKILL), because SIGKILL would not allow orted to kill off the user processes, as SIGTERM or SIGUSR1/2 would.

> Just to confirm, there is no configure directive specific to gridengine when building openmpi?

Right, there aren't any configure directives currently.

> Thanks
> henk
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Pak Lui
> Sent: 23 July 2007 15:16
> To: Open MPI Users
> Subject: Re: [OMPI users] sge qdel fails
> [...]

--
- Pak Lui
pak@sun.com
Re: [OMPI users] sge qdel fails
Dear Pak Lui

I can delete the (sge) job with qdel -f such that it disappears from the job list, but the application processes keep running, including the shepherds. I have to kill them with -15. For some reason the kill -15 does not reach mpirun. (We use such a parameter to mpirun on our myrinet mx nodes with mpich, that's why I asked.)

Just to confirm, there is no configure directive specific to gridengine when building openmpi?

Thanks
henk

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Pak Lui
> Sent: 23 July 2007 15:16
> To: Open MPI Users
> Subject: Re: [OMPI users] sge qdel fails
>
> Hi Henk,
>
> The sge script should not require any extra parameter.
> [...]
Re: [OMPI users] MPI_File_set_view rejecting subarray views.
Thanks, Brian. That did the trick.

-Ken

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Brian Barrett
> Sent: Thursday, July 19, 2007 3:39 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_File_set_view rejecting subarray views.
>
> On Jul 19, 2007, at 3:24 PM, Moreland, Kenneth wrote:
>
>> I've run into a problem with the file I/O in Open MPI version 1.2.3. It is not possible to call MPI_File_set_view with a datatype created from a subarray. Instead of letting me set a view of this type, it gives an invalid datatype error. I have attached a simple program that demonstrates the problem. In particular, the following sequence of function calls should be supported, but it is not:
>>
>> MPI_Type_create_subarray(3, sizes, subsizes, starts,
>>                          MPI_ORDER_FORTRAN, MPI_BYTE, &view);
>> MPI_File_set_view(fd, 20, MPI_BYTE, view, "native", MPI_INFO_NULL);
>>
>> After poking around in the source code a bit, I discovered that the I/O implementation actually supports the subarray datatype, but there is a check that issues an error before the underlying I/O layer (ROMIO) has a chance to handle the request.
>
> You need to commit the datatype after calling MPI_Type_create_subarray. If you add:
>
>     MPI_Type_commit(&view);
>
> after the Type_create, but before File_set_view, the code will run to completion.
>
> Well, the code will then complain about a Barrier after MPI_Finalize due to an error in how we shut down when there are files that have been opened but not closed (you should also add a call to MPI_File_close after the set_view, but I'm assuming it's not there because this is test code). This is something we need to fix, but it also signifies a user error.
>
> Brian
>
> --
> Brian W. Barrett
> Networking Team, CCS-1
> Los Alamos National Laboratory
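[Putting Brian's fixes together, the corrected sequence looks roughly like the sketch below. The array dimensions, file name, and displacement are invented for illustration.]

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int sizes[3]    = {10, 10, 10};
        int subsizes[3] = { 5,  5,  5};
        int starts[3]   = { 0,  0,  0};
        MPI_Datatype view;
        MPI_File fd;

        MPI_Init(&argc, &argv);
        MPI_Type_create_subarray(3, sizes, subsizes, starts,
                                 MPI_ORDER_FORTRAN, MPI_BYTE, &view);
        MPI_Type_commit(&view);                   /* the missing step */
        MPI_File_open(MPI_COMM_WORLD, "data.bin",
                      MPI_MODE_RDWR | MPI_MODE_CREATE,
                      MPI_INFO_NULL, &fd);
        MPI_File_set_view(fd, 20, MPI_BYTE, view, "native",
                          MPI_INFO_NULL);
        /* ... I/O on the view goes here ... */
        MPI_File_close(&fd);     /* avoids the shutdown complaint */
        MPI_Type_free(&view);
        MPI_Finalize();
        return 0;
    }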
Re: [OMPI users] orterun --bynode/--byslot problem
Yes...it would indeed.

On 7/23/07 9:03 AM, "Kelley, Sean" wrote:

> Would this logic be in the bproc pls component?
> Sean
>
> From: users-boun...@open-mpi.org on behalf of Ralph H Castain
> Sent: Mon 7/23/2007 9:18 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] orterun --bynode/--byslot problem
>
> No, byslot appears to be working just fine on our bproc clusters (it is the default mode).
> [...]
Re: [OMPI users] orterun --bynode/--byslot problem
Would this logic be in the bproc pls component?

Sean

From: users-boun...@open-mpi.org on behalf of Ralph H Castain
Sent: Mon 7/23/2007 9:18 AM
To: Open MPI Users
Subject: Re: [OMPI users] orterun --bynode/--byslot problem

No, byslot appears to be working just fine on our bproc clusters (it is the default mode). As you probably know, bproc is a little strange in how we launch - we have to launch the procs in "waves" that correspond to the number of procs on a node.
[...]
[OMPI users] Building OMPI with dated tools & libs
Good morning all,

I have been very impressed so far with OpenMPI on one of our smaller clusters running GNU compilers and Gig-E interconnects, so I am considering a build on our large cluster. The potential problem is that the compilers are Intel 8.1 versions and the InfiniBand is supported by three-year-old Topspin (now Cisco) drivers and libraries.

Basically, this is a cluster that runs a very heavy workload using MVAPICH, so we have adopted the "if it ain't broke, don't fix it" methodology; thus all of the drivers, libraries, and compilers are approximately 3 years old.

Would it be reasonable to expect OpenMPI 1.2.3 to build and run in such an environment?

Thanks!

Jeff Pummill
University of Arkansas
Re: [OMPI users] sge qdel fails
Hi Henk,

The sge script should not require any extra parameter. The qdel command should send the kill signal to mpirun and also remove the SGE-allocated tmp directory (something like /tmp/174.1.all.q/) which contains the OMPI session dir for the running job, and in turn would cause orted and the user processes to exit.

Maybe you could try qdel -f to force delete from the sge_qmaster, in case sge_execd does not respond to the delete request from the sge_qmaster?

SLIM H.A. wrote:
> I am using OpenMPI 1.2.3 with SGE 6.0u7 over InfiniBand (OFED 1.2), following the recommendation in the OpenMPI FAQ
> http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
> The job runs but when the user wants to delete the job with the qdel command, this fails. Does the mpirun command
> mpirun -np $NSLOTS ./exe
> in the sge script require extra parameters?
> Thanks for any advice
> Henk

--
- Pak Lui
pak@sun.com
Re: [OMPI users] mpi with icc, icpc and ifort :: segfault (Jeff Squyres)
> From: Jeff Squyres
>
>> Can you be a bit more specific than "it dies"? Are you talking about mpif90/mpif77, or your app?
>
> Sorry, stupid me. When executing mpif90 or mpif77 I get a segfault and it doesn't compile. I've tried both with and without input (i.e., giving it something to compile or just executing it and expecting to see the normal "no files given" kind of message). The Intel suite compiled Open MPI without problems.

Hello,

I have the same problem: when I try to run any MPI command (like mpicc, mpirun, ompi_info, ...) I receive a "Segmentation Fault". I've tried both openMPI version 1.2.3 and version 1.2.4b0, but all I get is:

$ ompi_info --all
Segmentation fault

Some info on my system:
- GNU/Linux, 2.6.22 kernel, Slackware 12.0
- Genuine Intel(R) CPU, T2400 @ 1.83GHz GenuineIntel (Toshiba A-100 laptop)
- Intel C Compiler 9.1.047
- Intel Fortran Compiler 9.1.041

The configure script options I've used are:

--prefix=/usr CC=icc CXX=icpc F77=ifort FC=ifort

If you need more info just tell me. Thank you for your attention.

Andrea
Re: [OMPI users] orterun --bynode/--byslot problem
No, byslot appears to be working just fine on our bproc clusters (it is the default mode). As you probably know, bproc is a little strange in how we launch - we have to launch the procs in "waves" that correspond to the number of procs on a node.

In other words, the first "wave" launches a proc on all nodes that have at least one proc on them. The second "wave" then launches another proc on all nodes that have at least two procs on them, but doesn't launch anything on any node that only has one proc on it. (A toy sketch of this scheme appears after this message.)

My guess here is that the system for some reason is insisting that your head node be involved in every wave. I confess that we have never tested (to my knowledge) a mapping that involves "skipping" a node somewhere in the allocation - we always just map from the beginning of the node list, with the maximum number of procs being placed on the first nodes in the list (since in our machines, the nodes are all the same, so who cares?). So it is possible that something in the code objects to skipping around nodes in the allocation.

I will have to look and see where that dependency might lie - will try to get to it this week.

BTW: that patch I sent you for head node operations will be in 1.2.4.

Ralph

On 7/23/07 7:04 AM, "Kelley, Sean" wrote:

> Hi,
>
> We are experiencing a problem with the process allocation on our Open MPI cluster. We are using Scyld 4.1 (BPROC), the OFED 1.2 Topspin Infiniband drivers, and Open MPI 1.2.3 + patch (to run processes on the head node). The hardware consists of a head node and N blades on private ethernet and infiniband networks.
>
> The command run for these tests is a simple MPI program (called 'hn') which prints out the rank and the hostname. The hostname for the head node is 'head' and the compute nodes are '.0' ... '.9'.
>
> We are using the following hostfiles for this example:
>
> hostfile7
> -1 max_slots=1
> 0 max_slots=3
> 1 max_slots=3
>
> hostfile8
> -1 max_slots=2
> 0 max_slots=3
> 1 max_slots=3
>
> hostfile9
> -1 max_slots=3
> 0 max_slots=3
> 1 max_slots=3
>
> Running the following commands:
>
> orterun --hostfile hostfile7 -np 7 ./hn
> orterun --hostfile hostfile8 -np 8 ./hn
> orterun --byslot --hostfile hostfile7 -np 7 ./hn
> orterun --byslot --hostfile hostfile8 -np 8 ./hn
>
> causes orterun to crash. However,
>
> orterun --hostfile hostfile9 -np 9 ./hn
> orterun --byslot --hostfile hostfile9 -np 9 ./hn
>
> work, outputting the following:
>
> 0 head
> 1 head
> 2 head
> 3 .0
> 4 .0
> 5 .0
> 6 .0
> 7 .0
> 8 .0
>
> However, running the following:
>
> orterun --bynode --hostfile hostfile7 -np 7 ./hn
>
> works, outputting the following:
>
> 0 head
> 1 .0
> 2 .1
> 3 .0
> 4 .1
> 5 .0
> 6 .1
>
> Is the '--byslot' crash a known problem? Does it have something to do with BPROC? Thanks in advance for any assistance!
>
> Sean
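[To make the "wave" scheme concrete, here is a toy illustration - a hypothetical sketch, not the actual bproc launcher code - assuming a per-node proc count taken from a hostfile9-style mapping:]

    #include <stdio.h>

    /* Toy illustration of the "wave" launch scheme: wave w starts one
     * proc on every node that has at least w procs mapped to it.
     * Node names and per-node counts mimic the hostfile9 mapping. */
    int main(void)
    {
        const char *node[] = {"head", ".0", ".1"};
        const int procs[]  = {3, 3, 3};   /* procs mapped per node */
        const int nnodes = 3, maxprocs = 3;

        for (int wave = 1; wave <= maxprocs; wave++)
            for (int i = 0; i < nnodes; i++)
                if (procs[i] >= wave)
                    printf("wave %d: launch one proc on %s\n",
                           wave, node[i]);
        return 0;
    }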
[OMPI users] Problems starting mpi program via a system call from within a mpi program
Hi

I am in the process of moving a parallel program from our old 32-bit (Xeon @ 2.8 GHz) Linux cluster to a new EM64T (Intel Xeon 5160 @ 3.00GHz) based Linux cluster. The OS is Redhat 9 on the old cluster and Fedora 7 on the new one. I have installed the Intel Fortran compiler version 10.0 and openmpi-1.2.3. I configured openmpi with --prefix=/opt/openmpi F77=ifort FC=ifort. config.log and the output from ompi_info --all are in the attached files. /opt/ is mounted on all nodes in the cluster.

The program causing me problems is a program that solves two large interrelated systems of equations (+200,000,000 eq.) using PCG iteration. The program iterates on the first system until a certain degree of convergence is reached, then the master node executes a shell script which starts the parallel solver on the second system. Again the iteration is continued until a certain degree of convergence; some parameters from solving the second system are stored in different files. After the solving of the second system, the stored parameters are used in the solver for the first system. Both before and after the master node makes the system call, the nodes are synchronized via calls of MPI_BARRIER.

This setup has worked fine on the old cluster, but on the new cluster the system call does not start the parallel solver for the second system.

The solver program is very complex, so I have made some small Fortran programs and shell scripts that illustrate the problem. The setup is as follows: mpi_main starts MPI on a number of nodes and checks that the nodes are alive. The master then executes the shell script serial.sh via a system call, which starts a serial Fortran program (serial_subprog). After return from the system call, the master executes the shell script mpi.sh. This script tries to start mpi_subprog via mpirun. I have used mpif90 to compile the MPI programs and ifort to compile the serial program.

mpi_main starts as expected, and the call of serial.sh starts the serial program as expected. However, the system call to execute mpi.sh does not start mpi_subprog. The Fortran programs and scripts are in the attached file test.tar.gz.

When I run the setup via:

mpirun -np 4 -hostfile nodelist ./mpi_main

I get the following:

MPI_INIT return code: 0
MPI_INIT return code: 0
MPI_COMM_RANK return code: 0
MPI_COMM_SIZE return code: 0
Process 1 of 2 is alive - Hostname= c01b04
1 : 19
MPI_COMM_RANK return code: 0
MPI_COMM_SIZE return code: 0
Process 0 of 2 is alive - Hostname= c01b05
0 : 19
MYID: 1 MPI_REDUCE 1 red_chk_sum= 0 rc= 0
MYID: 0 MPI_REDUCE 1 red_chk_sum= 2 rc= 0
MYID: 1 MPI_BARRIER 1 RC= 0
MYID: 0 MPI_BARRIER 1 RC= 0
Master will now execute the shell script serial.sh
This is from serial.sh
We are now in the serial subprogram
Master back from the shell script serial.sh IERR= 0
Master will now execute the shell script mpi.sh
This is from mpi.sh
/nav/denmark/navper19/mpi_test
[c01b05.ctrl.ghpc.dk:25337] OOB: Connection to HNP lost
Master back from the shell script mpi.sh IERR= 0
MYID: 0 MPI_BARRIER 2 RC= 0
MYID: 0 MPI_REDUCE 2 red_chk_sum= 20 rc= 0
MYID: 1 MPI_BARRIER 2 RC= 0
MYID: 1 MPI_REDUCE 2 red_chk_sum= 0 rc= 0

As you can see, the execution of the serial program works, while the MPI program is not started. I have checked that mpirun is in the PATH in the shell started by the system call, and I have checked that the mpi.sh script works if it is executed from the command prompt. Output from a run with mpirun options -v -d is in the attached file test.tar.gz.
Is there anyone out there who has tried to do something similar?

Regards
Per Madsen
Senior scientist

AARHUS UNIVERSITET / UNIVERSITY OF AARHUS
Det Jordbrugsvidenskabelige Fakultet / Faculty of Agricultural Sciences
Forskningscenter Foulum / Research Centre Foulum
Genetik og Bioteknologi / Dept. of Genetics and Biotechnology
Blichers Allé 20, P.O. BOX 50
DK-8830 Tjele
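[Per's test pattern, reduced to its essentials, looks roughly like the sketch below - a minimal C rendering, though the original programs are Fortran; the script name mpi.sh is from his description, and mpi.sh itself is expected to run "mpirun -np ... ./mpi_subprog".]

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Rank 0 shells out to a script between two barriers; on the
     * new cluster the nested mpirun inside the script never starts. */
    int main(int argc, char **argv)
    {
        int rank, rc = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) {
            rc = system("./mpi.sh");
            printf("Master back from mpi.sh, IERR=%d\n", rc);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return rc;
    }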
[OMPI users] Performance tuning: focus on latency
Hello,

I'm working on a parallel real-time renderer: an embarrassingly parallel problem where latency is the threshold to high performance. Two observations:

1) I did a simple "ping-pong" test (the master does a Bcast + an IRecv for each node + a Waitall) similar to the effective renderer workload; see the sketch after this message. Using a cluster of 37 nodes on Gigabit Ethernet, it seems that the latency is usually low (about 1-5 ms), but sometimes there are peaks of about 200 ms. I think the cause is a packet retransmission on one of the 37 connections, which blows the overall performance of the test (of course, the final Waitall is a synch).

2) A research team argues in a paper that MPI suffers at dynamically managing latency. They also raise an interesting problem about enabling/disabling the Nagle algorithm. (I paste the interesting paragraph below.)

So I have two questions:

1) Why does my test have these peaks? How can I address them (I am thinking of the btl tcp params)?

2) When does Open MPI disable the Nagle algorithm? Suppose I DON'T need Nagle to be ON (focusing only on latency), how can I increase performance?

Any useful suggestion will be REALLY appreciated.

Thanks in advance,
Biagio Cosenza

----- cut from "Interactive Ray Tracing on Commodity PC clusters", Saarland University, Germany -----

"... Communication Method: For handling communication, most parallel processing systems today use standardized libraries such as MPI [8] or PVM [10]. Although these libraries provide very powerful tools for the development of distributed software, they do not meet the efficiency requirements that we face in an interactive environment. Therefore, we had to implement all communication from scratch with standard UNIX TCP/IP calls. Though this requires significant effort, it allows extracting the maximum performance out of the network. For example, consider the 'Nagle' optimization implemented in the TCP/IP protocol, which delays small packets for a short time period to possibly combine them with successive packets to generate network-friendly packet sizes. This optimization can result in better throughput when lots of small packets are sent, but can also lead to considerable latencies, if a packet gets delayed several times. Direct control of the system's communication allows using such optimizations selectively: For example, we turn the Nagle optimization on for sockets in which updated scene data is streamed to the clients, as throughput is the main issue here. On the other hand, we turn it off for e.g. sockets used to send tiles to the clients, as this has to be done with an absolute minimum of latency. A similar behavior would be hard to achieve with standard communication libraries. ..."

-----
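[For reference, the ping-pong test described above amounts to roughly the following pattern - a minimal sketch in which message sizes, tags, and the timing are invented for illustration; run with at least 2 processes.]

    #include <mpi.h>
    #include <stdio.h>

    /* Master broadcasts a request, posts one Irecv per worker, and
     * waits for all replies; workers answer the broadcast directly. */
    int main(int argc, char **argv)
    {
        int rank, size, cmd = 1;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            int nworkers = size - 1;
            MPI_Request reqs[nworkers];   /* C99 VLA, fine for a test */
            int replies[nworkers];
            double t0 = MPI_Wtime();

            MPI_Bcast(&cmd, 1, MPI_INT, 0, MPI_COMM_WORLD);
            for (int i = 0; i < nworkers; i++)
                MPI_Irecv(&replies[i], 1, MPI_INT, i + 1, 0,
                          MPI_COMM_WORLD, &reqs[i]);
            MPI_Waitall(nworkers, reqs, MPI_STATUSES_IGNORE);
            printf("round trip: %.3f ms\n", (MPI_Wtime() - t0) * 1e3);
        } else {
            MPI_Bcast(&cmd, 1, MPI_INT, 0, MPI_COMM_WORLD);
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }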
[OMPI users] EuroPVM/MPI'07 -- Call for Participation
Call for Participation: EuroPVM/MPI'07
http://www.pvmmpi07.org

Please join us for the 14th European PVM/MPI Users' Group conference, which will be held in Paris, France from September 30 to October 3. This conference is a forum for the discussion and presentation of recent advances and major challenges in Message Passing programming of clusters and other parallel machines.

The conference will feature six keynote talks from pioneers and global leaders of message passing and parallel machines, namely:

Tony Hey, Microsoft Research, USA
Al Geist, Oak Ridge National Laboratory, USA
Ewing Lusk, Argonne National Laboratory, USA
Satoshi Matsuoka, Tokyo Institute of Technology, Japan
Bernd Mohr, Central Institute for Applied Mathematics, Germany
George Bosilca, University of Tennessee, USA

Afterwards, there will be an open forum where attendees can discuss recent modifications to the message passing standards and future directions. Also, the conference is a unique opportunity to meet the major developers and designers of communication libraries for HPC (such as PVM and MPI) and the major high-speed network interface builders, to shape future research and development.

The conference program and registration information can be found at:
http://www.pvmmpi07.org

Register soon to take advantage of the discount rates offered by the conference hotels.

PC Chairs of EuroPVM/MPI'07:
Thomas Herault, University of Paris Sud-XI / INRIA Futurs, France
Franck Cappello, INRIA Futurs, France
[OMPI users] sge qdel fails
I am using OpenMPI 1.2.3 with SGE 6.0u7 over InfiniBand (OFED 1.2), following the recommendation in the OpenMPI FAQ:

http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

The job runs, but when the user wants to delete the job with the qdel command, this fails. Does the mpirun command

mpirun -np $NSLOTS ./exe

in the sge script require extra parameters?

Thanks for any advice
Henk