Re: [OMPI users] Beowulf cluster and openmpi
I would suggest making sure that the /etc/beowulf/config file has a "libraries" line for every directory where required shared libraries (application and MPI) are located. Also, make sure that the filesystems containing the executables and shared libraries are accessible from the compute nodes.

Sean

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Rima Chaudhuri
Sent: Monday, November 03, 2008 5:50 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Beowulf cluster and openmpi

I added the option -hostfile machinefile, where the machinefile is a file with the IPs of the nodes:

#host names
192.168.0.100 slots=2
192.168.0.101 slots=2
192.168.0.102 slots=2
192.168.0.103 slots=2
192.168.0.104 slots=2
192.168.0.105 slots=2
192.168.0.106 slots=2
192.168.0.107 slots=2
192.168.0.108 slots=2
192.168.0.109 slots=2

[rchaud@helios amber10]$ ./step1
--------------------------------------------------------------------------
A daemon (pid 29837) launched by the bproc PLS component on node 192
died unexpectedly so we are aborting.

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
[helios.structure.uic.edu:29836] [0,0,0] ORTE_ERROR_LOG: Error in file pls_bproc.c at line 717
[helios.structure.uic.edu:29836] [0,0,0] ORTE_ERROR_LOG: Error in file pls_bproc.c at line 1164
[helios.structure.uic.edu:29836] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 462
[helios.structure.uic.edu:29836] mpirun: spawn failed with errno=-1

I used bpsh to see if the master and one of the nodes (n8) could see $LD_LIBRARY_PATH, and they do:

[rchaud@helios amber10]$ echo $LD_LIBRARY_PATH
/home/rchaud/openmpi-1.2.6/openmpi-1.2.6_ifort/lib
[rchaud@helios amber10]$ bpsh n8 echo $LD_LIBRARY_PATH
/home/rchaud/openmpi-1.2.6/openmpi-1.2.6_ifort/lib

thanks!
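One way to act on Sean's suggestion is to confirm, on the head node and then on each compute node, that every directory on LD_LIBRARY_PATH actually exists there. Note that in `bpsh n8 echo $LD_LIBRARY_PATH` the variable is expanded by the *local* shell before bpsh runs, so that test doesn't prove the remote node can see the directory. A minimal sketch (the bpsh invocation and node name are examples, not output from this thread):

```shell
# check_libs: report whether each directory on LD_LIBRARY_PATH exists on
# the machine this runs on. Run locally, then on a compute node, e.g.:
#   bpsh n8 sh check_libs.sh     (node name is an example)
check_libs() {
  status=0
  # Split LD_LIBRARY_PATH on ':' and test each component.
  for dir in $(printf '%s\n' "$LD_LIBRARY_PATH" | tr ':' ' '); do
    if [ -d "$dir" ]; then
      echo "ok: $dir"
    else
      echo "MISSING: $dir"
      status=1
    fi
  done
  return $status
}
check_libs || echo "some library directories are missing on this node"
```

A directory that exists on the head node but not on a compute node (for example, a home filesystem that is not exported to the blades) would produce exactly the "daemon ... died unexpectedly" message shown above.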
On Mon, Nov 3, 2008 at 3:14 PM, wrote:

> Send users mailing list submissions to
>        us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>        users-requ...@open-mpi.org
>
> You can reach the person managing the list at
>        users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
>   1. Re: Problems installing in Cygwin - Problem with GCC 3.4.4 (Jeff Squyres)
>   2. switch from mpich2 to openMPI (PattiMichelle)
>   3. Re: users Digest, Vol 1055, Issue 2 (Ralph Castain)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 3 Nov 2008 15:52:22 -0500
> From: Jeff Squyres
> Subject: Re: [OMPI users] Problems installing in Cygwin - Problem with GCC 3.4.4
> To: "Gustavo Seabra"
> Cc: Open MPI Users
> Message-ID:
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> On Nov 3, 2008, at 3:36 PM, Gustavo Seabra wrote:
>
>>>> For your fortran issue, the Fortran 90 interface needs the Fortran
>>>> 77 interface. So you need to supply an F77 as well (the output from
>>>> configure should indicate that the F90 interface was disabled
>>>> because the F77 interface was disabled).
>>>
>>> Is that what you mean (see below)?
>>
>> Ah yes -- that's another reason the f90 interface could be disabled:
>> if configure detects that the f77 and f90 compilers are not
>> link-compatible.
>>
>>> I thought the g95 compiler could
>>> deal with F77 as well as F95... If so, could I just pass F77='g95'?
>>
>> That would probably work (F77=g95). I don't know the g95 compiler at
>> all, so I don't know if it also accepts Fortran-77-style codes. But
>> if it does, then you're set. Otherwise, specify a different F77
>> compiler that is link compatible with g95 and you should be good.
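Jeff's suggestion above can be sketched as a configure invocation. This is a hedged example, not a command from the thread: the install prefix is hypothetical, and whether g95 really accepts Fortran 77 sources has to be verified first, as Jeff notes.

```shell
# Example only: give configure link-compatible F77 and F90 compilers so
# the f90 MPI bindings are not disabled. Assumes g95 accepts F77 sources.
./configure F77=g95 FC=g95 --prefix=$HOME/openmpi-g95
make all install
```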
>>>> I looked in some places in the Open MPI code, but I couldn't find
>>>> "max" being redefined anywhere; I may be looking in the wrong places.
>>>> Anyway, the only way I found of compiling Open MPI was a very ugly
>>>> hack: I had to go into those files and remove the "std::" before the
>>>> "max". With that, it all compiled cleanly.
>>>
>>> I'm not sure I follow -- I don't see anywhere in OMPI where we use
>>> std::max.
>>> What areas did you find that you needed to change?
>>
>> These files are part of the standard C++ headers. In my case, they
>> sit in:
>> /usr/lib/gcc/i686-pc-cygwin/3.4.4/include/c++/bits
>
> Ah, I see.
>
>> In principle, the problem that comes from those files would mean
>> that the OpenMPI source has some macro redefining max, but that's
>> what I could not find :-(
>
> Gotcha. I
Re: [OMPI users] orterun --bynode/--byslot problem
Would this logic be in the bproc pls component?

Sean

From: users-boun...@open-mpi.org on behalf of Ralph H Castain
Sent: Mon 7/23/2007 9:18 AM
To: Open MPI Users
Subject: Re: [OMPI users] orterun --bynode/--byslot problem

No, byslot appears to be working just fine on our bproc clusters (it is the default mode). As you probably know, bproc is a little strange in how we launch: we have to launch the procs in "waves" that correspond to the number of procs on a node. In other words, the first "wave" launches a proc on all nodes that have at least one proc on them. The second "wave" then launches another proc on all nodes that have at least two procs on them, but doesn't launch anything on any node that only has one proc on it.

My guess here is that the system for some reason is insisting that your head node be involved in every wave. I confess that we have never tested (to my knowledge) a mapping that involves "skipping" a node somewhere in the allocation - we always just map from the beginning of the node list, with the maximum number of procs being placed on the first nodes in the list (since in our machines the nodes are all the same, so who cares?). So it is possible that something in the code objects to skipping around nodes in the allocation. I will have to look and see where that dependency might lie - will try to get to it this week.

BTW: that patch I sent you for head node operations will be in 1.2.4.

Ralph

On 7/23/07 7:04 AM, "Kelley, Sean" wrote:

> Hi,
>
> We are experiencing a problem with the process allocation on our Open MPI
> cluster. We are using Scyld 4.1 (BPROC), the OFED 1.2 Topspin Infiniband
> drivers, and Open MPI 1.2.3 + patch (to run processes on the head node). The
> hardware consists of a head node and N blades on private ethernet and
> infiniband networks.
>
> The command run for these tests is a simple MPI program (called 'hn') which
> prints out the rank and the hostname. The hostname for the head node is 'head'
> and the compute nodes are '.0' ... '.9'.
>
> We are using the following hostfiles for this example:
>
> hostfile7
> -1 max_slots=1
> 0 max_slots=3
> 1 max_slots=3
>
> hostfile8
> -1 max_slots=2
> 0 max_slots=3
> 1 max_slots=3
>
> hostfile9
> -1 max_slots=3
> 0 max_slots=3
> 1 max_slots=3
>
> Running the following commands:
>
> orterun --hostfile hostfile7 -np 7 ./hn
> orterun --hostfile hostfile8 -np 8 ./hn
> orterun --byslot --hostfile hostfile7 -np 7 ./hn
> orterun --byslot --hostfile hostfile8 -np 8 ./hn
>
> causes orterun to crash. However,
>
> orterun --hostfile hostfile9 -np 9 ./hn
> orterun --byslot --hostfile hostfile9 -np 9 ./hn
>
> works, outputting the following:
>
> 0 head
> 1 head
> 2 head
> 3 .0
> 4 .0
> 5 .0
> 6 .0
> 7 .0
> 8 .0
>
> However, running the following:
>
> orterun --bynode --hostfile hostfile7 -np 7 ./hn
>
> works, outputting the following:
>
> 0 head
> 1 .0
> 2 .1
> 3 .0
> 4 .1
> 5 .0
> 6 .1
>
> Is the '--byslot' crash a known problem? Does it have something to do with
> BPROC? Thanks in advance for any assistance!
>
> Sean
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
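The intended placement rules under discussion can be modeled in a few lines. This is an idealized sketch, not Open MPI code, and the node names and slot counts are examples; note that the thread's actual hostfile9 byslot output puts six ranks on '.0' despite max_slots=3, so the real behavior evidently diverged from this model, which may itself be part of the bug.

```shell
# Idealized model of --byslot vs --bynode rank placement (not OMPI code).
# Nodes are passed as "name:slots" pairs; names/counts below are examples.
map_ranks() {  # usage: map_ranks byslot|bynode <np> "name:slots name:slots ..."
  awk -v mode="$1" -v np="$2" -v spec="$3" 'BEGIN {
    n = split(spec, hosts, " ")
    for (i = 1; i <= n; i++) {
      split(hosts[i], a, ":"); name[i] = a[1]; slots[i] = a[2]; used[i] = 0
    }
    rank = 0
    if (mode == "byslot") {
      # Fill each node up to its slot count before moving to the next.
      for (i = 1; i <= n && rank < np; i++)
        while (used[i] < slots[i] && rank < np) { print rank++, name[i]; used[i]++ }
    } else {
      # One rank per node per "wave", skipping nodes that are already full.
      while (rank < np) {
        placed = 0
        for (i = 1; i <= n && rank < np; i++)
          if (used[i] < slots[i]) { print rank++, name[i]; used[i]++; placed = 1 }
        if (!placed) exit 1   # more ranks requested than total slots
      }
    }
  }'
}
map_ranks bynode 7 "head:1 n0:3 n1:3"
```

With these example inputs the bynode case reproduces the alternating pattern Sean reports for `--bynode --hostfile hostfile7 -np 7` (head, then alternating between the two compute nodes), which is consistent with the wave-based launch Ralph describes.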
Re: [OMPI users] mpirun hanging when processes started on head node
Ralph,

Thanks for the quick response; clarifications below.

Sean

From: users-boun...@open-mpi.org on behalf of Ralph H Castain
Sent: Mon 6/11/2007 3:49 PM
To: Open MPI Users
Subject: Re: [OMPI users] mpirun hanging when processes started on head node

Hi Sean

Could you please clarify something? I'm a little confused by your comments about where things are running. I'm assuming that you mean everything works fine if you type the mpirun command on the head node and just let it launch on your compute nodes - that the problems only occur when you specifically tell mpirun you want processes on the head node as well (or exclusively). Is that correct?

[Sean] This is correct.

There are several possible sources of trouble, if I have understood your situation correctly. Our bproc support is somewhat limited at the moment - you may be encountering one of those limits. We currently have bproc support focused on the configuration here at Los Alamos National Lab as (a) that is where the bproc-related developers are working, and (b) it is the only regular test environment we have to work with for bproc.

We don't normally use bproc in combination with hostfiles, so I'm not sure if there is a problem in that combination. I can investigate that a little later this week.

[Sean] If it is helpful, running 'export NODES=-1; mpirun -np 1 hostname' exhibits identical behaviour.

Similarly, we require that all the nodes being used must be accessible via the same launch environment. It sounds like we may be able to launch processes on your head node via rsh, but not necessarily bproc. You might check to ensure that the head node will allow bproc-based process launch (I know ours don't - all jobs are run solely on the compute nodes. I believe that is generally the case). We don't currently support mixed environments, and I honestly don't expect that to change anytime soon.

[Sean] I'm working through the strace output to follow the progression on the head node. It looks like mpirun consults '/bpfs/self' and determines that the request is to be run on the local machine, so it fork/execs 'orted', which then runs 'hostname'. 'mpirun' didn't consult '/bpfs' or utilize 'rsh' after the determination to run on the local machine was made. When the 'hostname' command completes, 'orted' receives the SIGCHLD signal, performs some work, and then both 'mpirun' and 'orted' go into what appears to be a poll() waiting for events.

Hope that helps at least a little.

[Sean] I appreciate the help. We are running processes on the head node because the head node is the only node which can access external resources (storage devices).

Ralph

On 6/11/07 1:04 PM, "Kelley, Sean" wrote:

I forgot to add that we are using 'bproc'. Launching processes on the compute nodes using bproc works well; I'm not sure if bproc is involved when processes are launched on the local node.

Sean

From: users-boun...@open-mpi.org on behalf of Kelley, Sean
Sent: Mon 6/11/2007 2:07 PM
To: us...@open-mpi.org
Subject: [OMPI users] mpirun hanging when processes started on head node

Hi,

We are running the OFED 1.2rc4 distribution containing openmpi-1.2.2 on a RedhatEL4U4 system with Scyld Clusterware 4.1. The hardware configuration consists of a DELL 2950 as the headnode and 3 DELL 1950 blades as compute nodes, using Cisco TopSpin Infiniband HCAs and switches for the interconnect.

When we use 'mpirun' from the OFED/Open MPI distribution to start processes on the compute nodes, everything works correctly. However, when we try to start processes on the head node, the processes appear to run correctly but 'mpirun' hangs and does not terminate until killed. The attached 'run1.tgz' file contains detailed information from running the following command:

mpirun --hostfile hostfile1 --np 1 --byslot --debug-daemons -d hostname

where 'hostfile1' contains the following:

-1 slots=2 max_slots=2

The 'run.log' is the output of the above line. The 'strace.out.0' is the result of 'strace -f' on the mpirun process (and the 'hostname' child process, since mpirun simply forks the local processes). The child process (pid 23415 in this case) runs to completion and exits successfully. The parent process (mpirun) doesn't appear to recognize that the child has completed and hangs until killed (with a ^C).

Additionally, when we run a set of processes which span the headnode and the compute nodes, the processes on the head node complete successfully, but the processes on the compute nodes do not appear to start. mpirun again appears to hang. Do I have a configuration error or is there a problem that I have encountered?

Thank you in advance for your assistance or suggestions

Sean
Re: [OMPI users] mpirun hanging when processes started on head node
I forgot to add that we are using 'bproc'. Launching processes on the compute nodes using bproc works well; I'm not sure if bproc is involved when processes are launched on the local node.

Sean

From: users-boun...@open-mpi.org on behalf of Kelley, Sean
Sent: Mon 6/11/2007 2:07 PM
To: us...@open-mpi.org
Subject: [OMPI users] mpirun hanging when processes started on head node

Hi,

We are running the OFED 1.2rc4 distribution containing openmpi-1.2.2 on a RedhatEL4U4 system with Scyld Clusterware 4.1. The hardware configuration consists of a DELL 2950 as the headnode and 3 DELL 1950 blades as compute nodes, using Cisco TopSpin Infiniband HCAs and switches for the interconnect.

When we use 'mpirun' from the OFED/Open MPI distribution to start processes on the compute nodes, everything works correctly. However, when we try to start processes on the head node, the processes appear to run correctly but 'mpirun' hangs and does not terminate until killed. The attached 'run1.tgz' file contains detailed information from running the following command:

mpirun --hostfile hostfile1 --np 1 --byslot --debug-daemons -d hostname

where 'hostfile1' contains the following:

-1 slots=2 max_slots=2

The 'run.log' is the output of the above line. The 'strace.out.0' is the result of 'strace -f' on the mpirun process (and the 'hostname' child process, since mpirun simply forks the local processes). The child process (pid 23415 in this case) runs to completion and exits successfully. The parent process (mpirun) doesn't appear to recognize that the child has completed and hangs until killed (with a ^C).

Additionally, when we run a set of processes which span the headnode and the compute nodes, the processes on the head node complete successfully, but the processes on the compute nodes do not appear to start. mpirun again appears to hang. Do I have a configuration error or is there a problem that I have encountered?

Thank you in advance for your assistance or suggestions

Sean

--
Sean M. Kelley
sean.kel...@solers.com