Re: [OMPI users] Beowulf cluster and openmpi

2008-11-05 Thread Kelley, Sean
I would suggest making sure that the /etc/beowulf/config file has a
"libraries" line for every directory that holds required shared libraries
(both the application's and Open MPI's).
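
For example (a hedged sketch; the exact entries depend on your Scyld/ClusterWare
setup, and the Open MPI path below is simply the LD_LIBRARY_PATH quoted later in
this message):

    # excerpt from /etc/beowulf/config (illustrative only)
    libraries /home/rchaud/openmpi-1.2.6/openmpi-1.2.6_ifort/lib

The compute nodes generally pick up a changed library list the next time they
boot or the Beowulf service is restarted.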

Also, make sure that the filesystems containing the executables and
shared libraries are accessible from the compute nodes. 
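
A quick check from the head node (hedged example; 'n8' is the node name used
later in this thread, and the second path, the job's working directory, is only
a guess):

    bpsh n8 ls /home/rchaud/openmpi-1.2.6/openmpi-1.2.6_ifort/lib
    bpsh n8 ls /home/rchaud/amber10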

Sean

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Rima Chaudhuri
Sent: Monday, November 03, 2008 5:50 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Beowulf cluster and openmpi

I added the -hostfile machinefile option, where machinefile is a file with
the IPs of the nodes:
#host names
192.168.0.100 slots=2
192.168.0.101 slots=2
192.168.0.102 slots=2
192.168.0.103 slots=2
192.168.0.104 slots=2
192.168.0.105 slots=2
192.168.0.106 slots=2
192.168.0.107 slots=2
192.168.0.108 slots=2
192.168.0.109 slots=2


[rchaud@helios amber10]$ ./step1

--------------------------------------------------------------------------
A daemon (pid 29837) launched by the bproc PLS component on node 192
died unexpectedly so we are aborting.

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.

--------------------------------------------------------------------------
[helios.structure.uic.edu:29836] [0,0,0] ORTE_ERROR_LOG: Error in file pls_bproc.c at line 717
[helios.structure.uic.edu:29836] [0,0,0] ORTE_ERROR_LOG: Error in file pls_bproc.c at line 1164
[helios.structure.uic.edu:29836] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 462
[helios.structure.uic.edu:29836] mpirun: spawn failed with errno=-1
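
(One way to make sure the path actually reaches the remote daemons is to export
it explicitly on the mpirun command line; Open MPI's -x flag forwards the named
environment variable. The process count and program name below are placeholders
for whatever ./step1 actually runs.)

    mpirun -hostfile machinefile -x LD_LIBRARY_PATH -np 20 ./my_mpi_program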

I used bpsh to see if the master and one of the nodes (n8) could see
$LD_LIBRARY_PATH, and they do:

[rchaud@helios amber10]$ echo $LD_LIBRARY_PATH
/home/rchaud/openmpi-1.2.6/openmpi-1.2.6_ifort/lib

[rchaud@helios amber10]$ bpsh n8 echo $LD_LIBRARY_PATH
/home/rchaud/openmpi-1.2.6/openmpi-1.2.6_ifort/lib
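
(Note that in the check above the local shell expands $LD_LIBRARY_PATH before
bpsh runs, so both lines necessarily print the head node's value. Forcing the
expansion to happen on the compute node, assuming a shell is available there,
would look like:)

    bpsh n8 /bin/sh -c 'echo $LD_LIBRARY_PATH'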

thanks!



Re: [OMPI users] orterun --bynode/--byslot problem

2007-07-23 Thread Kelley, Sean
Would this logic be in the bproc pls component?
Sean



From: users-boun...@open-mpi.org on behalf of Ralph H Castain
Sent: Mon 7/23/2007 9:18 AM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] orterun --bynode/--byslot problem



No, byslot appears to be working just fine on our bproc clusters (it is the
default mode). As you probably know, bproc is a little strange in how we
launch - we have to launch the procs in "waves" that correspond to the
number of procs on a node.

In other words, the first "wave" launches a proc on all nodes that have at
least one proc on them. The second "wave" then launches another proc on all
nodes that have at least two procs on them, but doesn't launch anything on
any node that only has one proc on it.
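
(For illustration only, not Open MPI's actual mapping code: a rough sketch of
how such waves follow from a per-node process count. The counts below are just
an example, one process on the head node and three on each of two compute nodes.)

    # Rough sketch of the "wave" idea described above; not Open MPI source code.
    # counts: how many processes were mapped to each node (example numbers only).
    counts = {"head": 1, "n0": 3, "n1": 3}

    def waves(counts):
        """Wave k launches one process on every node with at least k processes mapped."""
        out = []
        k = 1
        while True:
            wave = [node for node, c in counts.items() if c >= k]
            if not wave:
                return out
            out.append(wave)
            k += 1

    for i, w in enumerate(waves(counts), 1):
        print("wave", i, "->", w)
    # wave 1 -> ['head', 'n0', 'n1']
    # wave 2 -> ['n0', 'n1']
    # wave 3 -> ['n0', 'n1']

A node with fewer processes than the others simply drops out of the later waves.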

My guess here is that the system for some reason is insisting that your head
node be involved in every wave. I confess that we have never tested (to my
knowledge) a mapping that involves "skipping" a node somewhere in the
allocation - we always just map from the beginning of the node list, with
the maximum number of procs being placed on the first nodes in the list
(since in our machines, the nodes are all the same, so who cares?). So it is
possible that something in the code objects to skipping around nodes in the
allocation.

I will have to look and see where that dependency might lie - will try to
get to it this week.

BTW: that patch I sent you for head node operations will be in 1.2.4.

Ralph



On 7/23/07 7:04 AM, "Kelley, Sean" <sean.kel...@solers.com> wrote:

> Hi,
> 
>  We are experiencing a problem with the process allocation on our Open MPI
> cluster. We are using Scyld 4.1 (BPROC), the OFED 1.2 Topspin Infiniband
> drivers, Open MPI 1.2.3 + patch (to run processes on the head node). The
> hardware consists of a head node and N blades on private ethernet and
> infiniband networks.
> 
> The command run for these tests is a simple MPI program (called 'hn') which
> prints out the rank and the hostname. The hostname for the head node is 'head'
> and the compute nodes are '.0' ... '.9'.
> 
> We are using the following hostfiles for this example:
> 
> hostfile7
> -1 max_slots=1
> 0 max_slots=3
> 1 max_slots=3
> 
> hostfile8
> -1 max_slots=2
> 0 max_slots=3
> 1 max_slots=3
> 
> hostfile9
> -1 max_slots=3
> 0 max_slots=3
> 1 max_slots=3
> 
> running the following commands:
> 
> orterun --hostfile hostfile7 -np 7 ./hn
> orterun --hostfile hostfile8 -np 8 ./hn
> orterun --byslot --hostfile hostfile7 -np 7 ./hn
> orterun --byslot --hostfile hostfile8 -np 8 ./hn
> 
> causes orterun to crash. However,
> 
> orterun --hostfile hostfile9 -np 9 ./hn
> orterun --byslot --hostfile hostfile9 -np 9 ./hn
> 
> works, outputting the following:
> 
> 0 head
> 1 head
> 2 head
> 3 .0
> 4 .0
> 5 .0
> 6 .0
> 7 .0
> 8 .0
> 
> However, running the following:
> 
> orterun --bynode --hostfile hostfile7 -np 7 ./hn
> 
> works, outputting the following:
> 
> 0 head
> 1 .0
> 2 .1
> 3 .0
> 4 .1
> 5 .0
> 6 .1
> 
> Is the '--byslot' crash a known problem? Does it have something to do with
> BPROC? Thanks in advance for any assistance!
> 
> Sean
> 




Re: [OMPI users] mpirun hanging when processes started on head node

2007-06-11 Thread Kelley, Sean
Ralph,
 Thanks for the quick response, clarifications below.
  Sean



From: users-boun...@open-mpi.org on behalf of Ralph H Castain
Sent: Mon 6/11/2007 3:49 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] mpirun hanging when processes started on head node


Hi Sean

Could you please clarify something? I'm a little confused by your comments 
about where things are running. I'm assuming that you mean everything works 
fine if you type the mpirun command on the head node and just let it launch on 
your compute nodes - that the problems only occur when you specifically tell 
mpirun you want processes on the head node as well (or exclusively). Is that 
correct?

[Sean] This is correct.


There are several possible sources of trouble, if I have understood your 
situation correctly. Our bproc support is somewhat limited at the moment - you 
may be encountering one of those limits. We currently have bproc support 
focused on the configuration here at Los Alamos National Lab as (a) that is 
where the bproc-related developers are working, and (b) it is the only regular 
test environment we have to work with for bproc. We don't normally use bproc in 
combination with hostfiles, so I'm not sure if there is a problem in that 
combination. I can investigate that a little later this week.

[Sean] If it is helpful, running 'export NODES=-1; mpirun -np 1 hostname' 
exhibits identical behaviour.

Similarly, we require that all the nodes being used must be accessible via the 
same launch environment. It sounds like we may be able to launch processes on 
your head node via rsh, but not necessarily bproc. You might check to ensure 
that the head node will allow bproc-based process launch (I know ours don't - 
all jobs are run solely on the compute nodes. I believe that is generally the 
case). We don't currently support mixed environments, and I honestly don't 
expect that to change anytime soon.


[Sean] I'm working through the strace output to follow the progression on the 
head node. It looks like mpirun consults '/bpfs/self' and determines that the 
request is to be run on the local machine so it fork/execs 'orted' which then 
runs 'hostname'. 'mpirun' didn't consult '/bpfs' or utilize 'rsh' after the 
determination to run on the local machine was made. When the 'hostname' command 
completes, 'orted' receives the SIGCHLD signal, performs some work and then 
both 'mpirun' and 'orted' go into what appears to be a poll() waiting for 
events.
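
(A minimal sketch of the launch-and-reap pattern being traced here; this is not
mpirun/orted code, only an illustration of why a missing completion notification
would leave a parent sitting in poll(). The event loop only wakes up if whatever
reaps the child actually posts an event it is polling on.)

    # Illustration only; not Open MPI source. The parent forks a child, reaps it in a
    # SIGCHLD handler, and wakes its poll() loop through a pipe. If the notify write
    # were missing, the child would be reaped but the parent would poll() forever,
    # which is the shape of the hang described above.
    import os, select, signal

    r, w = os.pipe()

    def on_sigchld(signum, frame):
        os.waitpid(-1, os.WNOHANG)     # reap the finished child
        os.write(w, b"done")           # post an event so the poll() loop wakes up

    signal.signal(signal.SIGCHLD, on_sigchld)

    pid = os.fork()
    if pid == 0:
        os.execvp("hostname", ["hostname"])   # child: run the command and exit

    poller = select.poll()
    poller.register(r, select.POLLIN)
    poller.poll()                      # parent: blocks here until notified
    print("child reaped; parent can exit cleanly")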


Hope that helps at least a little.

[Sean] I appreciate the help. We are running processes on the head node because 
the head node is the only node which can access external resources (storage 
devices). 


Ralph





On 6/11/07 1:04 PM, "Kelley, Sean" <sean.kel...@solers.com> wrote:



I forgot to add that we are using 'bproc'. Launching processes on the 
compute nodes using bproc works well; I'm not sure whether bproc is involved when 
processes are launched on the local node.

Sean




From: users-boun...@open-mpi.org on behalf of Kelley, Sean
Sent: Mon 6/11/2007 2:07 PM
To: us...@open-mpi.org
Subject: [OMPI users] mpirun hanging when processes started on head node

Hi,
  We are running the OFED 1.2rc4 distribution containing 
openmpi-1.2.2 on a RedhatEL4U4 system with Scyld Clusterware 4.1. The hardware 
configuration consists of a DELL 2950 as the headnode and 3 DELL 1950 blades as 
compute nodes using Cisco TopSpin InfiniBand HCAs and switches for the 
interconnect.

  When we use 'mpirun' from the OFED/Open MPI distribution to start 
processes on the compute nodes, everything works correctly. However, when we 
try to start processes on the head node, the processes appear to run correctly 
but 'mpirun' hangs and does not terminate until killed. The attached 'run1.tgz' 
file contains detailed information from running the following command:

 mpirun --hostfile hostfile1 --np 1 --byslot --debug-daemons -d 
hostname

where 'hostfile1' contains the following:

-1 slots=2 max_slots=2

The 'run.log' is the output of the above line. The 'strace.out.0' is 
the result of 'strace -f' on the mpirun process (and the 'hostname' child 
process since mpirun simply forks the local processes). The child process (pid 
23415 in this case) runs to completion and exits successfully. The parent 
process (mpirun) doesn't appear to recognize that the child has completed and 
hangs until killed (with a ^c). 

Additionally, when we run a set of processes which span the headnode 
and the compute nodes, the processes on the head node complete successfully, 
but the processes on the compute nodes do not appear to start. mpirun again 
appears to hang.

Do I have a configuration error or is there a problem that I have encountered? 
Thank you in advance for your assistance or suggestions.

[OMPI users] mpirun hanging when processes started on head node

2007-06-11 Thread Kelley, Sean
Hi,
  We are running the OFED 1.2rc4 distribution containing openmpi-1.2.2 on a 
RedhatEL4U4 system with Scyld Clusterware 4.1. The hardware configuration 
consists of a DELL 2950 as the headnode and 3 DELL 1950 blades as compute nodes 
using Cisco TopSpin InfiniBand HCAs and switches for the interconnect.
 
   When we use 'mpirun' from the OFED/Open MPI distribution to start 
processes on the compute nodes, everything works correctly. However, when we 
try to start processes on the head node, the processes appear to run correctly 
but 'mpirun' hangs and does not terminate until killed. The attached 'run1.tgz' 
file contains detailed information from running the following command:
 
  mpirun --hostfile hostfile1 --np 1 --byslot --debug-daemons -d hostname
 
where 'hostfile1' contains the following:
 
-1 slots=2 max_slots=2
 
The 'run.log' is the output of the above line. The 'strace.out.0' is the result 
of 'strace -f' on the mpirun process (and the 'hostname' child process since 
mpirun simply forks the local processes). The child process (pid 23415 in this 
case) runs to completion and exits successfully. The parent process (mpirun) 
doesn't appear to recognize that the child has completed and hangs until killed 
(with a ^c). 
 
Additionally, when we run a set of processes which span the headnode and the 
compute nodes, the processes on the head node complete successfully, but the 
processes on the compute nodes do not appear to start. mpirun again appears to 
hang.
 
Do I have a configuration error or is there a problem that I have encountered? 
Thank you in advance for your assistance or suggestions.
 
Sean
 
--
Sean M. Kelley
sean.kel...@solers.com
 
 


Attachment: run1.tgz