Re: [OMPI users] memory leak in alltoallw

2008-08-18 Thread Dave Grote





Great! Thanks for the fix.
   Dave

Tim Mattox wrote:

  The fix for this bug is in the 1.2 branch as of r19360, and will be in the
upcoming 1.2.7 release.

On Sun, Aug 17, 2008 at 6:10 PM, George Bosilca wrote:
Dave,

Thanks for your report. As you discovered, we had a memory leak in
MPI_Alltoallw. A very small one, but it was there. Basically, we didn't
release two internal arrays of data-types, used to convert from the Fortran
data-types (as supplied by the user) to their C versions (as required by the
implementation of the alltoallw function).
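
To make the pattern concrete, here is a minimal C sketch of what such a
Fortran-to-C binding typically has to do (illustrative only -- this is not the
actual Open MPI source, and the wrapper name and argument layout are invented
for the example):

  /* Illustrative sketch only, not the Open MPI implementation.  A Fortran
   * alltoallw binding must translate the per-rank arrays of Fortran datatype
   * handles into C MPI_Datatype arrays before calling the C routine; the leak
   * was simply that the two temporary arrays were never freed afterwards. */
  #include <stdlib.h>
  #include <mpi.h>

  void alltoallw_from_fortran(void *sbuf, int *scounts, int *sdispls,
                              MPI_Fint *fsendtypes,
                              void *rbuf, int *rcounts, int *rdispls,
                              MPI_Fint *frecvtypes, MPI_Fint fcomm)
  {
      MPI_Comm comm = MPI_Comm_f2c(fcomm);
      int i, size;
      MPI_Comm_size(comm, &size);

      /* The two internal arrays of data-types, one entry per rank. */
      MPI_Datatype *sendtypes = malloc(size * sizeof(MPI_Datatype));
      MPI_Datatype *recvtypes = malloc(size * sizeof(MPI_Datatype));
      for (i = 0; i < size; i++) {
          sendtypes[i] = MPI_Type_f2c(fsendtypes[i]);
          recvtypes[i] = MPI_Type_f2c(frecvtypes[i]);
      }

      MPI_Alltoallw(sbuf, scounts, sdispls, sendtypes,
                    rbuf, rcounts, rdispls, recvtypes, comm);

      /* Releasing the temporaries is the step that was missing; without these
       * two calls, every MPI_Alltoallw call leaks a little memory, even when
       * all the counts are zero. */
      free(sendtypes);
      free(recvtypes);
  }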

The good news is that this should not be a problem anymore. Commit 19314 fixes
this for the trunk, while commit 19315 fixes it for the upcoming 1.3.

 Thanks again for your report.
   george.

On Aug 7, 2008, at 1:21 AM, Dave Grote wrote:



  Hi,
I've been enhancing my code and have started using the nice routine
alltoallw. The code works fine except that there seems to be a memory leak
in alltoallw. I've eliminated all other possible causes and have reduced the
code down to a bare minimum. I've included Fortran source code which
reproduces the problem. This code just keeps calling alltoallw, but with all
of the send and receive counts set to zero, so it shouldn't be doing
anything. And yet I can watch the memory continue to grow. As a sanity
check, I changed the code to call alltoallv instead, and there is no memory
leak. If it helps, I am using Open MPI on an AMD system running Chaos Linux.
I tried the latest nightly build of version 1.3 from Aug 5. I run four
processors on one quad-core node, so it should be using shared-memory
communication.
 Thanks!
   Dave

   program testalltoallw
   real(kind=8):: phi(-3:3200+3)
   real(kind=8):: phi2(-3:3200+3)
   integer(4):: izproc,ii
   integer(4):: nzprocs
   integer(4):: zrecvtypes(0:3),zsendtypes(0:3)
   integer(4):: zsendcounts(0:3),zrecvcounts(0:3)
   integer(4):: zdispls(0:3)
   integer(4):: mpierror
   include "mpif.h"
   phi = 0.

   call MPI_INIT(mpierror)
   call MPI_COMM_SIZE(MPI_COMM_WORLD,nzprocs,mpierror)
   call MPI_COMM_RANK(MPI_COMM_WORLD,izproc,mpierror)

   zsendcounts=0
   zrecvcounts=0
   zdispls=0
   zdispls=0
   zsendtypes=MPI_DOUBLE_PRECISION
   zrecvtypes=MPI_DOUBLE_PRECISION

   do ii=1,10
 if (mod(ii,100_4) == 0) print*,"loop ",ii,izproc

 call MPI_ALLTOALLW(phi,zsendcounts,zdispls,zsendtypes,
  & phi2,zrecvcounts,zdispls,zrecvtypes,
  & MPI_COMM_WORLD,mpierror)

   enddo
   return
   end


[OMPI users] memory leak in alltoallw

2008-08-06 Thread Dave Grote


Hi,
I've been enhancing my code and have started using the nice routine
alltoallw. The code works fine except that there seems to be a memory
leak in alltoallw. I've eliminated all other possible causes and have
reduced the code down to a bare minimum. I've included Fortran source
code which reproduces the problem. This code just keeps calling alltoallw,
but with all of the send and receive counts set to zero, so it shouldn't
be doing anything. And yet I can watch the memory continue to grow. As a
sanity check, I changed the code to call alltoallv instead, and there is
no memory leak. If it helps, I am using Open MPI on an AMD system running
Chaos Linux. I tried the latest nightly build of version 1.3 from Aug 5.
I run four processors on one quad-core node, so it should be using
shared-memory communication.

  Thanks!
 Dave

 program testalltoallw
 real(kind=8):: phi(-3:3200+3)
 real(kind=8):: phi2(-3:3200+3)
 integer(4):: izproc,ii
 integer(4):: nzprocs
 integer(4):: zrecvtypes(0:3),zsendtypes(0:3)
 integer(4):: zsendcounts(0:3),zrecvcounts(0:3)
 integer(4):: zdispls(0:3)
 integer(4):: mpierror
 include "mpif.h"
 phi = 0.

 call MPI_INIT(mpierror)
 call MPI_COMM_SIZE(MPI_COMM_WORLD,nzprocs,mpierror)
 call MPI_COMM_RANK(MPI_COMM_WORLD,izproc,mpierror)

 zsendcounts=0
 zrecvcounts=0
 zdispls=0
 zdispls=0
 zsendtypes=MPI_DOUBLE_PRECISION
 zrecvtypes=MPI_DOUBLE_PRECISION

 do ii=1,10
   if (mod(ii,100_4) == 0) print*,"loop ",ii,izproc

   call MPI_ALLTOALLW(phi,zsendcounts,zdispls,zsendtypes,
& phi2,zrecvcounts,zdispls,zrecvtypes,
& MPI_COMM_WORLD,mpierror)

 enddo
 return
 end



Re: [OMPI users] Problem with X forwarding

2008-06-09 Thread Dave Grote


I had this same issue a while ago. Search for "x11 forwarding" in 
the archives. The solution I settled on is to use the -d option, the 
debug option. With this option, mpirun will keep the ssh sessions open, 
and so the X forwarding stays active. Note that you do get lots of 
debugging output at the start of the run, but after that, there's no 
extra output. An enhancement ticket was going to be added to add a 
command line option to keep the ssh sessions open (without having to 
turn debugging on). I never heard anything more on it, so apparently 
nothing happened. But using the -d option does work well and doesn't 
require any extra fiddling.
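
For example (an illustrative command line only; the host file and program
names are placeholders):

  mpirun -d -hostfile myhostfile -np 4 ./my_x_application

The extra -d output shows up around startup and shutdown, but because the ssh
sessions stay open, the X forwarding they carry remains available for the
whole run.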

  Dave

Allen Barnett wrote:

If you are using a recent version of Linux (as machine A), the X server
is probably started with its TCP network connection turned off. For
example, if you do:

$ ps auxw | grep X
/usr/bin/Xorg :0 -br -audit 0 -auth /var/gdm/:0.Xauth -nolisten tcp vt7

The "-nolisten tcp" option turns off the X server's remote connection
socket. Also, "netstat -atp" on A will show that nothing is listening on
port 6000. So, for example, from machine B:

[B]$ xlogo -display A:0

doesn't work.

The trick I've used: Before you run your MPI application, you can ssh to
the remote node with X forwarding enabled ("ssh -Y"). On the remote
system, do "echo $DISPLAY" to see what DISPLAY environment variable ssh
created. For example, it might be something like "localhost:10.0". Leave
this ssh connection open and then run your OMPI application in another
window and pass "-x DISPLAY=localhost:10.0" through MPI. X applications
on the remote node *should* now be able to connect back through the open
ssh connection. This probably won't scale very well, though.
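
So the whole sequence looks something like this (illustrative only; the host
and program names are placeholders):

  [you@head]$ ssh -Y node01
  [you@node01]$ echo $DISPLAY    # e.g. localhost:10.0 -- leave this session open

and then, in another window on the head node:

  [you@head]$ mpirun -np 4 -x DISPLAY=localhost:10.0 ./my_vtk_app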

Allen

On Wed, 2008-06-04 at 14:36 -0400, Jeff Squyres wrote:
  
In general, Open MPI doesn't have anything to do with X forwarding.   
However, if you're using ssh to startup your processes, ssh may  
configure X forwarding for you (depending on your local system  
setup).  But OMPI closes down ssh channels once applications have  
launched (there's no need to keep them open), so any X forwarding that  
may have been set up will be closed down.


The *easiest* way to setup X forwarding is simply to allow X  
connections to your local host from the node(s) that will be running  
your application.  E.g., use the "xhost" command to add the target  
nodes into the access list.  And then have mpirun export a suitable  
DISPLAY variable, such as:


export DISPLAY=my_hostname:0
mpirun -x DISPLAY ...

The "-x DISPLAY" clause tells Open MPI to export the value of the  
DISPLAY variable to all nodes when running your application.


Hope this helps.


On May 30, 2008, at 1:24 PM, Cally K wrote:


hi, I have some problem running DistributedData.cxx (it is a VTK file); I
need to be able to see the rendering from my computer.

I, however, have a problem running the executable. I loaded the executable
onto both machines, and I am accessing it from my computer (DHCP enabled).

After running the following command (I use Open MPI)

mpirun -hostfile myhostfile -np 2 -bynode ./DistributedData

and I keep getting these errors

ERROR: In /home/kalpanak/Installation_Files/VTKProject/VTK/Rendering/vtkXOpenGLRenderWindow.cxx, line 326
vtkXOpenGLRenderWindow (0x8664438): bad X server connection.

ERROR: In /home/kalpanak/Installation_Files/VTKProject/VTK/Rendering/vtkXOpenGLRenderWindow.cxx, line 169
vtkXOpenGLRenderWindow (0x8664438): bad X server connection.

[vrc1:27394] *** Process received signal ***
[vrc1:27394] Signal: Segmentation fault (11)
[vrc1:27394] Signal code: Address not mapped (1)
[vrc1:27394] Failing at address: 0x84
[vrc1:27394] [ 0] [0xe440]
[vrc1:27394] [ 1] ./DistributedData(_ZN22vtkXOpenGLRenderWindow20GetDesiredVisualInfoEv+0x229) [0x8227e7d]
[vrc1:27394] [ 2] ./DistributedData(_ZN22vtkXOpenGLRenderWindow16WindowInitializeEv+0x340) [0x8226812]
[vrc1:27394] [ 3] ./DistributedData(_ZN22vtkXOpenGLRenderWindow10InitializeEv+0x29) [0x82234f9]
[vrc1:27394] [ 4] ./DistributedData(_ZN22vtkXOpenGLRenderWindow5StartEv+0x29) [0x82235eb]
[vrc1:27394] [ 5] ./DistributedData(_ZN15vtkRenderWindow14DoStereoRenderEv+0x1a) [0x82342ac]
[vrc1:27394] [ 6] ./DistributedData(_ZN15vtkRenderWindow10DoFDRenderEv+0x427) [0x8234757]
[vrc1:27394] [ 7] ./DistributedData(_ZN15vtkRenderWindow10DoAARenderEv+0x5b7) [0x8234d19]
[vrc1:27394] [ 8] ./DistributedData(_ZN15vtkRenderWindow6RenderEv+0x690) [0x82353b4]
[vrc1:27394] [ 9] ./DistributedData(_ZN22vtkXOpenGLRenderWindow6RenderEv+0x52) [0x82245e2]
[vrc1:27394] [10] ./DistributedData [0x819e355]
[vrc1:27394] [11] ./DistributedData(_ZN16vtkMPIController19SingleMethodExecuteEv+0x1ab) [0x837a447]
[vrc1:27394] [12] ./DistributedData(main+0x180) [0x819de78]
[vrc1:27394] [13] /lib/libc.so.6(__libc_start_main+0xe0) [0xb79c0fe0]
[vrc1:27394] [14] ./DistributedData [0x819dc21]
[vrc1:27394] *** End of error message ***
mpirun noticed that job ra

Re: [OMPI users] More on AlltoAll

2008-03-20 Thread Dave Grote





Sorry - my mistake - I meant AlltoAllV, which is what I use in my code.

Ashley Pittman wrote:

  On Thu, 2008-03-20 at 10:27 -0700, Dave Grote wrote:
  
  
After reading the previous discussion on AllReduce and AlltoAll, I 
thought I would ask my question. I have a case where I have data 
unevenly distributed among the processes (unevenly means that the 
processes have differing amounts of data) that I need to globally 
redistribute, resulting in a different uneven distribution. Writing the 
code to do the redistribution using AlltoAll is straightforward.

The problem, though, is that there are often special cases where each 
process only needs to exchange data with its neighbors. So the question 
is: when two processors don't have data to exchange, is the Open MPI 
AlltoAll written in such a way that they don't do any 
communication? Will the AlltoAll be as efficient (or at least nearly as 
efficient) as direct send/recv among neighbors?

  
  
AlltoAll takes a single size of message and communicates that amount of
data from everybody to everybody.  You might want to look at AlltoAllw
and AlltoAllv, though I don't have any experience with either of them.
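
For illustration, here is a minimal C sketch (untested, and the neighbour-ring
pattern is just an assumed example) of the kind of mostly-zero-count exchange
described above, done with MPI_Alltoallv:

  /* Each rank exchanges a few doubles with its two ring neighbours and sets
   * the counts for every other rank to zero. */
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int nitems = 4;                            /* items per neighbour */
      double *sendbuf = calloc(2 * nitems, sizeof(double));
      double *recvbuf = calloc(2 * nitems, sizeof(double));
      int *scounts = calloc(size, sizeof(int));  /* all zero by default */
      int *rcounts = calloc(size, sizeof(int));
      int *sdispls = calloc(size, sizeof(int));
      int *rdispls = calloc(size, sizeof(int));

      int left  = (rank - 1 + size) % size;
      int right = (rank + 1) % size;
      scounts[left]  = rcounts[left]  = nitems;
      scounts[right] = rcounts[right] = nitems;
      sdispls[right] = rdispls[right] = nitems;  /* second half of the buffers */

      MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE,
                    recvbuf, rcounts, rdispls, MPI_DOUBLE, MPI_COMM_WORLD);

      free(sendbuf); free(recvbuf);
      free(scounts); free(rcounts); free(sdispls); free(rdispls);
      MPI_Finalize();
      return 0;
  }

Ranks with nothing to exchange simply leave the corresponding counts at zero,
which is exactly the situation in the question; whether Open MPI internally
short-circuits those zero-count pairs is something the developers would have
to confirm.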

Ashley,


[OMPI users] More on AlltoAll

2008-03-20 Thread Dave Grote


After reading the previous discussion on AllReduce and AlltoAll, I 
thought I would ask my question. I have a case where I have data 
unevenly distributed among the processes (unevenly means that the 
processes have differing amounts of data) that I need to globally 
redistribute, resulting in a different uneven distribution. Writing the 
code to do the redistribution using AlltoAll is straightforward.


The problem, though, is that there are often special cases where each 
process only needs to exchange data with its neighbors. So the question 
is: when two processors don't have data to exchange, is the Open MPI 
AlltoAll written in such a way that they don't do any 
communication? Will the AlltoAll be as efficient (or at least nearly as 
efficient) as direct send/recv among neighbors?

 Thanks!
   Dave


Re: [OMPI users] x11 forwarding

2006-12-04 Thread Dave Grote





OK - I'll live with it for now. Fortunately, the extra output only
occurs at the start and end of the run and doesn't interfere with the
output of my code.

An obvious suggestion for when you get to revamping that part of the
code is to add a new command line flag to keep the ssh sessions running
without turning on the debugging output. I know that others have the
same XForwarding problem and this would offer a general solution.
   Thanks for all of your help!!
  Dave

Ralph Castain wrote:

  
I’m afraid that would be a rather significant job as it plays a rather
significant role in the ssh startup procedure. We have plans to revamp
that portion of the code, but without someone who knows exactly what is
going on and where, you are more likely to break it than revise it.

If you can live with it as-is for now, I would strongly suggest doing
so until we get back to that area.

Just my $0.02.
Ralph

On 12/1/06 4:51 PM, "Dave Grote" wrote:
Is there a place where I can hack the openmpi code to force it to keep
the ssh sessions open without the -d option? I looked through some of
the code, including orterun.c and a few other places, but don't have
the familiarity with the code to find the place.
  Thanks!
     Dave

Galen Shipman wrote:

-d leaves the ssh session open
Try using:

mpirun -d -host boxtop2 -mca pls_rsh_agent "ssh -X -n" xterm -e cat

Note the "ssh -X -n", this will tell ssh not to open stdin..

You should then be able to type characters in the resulting xterm and
have them echo'd back correctly.

- Galen

On Dec 1, 2006, at 11:48 AM, Dave Grote wrote:
Thanks for the suggestion, but it doesn't fix my problem. I did the
same thing you did and was able to get xterms open when using the -d
option. But when I run my code, the -d option seems to play havoc with
stdin. My code normally reads stdin from one processor and it
broadcasts it to the others. This failed when using the -d option and
the code wouldn't take input commands properly.
 
But, since -d did get the X windows working, it must be doing something
differently. What is it about the -d option that allows the windows to
open? If I knew that, it would be the fix to my problem.
   Dave
 
Galen Shipman wrote:

I think this might be as simple as adding "-d" to the mpirun command line

If I run:

mpirun -np 2 -d -mca pls_rsh_agent "ssh -X" xterm -e gdb ./mpi-ping

All is well, I get the xterm's up..

If I run:

mpirun -np 2 -mca pls_rsh_agent "ssh -X" xterm -e gdb ./mpi-ping

I get the following:

/usr/bin/xauth:  error in locking authority file /home/gshipman/.Xauthority
xterm Xt error: Can't open display: localhost:10.0

Have you tried adding "-d"?

Thanks,
Galen

On Nov 30, 2006, at 2:42 PM, Dave Grote wrote:
I don't think that that is the problem. As far as I can tell, the
DISPLAY environment variable is being set properly on the slave (it
will sometimes have a different value than in the shell where mpirun
was executed).
  Dave
 
Ralph H Castain wrote:

Actually, I believe at least some of this may be a bug on our part. We
currently pick up the local environment and forward it on to the remote
nodes as the environment for use by the backend processes. I have seen
quite a few environment variables in that list, including DISPLAY, which
would create the problem you are seeing.

I’ll have to chat with folks here to understand what part of the
environment we absolutely need to carry forward, and what parts we need
to “cleanse” before passing it along.

Ralph

On 11/30/06 10:50 AM, "Dave Grote" wrote:
I'm using caos linux (developed at LBL), which has the wrapper wwmpirun
around mpirun, so my command is something like
wwmpirun -np 8 -- -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
This is essentially the same as
mpirun -np 8 -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
but wwmpirun does the scheduling, for example looking for idle nodes
and creating the host file.
My system is set up with a master/login node which is running a full
version of Linux and slave nodes that run a reduced Linux (that
includes access to the X libraries). wwmpirun always picks the slave
nodes to run on. I've also tried &

Re: [OMPI users] x11 forwarding

2006-12-01 Thread Dave Grote





Is there a place where I can hack the openmpi code to force it to keep
the ssh sessions open without the -d option? I looked through some of
the code, including orterun.c and a few other places, but don't have
the familiarity with the code to find the place.
  Thanks!
     Dave

Galen Shipman wrote:

-d leaves the ssh session open
Try using:

mpirun -d -host boxtop2 -mca pls_rsh_agent "ssh -X -n" xterm -e cat

Note the "ssh -X -n", this will tell ssh not to open stdin..

You should then be able to type characters in the resulting xterm and
have them echo'd back correctly.

- Galen

On Dec 1, 2006, at 11:48 AM, Dave Grote wrote:
Thanks for the suggestion, but it doesn't fix my problem. I did the
same thing you did and was able to get xterms open when using the -d
option. But when I run my code, the -d option seems to play havoc with
stdin. My code normally reads stdin from one processor and it
broadcasts it to the others. This failed when using the -d option and
the code wouldn't take input commands properly.

But, since -d did get the X windows working, it must be doing something
differently. What is it about the -d option that allows the windows to
open? If I knew that, it would be the fix to my problem.
   Dave

Galen Shipman wrote:

I think this might be as simple as adding "-d" to the mpirun command line

If I run:

mpirun -np 2 -d -mca pls_rsh_agent "ssh -X" xterm -e gdb ./mpi-ping

All is well, I get the xterm's up..

If I run:

mpirun -np 2 -mca pls_rsh_agent "ssh -X" xterm -e gdb ./mpi-ping

I get the following:

/usr/bin/xauth:  error in locking authority file /home/gshipman/.Xauthority
xterm Xt error: Can't open display: localhost:10.0

Have you tried adding "-d"?

Thanks,
Galen

On Nov 30, 2006, at 2:42 PM, Dave Grote wrote:
I don't think that that is the problem. As far as I can tell, the
DISPLAY environment variable is being set properly on the slave (it
will sometimes have a different value than in the shell where mpirun
was executed).
  Dave

Ralph H Castain wrote:

  Actually, I believe at least some of this may
be a bug on our part. We currently pick up the local environment and
forward it on to the remote nodes as the environment for use by the
backend processes. I have seen quite a few environment variables in
that list, including DISPLAY, which would create the problem you are
seeing.
  
I’ll have to chat with folks here to understand what part of the
environment we absolutely need to carry forward, and what parts we need
to “cleanse” before passing it along.
  
Ralph
  
  
On 11/30/06 10:50 AM, "Dave Grote"  wrote:
  
  
  
I'm using caos linux (developed at LBL), which has the wrapper wwmpirun
around mpirun, so my command is something like
wwmpirun -np 8 -- -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
This is essentially the same as
mpirun -np 8 -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
but wwmpirun does the scheduling, for example looking for idle nodes
and creating the host file.
My system is set up with a master/login node which is running a full
version of Linux and slave nodes that run a reduced Linux (that
includes access to the X libraries). wwmpirun always picks the slave
nodes to run on. I've also tried "ssh -Y" and it doesn't help. I've set
xhost for the slave nodes in my login shell on the master and that
didn't work. XForwarding is enabled on all of the nodes, so that's not
the problem.

I am able to get it to work by having wwmpirun do the command "ssh -X
node xclock" before starting the parallel program on that same
node, but this only works for the first person who logs into the master
and gets DISPLAY=localhost:10. When someone else tries to run a
parallel job, it seems that DISPLAY is set to localhost:10 on the
slaves and tries to forward through that other person's login with the
same display number and the connection is refused because of wrong
authentication. This seems like very odd behavior. I'm aware that this
may be an issue with the X server (xorg) or with the version of linux,
so I am also seeking help from the person who maintains caos linux. If
it matters, the machine uses myrinet for the interconnects.
  Th

Re: [OMPI users] x11 forwarding

2006-12-01 Thread Dave Grote





Success! The -n option on ssh did the trick. Now, the question is: is
there a way of leaving the ssh sessions open without turning on the
debugging? With the debugging on, it prints out lots of stuff that
I don't want to have to see every time I run my code. I know, I'm just
being picky.
   Thanks!
  Dave

Galen Shipman wrote:

-d leaves the ssh session open
Try using:

mpirun -d -host boxtop2 -mca pls_rsh_agent "ssh -X -n" xterm -e cat

Note the "ssh -X -n", this will tell ssh not to open stdin..

You should then be able to type characters in the resulting xterm and
have them echo'd back correctly.

- Galen

On Dec 1, 2006, at 11:48 AM, Dave Grote wrote:
Thanks for the suggestion, but it doesn't fix my problem. I did the
same thing you did and was able to get xterms open when using the -d
option. But when I run my code, the -d option seems to play havoc with
stdin. My code normally reads stdin from one processor and it
broadcasts it to the others. This failed when using the -d option and
the code wouldn't take input commands properly.

But, since -d did get the X windows working, it must be doing something
differently. What is it about the -d option that allows the windows to
open? If I knew that, it would be the fix to my problem.
   Dave

Galen Shipman wrote:

I think this might be as simple as adding "-d" to the mpirun command line

If I run:

mpirun -np 2 -d -mca pls_rsh_agent "ssh -X" xterm -e gdb ./mpi-ping

All is well, I get the xterm's up..

If I run:

mpirun -np 2 -mca pls_rsh_agent "ssh -X" xterm -e gdb ./mpi-ping

I get the following:

/usr/bin/xauth:  error in locking authority file /home/gshipman/.Xauthority
xterm Xt error: Can't open display: localhost:10.0

Have you tried adding "-d"?

Thanks,
Galen

On Nov 30, 2006, at 2:42 PM, Dave Grote wrote:
I don't think that that is the problem. As far as I can tell, the
DISPLAY environment variable is being set properly on the slave (it
will sometimes have a different value than in the shell where mpirun
was executed).
  Dave

Ralph H Castain wrote:

  Actually, I believe at least some of this may
be a bug on our part. We currently pick up the local environment and
forward it on to the remote nodes as the environment for use by the
backend processes. I have seen quite a few environment variables in
that list, including DISPLAY, which would create the problem you are
seeing.
  
I’ll have to chat with folks here to understand what part of the
environment we absolutely need to carry forward, and what parts we need
to “cleanse” before passing it along.
  
Ralph
  
  
On 11/30/06 10:50 AM, "Dave Grote"  wrote:
  
  
  
I'm using caos linux (developed at LBL), which has the wrapper wwmpirun
around mpirun, so my command is something like
wwmpirun -np 8 -- -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
This is essentially the same as
mpirun -np 8 -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
but wwmpirun does the scheduling, for example looking for idle nodes
and creating the host file.
My system is set up with a master/login node which is running a full
version of Linux and slave nodes that run a reduced Linux (that
includes access to the X libraries). wwmpirun always picks the slave
nodes to run on. I've also tried "ssh -Y" and it doesn't help. I've set
xhost for the slave nodes in my login shell on the master and that
didn't work. XForwarding is enabled on all of the nodes, so that's not
the problem.

I am able to get it to work by having wwmpirun do the command "ssh -X
node xclock" before starting the parallel program on that same
node, but this only works for the first person who logs into the master
and gets DISPLAY=localhost:10. When someone else tries to run a
parallel job, it seems that DISPLAY is set to localhost:10 on the
slaves and tries to forward through that other person's login with the
same display number and the connection is refused because of wrong
authentication. This seems like very odd behavior. I'm aware that this
may be an issue with the X server (xorg) or with the version of linux,
so I am also seeking help from the person who maintains caos linux. If
it matters, the machine uses myrinet for the interconnects.

Re: [OMPI users] x11 forwarding

2006-12-01 Thread Dave Grote





Thanks for the suggestion, but it doesn't fix my problem. I did the
same thing you did and was able to get xterms open when using the -d
option. But when I run my code, the -d option seems to play havoc with
stdin. My code normally reads stdin from one processor and it
broadcasts it to the others. This failed when using the -d option and
the code wouldn't take input commands properly.

But, since -d did get the X windows working, it must be doing something
differently. What is it about the -d option that allows the windows to
open? If I knew that, it would be the fix to my problem.
   Dave

Galen Shipman wrote:

I think this might be as simple as adding "-d" to the mpirun command line

If I run:

mpirun -np 2 -d -mca pls_rsh_agent "ssh -X" xterm -e gdb ./mpi-ping

All is well, I get the xterm's up..

If I run:

mpirun -np 2 -mca pls_rsh_agent "ssh -X" xterm -e gdb ./mpi-ping

I get the following:

/usr/bin/xauth:  error in locking authority file /home/gshipman/.Xauthority
xterm Xt error: Can't open display: localhost:10.0

Have you tried adding "-d"?

Thanks,
Galen

On Nov 30, 2006, at 2:42 PM, Dave Grote wrote:
I don't think that that is the problem. As far as I can tell, the
DISPLAY environment variable is being set properly on the slave (it
will sometimes have a different value than in the shell where mpirun
was executed).
  Dave

Ralph H Castain wrote:
 Actually,
I believe at least some of this may be a bug on our part. We currently
pick up the local environment and forward it on to the remote nodes as
the environment for use by the backend processes. I have seen quite a
few environment variables in that list, including DISPLAY, which would
create the problem you are seeing.
  
I’ll have to chat with folks here to understand what part of the
environment we absolutely need to carry forward, and what parts we need
to “cleanse” before passing it along.
  
Ralph
  
  
On 11/30/06 10:50 AM, "Dave Grote"  wrote:
  
  
  
I'm using caos linux (developed at LBL), which has the wrapper wwmpirun
around mpirun, so my command is something like
wwmpirun -np 8 -- -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
This is essentially the same as
mpirun -np 8 -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
but wwmpirun does the scheduling, for example looking for idle nodes
and creating the host file.
My system is set up with a master/login node which is running a full
version of Linux and slave nodes that run a reduced Linux (that
includes access to the X libraries). wwmpirun always picks the slave
nodes to run on. I've also tried "ssh -Y" and it doesn't help. I've set
xhost for the slave nodes in my login shell on the master and that
didn't work. XForwarding is enabled on all of the nodes, so that's not
the problem.

I am able to get it to work by having wwmpirun do the command "ssh -X
node xclock" before starting the parallel program on that same
node, but this only works for the first person who logs into the master
and gets DISPLAY=localhost:10. When someone else tries to run a
parallel job, it seems that DISPLAY is set to localhost:10 on the
slaves and tries to forward through that other person's login with the
same display number and the connection is refused because of wrong
authentication. This seems like very odd behavior. I'm aware that this
may be an issue with the X server (xorg) or with the version of linux,
so I am also seeking help from the person who maintains caos linux. If
it matters, the machine uses myrinet for the interconnects.
  Thanks!
 Dave

Galen Shipman wrote:

what does your command line look like?

- Galen

On Nov 29, 2006, at 7:53 PM, Dave Grote wrote:
I cannot get X11 forwarding to work using mpirun. I've tried all of the
standard methods, such as setting pls_rsh_agent = ssh -X, using xhost,
and a few other things, but nothing works in general. In the FAQ,
http://www.open-mpi.org/faq/?category=running#mpirun-gui, a reference is
made to other methods, but "they involve sophisticated X forwarding
through mpirun", and no further explanation is given. Can someone tell
me what these other methods are or point me to where I can find info on
them? I've done lots of Google searching and haven't found anything
useful. This is a major issue since my parallel code heavily depends on
having the ability to open X windows on the remote machine. Any and all
help would b

Re: [OMPI users] x11 forwarding

2006-11-30 Thread Dave Grote





I don't think that that is the problem. As far as I can tell, the
DISPLAY environment variable is being set properly on the slave (it
will sometimes have a different value than in the shell where mpirun
was executed).
  Dave

Ralph H Castain wrote:

  
  Actually,
I believe at least some of this may be a bug on our part. We currently
pick up the local environment and forward it on to the remote nodes as
the environment for use by the backend processes. I have seen quite a
few environment variables in that list, including DISPLAY, which would
create the problem you are seeing.
  
I’ll have to chat with folks here to understand what part of the
environment we absolutely need to carry forward, and what parts we need
to “cleanse” before passing it along.
  
Ralph
  
  
On 11/30/06 10:50 AM, "Dave Grote"  wrote:
  
  
  
I'm using caos linux (developed at LBL), which has the wrapper wwmpirun
around mpirun, so my command is something like
wwmpirun -np 8 -- -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
This is essentially the same as
mpirun -np 8 -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
but wwmpirun does the scheduling, for example looking for idle nodes
and creating the host file.
My system is set up with a master/login node which is running a full
version of Linux and slave nodes that run a reduced Linux (that
includes access to the X libraries). wwmpirun always picks the slave
nodes to run on. I've also tried "ssh -Y" and it doesn't help. I've set
xhost for the slave nodes in my login shell on the master and that
didn't work. XForwarding is enabled on all of the nodes, so that's not
the problem.

I am able to get it to work by having wwmpirun do the command "ssh -X
node xclock" before starting the parallel program on that same
node, but this only works for the first person who logs into the master
and gets DISPLAY=localhost:10. When someone else tries to run a
parallel job, it seems that DISPLAY is set to localhost:10 on the
slaves and tries to forward through that other person's login with the
same display number and the connection is refused because of wrong
authentication. This seems like very odd behavior. I'm aware that this
may be an issue with the X server (xorg) or with the version of linux,
so I am also seeking help from the person who maintains caos linux. If
it matters, the machine uses myrinet for the interconnects.
  Thanks!
 Dave

Galen Shipman wrote: 
    
 
what does your command line look like?
  
- Galen
  
On Nov 29, 2006, at 7:53 PM, Dave Grote wrote:
  
  
 
  
   
I cannot get X11 forwarding to work using mpirun. I've tried all of the
standard methods, such as setting pls_rsh_agent = ssh -X, using xhost,
and a few other things, but nothing works in general. In the FAQ,
http://www.open-mpi.org/faq/?category=running#mpirun-gui, a reference is
made to other methods, but "they involve sophisticated X forwarding
through mpirun", and no further explanation is given. Can someone tell
me what these other methods are or point me to where I can find info on
them? I've done lots of Google searching and haven't found anything
useful. This is a major issue since my parallel code heavily depends on
having the ability to open X windows on the remote machine. Any and all
help would be appreciated!
  Thanks!
 Dave





Re: [OMPI users] x11 forwarding

2006-11-30 Thread Dave Grote





I'm using caos linux (developed at LBL), which has the wrapper wwmpirun
around mpirun, so my command is something like
wwmpirun -np 8 -- -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
This is essentially the same as
mpirun -np 8 -x PYTHONPATH --mca pls_rsh_agent '"ssh -X"'
/usr/local/bin/pyMPI
but wwmpirun does the scheduling, for example looking for idle nodes
and creating the host file.
My system is set up with a master/login node which is running a full
version of Linux and slave nodes that run a reduced Linux (that
includes access to the X libraries). wwmpirun always picks the slave
nodes to run on. I've also tried "ssh -Y" and it doesn't help. I've set
xhost for the slave nodes in my login shell on the master and that
didn't work. XForwarding is enabled on all of the nodes, so that's not
the problem.

I am able to get it to work by having wwmpirun do the command "ssh -X
node xclock" before starting the parallel program on that same
node, but this only works for the first person who logs into the master
and gets DISPLAY=localhost:10. When someone else tries to run a
parallel job, it seems that DISPLAY is set to localhost:10 on the
slaves and tries to forward through that other person's login with the
same display number and the connection is refused because of wrong
authentication. This seems like very odd behavior. I'm aware that this
may be an issue with the X server (xorg) or with the version of linux,
so I am also seeking help from the person who maintains caos linux. If
it matters, the machine uses myrinet for the interconnects.
  Thanks!
 Dave

Galen Shipman wrote:

  what does your command line look like?

- Galen

On Nov 29, 2006, at 7:53 PM, Dave Grote wrote:

  
  
I cannot get X11 forwarding to work using mpirun. I've tried all of the
standard methods, such as setting pls_rsh_agent = ssh -X, using xhost,
and a few other things, but nothing works in general. In the FAQ,
http://www.open-mpi.org/faq/?category=running#mpirun-gui, a reference is
made to other methods, but "they involve sophisticated X forwarding
through mpirun", and no further explanation is given. Can someone tell
me what these other methods are or point me to where I can find info on
them? I've done lots of Google searching and haven't found anything
useful. This is a major issue since my parallel code heavily depends on
having the ability to open X windows on the remote machine. Any and all
help would be appreciated!
  Thanks!
 Dave





[OMPI users] x11 forwarding

2006-11-29 Thread Dave Grote


I cannot get X11 forwarding to work using mpirun. I've tried all of the 
standard methods, such as setting pls_rsh_agent = ssh -X, using xhost, 
and a few other things, but nothing works in general. In the FAQ, 
http://www.open-mpi.org/faq/?category=running#mpirun-gui, a reference is 
made to other methods, but "they involve sophisticated X forwarding 
through mpirun", and no further explanation is given. Can someone tell 
me what these other methods are or point me to where I can find info on 
them? I've done lots of Google searching and haven't found anything 
useful. This is a major issue since my parallel code heavily depends on 
having the ability to open X windows on the remote machine. Any and all 
help would be appreciated!

 Thanks!
Dave


[OMPI users] allreduce produces "error(8) registering gm memory"

2006-08-21 Thread Dave Grote


I have attached a small program that when run on my machine produces the 
error message below and locks up.


[node:06319] [mpool_gm_module.c:100] error(8) registering gm memory

I get the error when I run with 32 processors, but not with 4 (even if I 
increase the loop count to 2). This is on a cluster of dual dual-core 
Opterons with Myrinet switches (i.e. using the gm routines). 
Unfortunately, I don't have the configure options that were used to 
build Open MPI, but I don't think there was anything unusual. I've also 
attached the ompi_info output. Here is the compile line for the code:


g95 -o allreducetest allreducetest.F -I/usr/local/ompi/1.1-gcc/include 
-L/usr/local/ompi/1.1-gcc/lib -lmpi


Also note that I did have to make changes to the Fortran include files 
in Open MPI to force all of the integers to be of size 4 (i.e. declaring 
them integer(4)), since the default integer size used by g95 is 8 bytes 
but the Open MPI Fortran interface was compiled with f77, which uses 
4-byte integers.


Any suggestions on what to look for?
  Thanks for the help,
  Dave
  program parallel_sum_mmnts
  real(kind=8):: zmmnts(0:360,28,0:8)

c Use reduction routines to sum whole beam moments across all
c of the processors.  It also shares z moment data at PE boundaries.

c --- temporary for z moments
  real(kind=8),allocatable:: ztemp(:,:,:)
  integer(4):: nn,nslaves,my_index,ii
  include "mpif.h"
  integer(4):: mpierror

  call MPI_INIT(mpierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nslaves,mpierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD,my_index,mpierror)

  do ii=1,2

  print*,"PSM1 ",ii,my_index

  zmmnts0 = my_index
  zmmnts = my_index

  allocate(ztemp(0:360,28,0:8))

c --- Do reduction on beam z moments.
  ztemp = zmmnts
  nn = (1+360)*28*(1+8)
  print*,"PSM1 ",my_index,nn
  call MPI_ALLREDUCE(ztemp,zmmnts,nn,
 &   MPI_DOUBLE_PRECISION,MPI_SUM,MPI_COMM_WORLD,mpierror)

  print*,"PSM2 ",my_index

  deallocate(ztemp)

  enddo

  stop
  end


oinfo.gz
Description: GNU Zip compressed data