from:"jody"

Re: [OMPI users] mpirun gives error when option '--hostfiles' or '--hosts' is used

2016-05-04 Thread jody

Actually all machines use iptables as firewall.

I compared the rules triops and kraken use and found that triops had the
line
  REJECT all  --  anywhere anywhere reject-with icmp-host-prohibited
which kraken did not have (otherwise they were identical).
I removed that line from triops' rules, restarted iptables and now
communication works in all directions!
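
The comparison can also be scripted. This is only a minimal sketch: it assumes the rule listing of each host (for example the output of "iptables -L", run as root) has been saved to a text file. The function name find_host_prohibited and the file names are made up for illustration.

```shell
# Print every REJECT rule that answers with icmp-host-prohibited.
# Reads an iptables rule listing on stdin.
find_host_prohibited() {
    grep -E '^[ \t]*REJECT[ \t].*reject-with[ \t]+icmp-host-prohibited'
}

# Example (triops_rules.txt and kraken_rules.txt are hypothetical dumps):
#   find_host_prohibited < triops_rules.txt
#   diff triops_rules.txt kraken_rules.txt
```

If the function prints a line for one host and nothing for the other, the two rule sets differ in the way described here.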

Thank You
  Jody

On Tue, May 3, 2016 at 7:00 PM, Jeff Squyres (jsquyres) 
wrote:

> Have you disabled firewalls between these machines?

Re: [OMPI users] mpirun gives error when option '--hostfiles' or '--hosts' is used

2016-05-03 Thread jody

...my bad!

I had set up things so that PATH and LD_LIBRARY_PATH were correct in
interactive mode,
but they were wrong when ssh was called non-interactively.
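
A quick way to see what a non-interactive ssh command finds on the remote side is sketched below. It assumes password-less ssh as described in this thread; the function name remote_ompi_info and the SSH override variable are hypothetical conveniences, not part of Open MPI.

```shell
# Show which mpirun (and which version) a NON-interactive ssh session sees.
# SSH can be overridden (e.g. for testing); it defaults to plain ssh.
remote_ompi_info() {
    host=$1
    ${SSH:-ssh} "$host" 'command -v mpirun; mpirun --version 2>&1 | head -n 1; echo "PATH=$PATH"; echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"'
}

# Example: remote_ompi_info kraken
```

Many default .bashrc files return early when the shell is not interactive, so settings placed after that point are not seen by commands started this way.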

Now i have a new problem:
When i do
  mpirun -np 6 --hostfile krakenhosts hostname
from triops, sometimes it seems to hang (i.e. no output, doesn't end)
and at other times i get the output

[aim-kraken:24527] [[45056,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
...
--
-
Again, i can call mpirun on triops from kraken and all squid_XX without a
problem...

What could cause this problem?

Thank You
  Jody


On Tue, May 3, 2016 at 2:54 PM, Jeff Squyres (jsquyres) 
wrote:

> Have you verified that you are running the same version of Open MPI on
> both servers when launched from non-interactive logins?
>
> This kind of error is somewhat typical if you accidentally mixed, for
> example, Open MPI v1.6.x and v1.10.2 (i.e., v1.10.2 understands the
> --hnp-topo-sig back end option, but v1.6.x does not).
>
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29074.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29075.php
>

[OMPI users] mpirun gives error when option '--hostfiles' or '--hosts' is used

2016-05-03 Thread jody

Hi
I have installed Open MPI v 1.10.2 on two machines today using only the
prefix-option for configure, and then doing 'make all install'.

On both machines i changed .bashrc to set PATH and LD_LIBRARY_PATH
correctly.
(I checked by running 'mpirun --version' and verifying that the output does
indeed say 1.10.2)

Password-less ssh is enabled on both machines in both directions.

When i start mpirun from one machine (kraken) with a hostfile specifying
the other machine ("triops slots=8 max-slots=8"),
it works:
-
jody@kraken ~ $ mpirun -np 3 --hostfile triopshosts uptime
 12:24:04 up 7 days, 43 min, 17 users,  load average: 0.06, 0.68, 0.65
 12:24:04 up 7 days, 43 min, 17 users,  load average: 0.06, 0.68, 0.65
 12:24:04 up 7 days, 43 min, 17 users,  load average: 0.06, 0.68, 0.65
-

But when i start mpirun from triops with a hostfile specifying kraken
("kraken slots=8 max-slots=8"),
it fails:
-
jody@triops ~ $ mpirun -np 3 --hostfile krakenhosts hostname
[aim-kraken:21973] Error: unknown option "--hnp-topo-sig"
input in flex scanner failed
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--

The same error happens when i use '--host kraken'.

I verified that PATH and LD_LIBRARY_PATH are correctly set on both machines.
And on both machines /tmp is readable, writeable and executable for all.
The connection should be okay (i can do a ssh from kraken to triops and
vice versa).

Any idea what the problem is?

Thank You
  Jody

Re: [OMPI users] run a program

2014-02-26 Thread jody

Hi Raha
Yes, that is correct.
You have to make sure that max-slots is less than or equal to the number of
cpus in the node to avoid oversubscribing it.
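
To compare a hostfile with the value passed to -np, the slots entries can be added up. This is a small sketch only: the function name total_slots and the file name myhosts are made up, and every host line is assumed to use the "host slots=N max-slots=N" form.

```shell
# Sum the slots=N entries of an Open MPI hostfile.
# Fields beginning with "slots=" are counted; "max-slots=" is ignored.
total_slots() {
    awk '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=/) { sub(/^slots=/, "", $i); sum += $i }
        }
        END { print sum + 0 }
    ' "$1"
}

# Example: mpirun -np "$(total_slots myhosts)" --hostfile myhosts ./a.out
```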

Have a look at the other entries in the FAQ; they give information on many
other options you can use.
   http://www.open-mpi.org/faq/?category=running

Jody


On Wed, Feb 26, 2014 at 10:38 AM, raha khalili wrote:

> Dear Jody
>
> Thank you for your reply. Based on the hostfile examples you showed me, I
> understand that 'slots' is the number of cpus of each node I mention in the
> file. Is that correct?
>
> Wishes
>
>
> --
> Khadije Khalili
> Ph.D Student of Solid-State Physics
> Department of Physics
> University of Mazandaran
> Babolsar, Iran
> kh.khal...@stu.umz.ac.ir

Re: [OMPI users] run a program

2014-02-26 Thread jody

Hi
I think you should use the "--host" or "--hostfile" options:
  http://www.open-mpi.org/faq/?category=running#simple-spmd-run
  http://www.open-mpi.org/faq/?category=running#mpirun-host
Hope this helps
  Jody


On Wed, Feb 26, 2014 at 8:31 AM, raha khalili wrote:

> Dear Users
>
> This is my first post in the open-mpi forum and I am a beginner in using mpi.
> I want to run a program across 4 systems, consisting of one server
> and three nodes, with 20 cpus. When I run
>   mpirun -np 20 /home/khalili/espresso-5.0.2/bin/pw.x -in si.in | tee si.out
> and then look at htop in a terminal, it seems the program doesn't use the cpus
> of the three other nodes and just uses the cpus of the server. Could you
> please tell me how I can use all my cpus?
>
> Regards
> --
> Khadije Khalili
> Ph.D Student of Solid-State Physics
> Department of Physics
> University of Mazandaran
> Babolsar, Iran
> kh.khal...@stu.umz.ac.ir

Re: [OMPI users] MPI send recv confusion

2013-02-18 Thread jody

Hi Pradeep

I am not sure if this is the reason, but usually it is a bad idea to
force an order of receives (such as you do in your receive loop:
first from sender 1, then from sender 2, then from sender 3).
Unless you implement it so, there is no guarantee the sends are
performed in this order.

It is better if you accept messages from all senders (MPI_ANY_SOURCE)
instead of particular ranks and then check where the
message came from by examining the status fields
(http://www.mpi-forum.org/docs/mpi22-report/node47.htm).

Hope this helps
  Jody


On Mon, Feb 18, 2013 at 5:06 PM, Pradeep Jha
 wrote:
> I have attached a sample of the MPI program I am trying to write. When I run
> this program using "mpirun -np 4 a.out", my output is:
>
>  Sender:1
>  Data received from1
>  Sender:2
>  Data received from1
>  Sender:2
>
> And the run hangs there. I don't understand why the "sender" variable
> changes its value after MPI_recv. Any ideas?
>
> Thank you,
>
> Pradeep
>
>
>       program mpi_test
>
>       include  'mpif.h'
>
> !( Initialize variables )
>       integer, dimension(3) :: recv, send
>
>       integer :: sender, np, rank, ierror
>
>       call  mpi_init( ierror )
>       call  mpi_comm_rank( mpi_comm_world, rank, ierror )
>       call  mpi_comm_size( mpi_comm_world, np, ierror )
>
> !( Main program )
>
> ! receive the data from the other processors
>       if (rank.eq.0) then
>          do sender = 1, np-1
>             print *, "Sender: ", sender
>             call mpi_recv(recv, 3, mpi_int, sender, 1,
>      &           mpi_comm_world, status, ierror)
>             print *, "Data received from ",sender
>          end do
>       end if
>
> !   send the data to the main processor
>       if (rank.ne.0) then
>          send(1) = 3
>          send(2) = 4
>          send(3) = 4
>          call mpi_send(send, 3, mpi_int, 0, 1, mpi_comm_world, ierr)
>       end if
>
>
> !( clean up )
>       call mpi_finalize(ierror)
>
>       return
>       end program mpi_test

Re: [OMPI users] mpi job is blocked

2012-09-25 Thread jody

Hi Richard

When a collective call hangs, this usually means that one (or more)
processes did not reach this command.
Are you sure that all processes reach the allreduce statement?

If something like this happens to me, i insert print statements just
before the MPI-call so i can see which processes made
it to this point and which ones did not.
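
With more than a handful of processes it helps to filter that output. The sketch below assumes each process prints a line of the (made-up) form "rank N before allreduce" just before the call, and that the output of the run was saved to a file; missing_ranks is a hypothetical helper.

```shell
# List the ranks (0 .. np-1) that never printed the marker line.
# $1 = number of processes, $2 = file with the captured program output
missing_ranks() {
    np=$1
    log=$2
    r=0
    while [ "$r" -lt "$np" ]; do
        grep -q "^rank $r before allreduce" "$log" || echo "$r"
        r=$((r + 1))
    done
}

# Example: missing_ranks 3 run.log
```

The print should be flushed right away (e.g. fflush(stdout) in C), otherwise a process may have reached the call without its line showing up in the output.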

Hope this helps a bit
  Jody

On Tue, Sep 25, 2012 at 8:20 AM, Richard  wrote:
> I have 3 computers with the same Linux system. I have set up the mpi cluster
> based on ssh connection.
> I have tested a very simple mpi program, it works on the cluster.
>
> To make my story clear, I name the three computers A, B and C.
>
> 1) If I run the job with 2 processes on A and B, it works.
> 2) if I run the job with 3 processes on A, B and C, it is blocked.
> 3) if I run the job with 2 processes on A and C, it works.
> 4) If I run the job with all the 3 processes on A, it works.
>
> Using gdb I found the line at which it is blocked, it is here
>
> #7  0x2ad8a283043e in PMPI_Allreduce (sendbuf=0x7fff09c7c578,
> recvbuf=0x7fff09c7c570, count=1, datatype=0x627180, op=0x627780,
> comm=0x627380)
> at pallreduce.c:105
> 105 err = comm->c_coll.coll_allreduce(sendbuf, recvbuf, count,
>
> It seems that there is a communication problem between some computers. But
> the above series of tests cannot tell me what
> exactly it is. Can anyone help me? Thanks.
>
> Richard

Re: [OMPI users] deprecated MCA parameter

2012-08-28 Thread jody

Thanks Ralph

I renamed the parameter in my script,
and now there are no more ugly messages :)
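
The rename itself is a one-line substitution from plm_rsh_agent to orte_rsh_agent, the name Ralph mentions. A minimal sketch, assuming the mpirun call lives in a script file; rename_rsh_agent is a hypothetical helper that reads the script on stdin.

```shell
# Replace the deprecated parameter name with orte_rsh_agent.
rename_rsh_agent() {
    sed 's/plm_rsh_agent/orte_rsh_agent/g'
}

# Example (run_xterm.sh is a hypothetical script name):
#   rename_rsh_agent < run_xterm.sh > run_xterm.new.sh
```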

Jody

On Tue, Aug 28, 2012 at 3:17 PM, Ralph Castain  wrote:
> Ah, I see - yeah, the parameter technically is being renamed to 
> "orte_rsh_agent" to avoid having users need to know the internal topology of 
> the code base (i.e., that it is in the plm framework and the rsh component). 
> It will always be there, though - only the name is changing to protect the 
> innocent. :-)

Re: [OMPI users] deprecated MCA parameter

2012-08-28 Thread jody

Hi Ralph

I get one of these messages
--
A deprecated MCA parameter value was specified in the environment or
on the command line.  Deprecated MCA parameters should be avoided;
they may disappear in future releases.

  Deprecated parameter: plm_rsh_agent
--
for every process that starts...

My openmpi version is 1.6 (gentoo package sys-cluster/openmpi-1.6-r1)

jody

On Tue, Aug 28, 2012 at 2:38 PM, Ralph Castain  wrote:
> Guess I'm confused - what is the issue here? The param still exists:
>
>  MCA plm: parameter "plm_rsh_agent" (current value: <ssh : rsh>, data source: default value, synonyms:
>   pls_rsh_agent, orte_rsh_agent)
>   The command used to launch executables on remote nodes (typically either "ssh" or "rsh")
>
> I am unaware of any plans to deprecate it. Is there a problem with it?

[OMPI users] deprecated MCA parameter

2012-08-28 Thread jody

Hi

In order to open an xterm for each of my processes i use the MCA
parameter 'plm_rsh_agent'
like this:
  mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 -mca plm_rsh_agent "ssh -Y" --leave-session-attached xterm -hold -e ./MPIProg

Without the option ' -mca plm_rsh_agent "ssh -Y"' i can't open windows
from the remote:

jody@boss /mnt/data1/neander $  mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 --leave-session-attached xterm -hold -e ./MPIStruct
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

Is there some replacement for this parameter,
or how else can i get mpi to use "ssh -Y" for its connections?

Thank You
  jody

Re: [OMPI users] MPI_Irecv: Confusion with <count> input parameter

2012-08-21 Thread jody

Hi Devendra

MPI has no way of knowing how big your receive buffer is -
that's why you have to pass the "count" argument, to tell MPI
how many items of your data type (in your case, how many bytes)
it may copy to your receive buffer.

When data arrives that is longer than the number you
specified in the "count" argument, the data will be cut off after
count bytes (and an error will be returned).
Any shorter amount of data will be copied to your receive buffer
and the call to MPI_Recv will terminate successfully.

It is your responsibility to pass the correct value of "count".

If you expect data of 160 bytes you have to allocate a buffer
with a size greater than or equal to 160 and you have to set your
"count" parameter to the size you allocated.

If you want to receive data in chunks, you have to send it in chunks.

I hope this helps
  Jody


On Tue, Aug 21, 2012 at 10:01 AM, devendra rai  wrote:
> Hello Jeff and Hristo,
>
> Now I am completely confused:
>
> So, let's say, the complete reception requires 8192 bytes. And, I have:
>
> MPI_Irecv(
>     (void*)this->receivebuffer,  /* the receive buffer */
>     this->receive_packetsize,    /* 80 */
>     MPI_BYTE,                    /* The data type expected */
>     this->transmittingnode,      /* The node from which to receive */
>     this->uniquetag,             /* Tag */
>     MPI_COMM_WORLD,              /* Communicator */
>     &Irecv_request               /* request handle */
> );
>
>
> That means, the MPI_Test will tell me that the reception is complete
> when I have received the first 80 bytes. Correct?
>
> Next, let's say that I have a receive buffer with a capacity of 160 bytes,
> then, will overflow error occur here? Even if I have decided to receive a
> large payload in chunks of 80 bytes?
>
> I am sorry, the manual and the API reference were too vague for me.
>
> Thanks a lot
>
> Devendra
> 
> From: "Iliev, Hristo" 
> To: Open MPI Users 
> Cc: devendra rai 
> Sent: Tuesday, 21 August 2012, 9:48
> Subject: Re: [OMPI users] MPI_Irecv: Confusion with <count> input
> parameter
>
> Jeff,
>
>>> Or is it the number of elements that are expected to be received, and
>>> hence MPI_Test will tell me that the receive is not complete until "count"
>>> number of elements have not been received?
>>
>> Yes.
>
> Answering "Yes" to this question might further the confusion there. The "count"
> argument specifies the *capacity* of the receive buffer and the receive
> operation (blocking or not) will complete successfully for any matching
> message with size up to "count", even for an empty message with 0 elements,
> and will produce an overflow error if the received message was longer and
> data truncation has to occur.
>
> On 20.08.2012, at 16:32, Jeff Squyres  wrote:
>
>> On Aug 20, 2012, at 5:51 AM, devendra rai wrote:
>>
>>> Is it the number of elements that have been received *thus far* in the
>>> buffer?
>>
>> No.
>>
>>> Or is it the number of elements that are expected to be received, and
>>> hence MPI_Test will tell me that the receive is not complete until "count"
>>> number of elements have not been received?
>>
>> Yes.
>>
>>> Here's the reason why I have a problem (and I think I may be completely
>>> stupid here, I'd appreciate your patience):
>> [snip]
>>> Does anyone see what could be going wrong?
>>
>> Double check that the (sender_rank, tag, communicator) tuple that you
>> issued in the MPI_Irecv matches the (rank, tag, communicator) tuple from the
>> sender (tag and communicator are arguments on the sending side, and rank is
>> the rank of the sender in that communicator).
>>
>> When receives block without completing like this, it usually
>> means a mismatch between the tuples.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>
> --
> Hristo Iliev, Ph.D. -- High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241 80 24367 -- Fax/UMS: +49 241 80 624367

Re: [OMPI users] Sharing (not copying) data with OpenMPI?

2012-04-17 Thread jody

Hi

Thank You all for your replies.
I'll certainly look into the MPI 3.0 RMA link (out of pure interest)
but i am afraid i can't go bleeding edge, because my application
will also have to run on another machine.

As to OpenMP: i already make use of OpenMP in some places (for
instance for the creation of the large data block),
but unfortunately my main application is not well suited for OpenMP
parallelization.

I guess i'll have to take a more detailed look at my problem to see if i
can restructure it in a good way...

Thank You
  Jody


On Mon, Apr 16, 2012 at 11:16 PM, Brian Austin  wrote:
> Maybe you meant to search for OpenMP instead of Open-MPI.
> You can achieve something close to what you want by using OpenMP for on-node
> parallelism and MPI for inter-node communication.
> -Brian
>
>
>
> On Mon, Apr 16, 2012 at 11:02 AM, George Bosilca 
> wrote:
>>
>> No, currently there is no way in MPI (and subsequently in Open MPI) to
>> achieve this. However, in the next version of the MPI standard there will be
>> a function allowing processes to share a memory segment
>> (https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/284).
>>
>> If you like living on the bleeding edge, you can try Brian's branch
>> implementing the MPI 3.0 RMA operations (including the shared memory
>> segment) from http://svn.open-mpi.org/svn/ompi/tmp-public/mpi3-onesided/.
>>
>>  george.

[OMPI users] Sharing (not copying) data with OpenMPI?

2012-04-16 Thread jody

Hi

In my application i have to generate a large block of data (several
gigs) which subsequently has to be accessed by all processes (read
only).
Because of its size, it would take quite some time to serialize and
send the data to the different processes. Furthermore, i risk
running out of memory if this data is instantiated more than once on
one machine.

Does OpenMPI offer some way of sharing data between processes (on the
same machine) without needing to send (and therefore copy) it?

Or would i have to do this by means of creating shared memory, writing
to it, and then make it accessible for reading by the processes?

Thank You
  Jody

Re: [OMPI users] (no subject)

2012-03-16 Thread jody

Hi

Did you run your program with mpirun?
For example:
   mpirun -np 4 ./a.out

jody

On Fri, Mar 16, 2012 at 7:24 AM, harini.s ..  wrote:
> Hi ,
>
> I am very new to openMPI and I just installed openMPI 1.4.5 on a Linux
> platform. Now I am trying to run the examples from the downloaded
> folder. But when I run one, I get this:
>
>>> a.out: error while loading shared libraries: libmpi.so.0: cannot open 
>>> shared object file: No such file or directory
>
> I got a.out when I compiled hello_c.c using the mpicc command.
> Please help me to resolve this problem.

[OMPI users] MPI_Intercomm_create hangs

2012-01-23 Thread jody

Hi
I've got a really strange problem:

I've got an application which creates intercommunicators between a
master and some workers.

When i run it on our cluster with 11 processes it works,
when i run it with 12 processes it hangs inside MPI_Intercomm_create().

This is the hostfile:
  squid_0.uzh.ch  slots=3  max-slots=3
  squid_1.uzh.ch  slots=2  max-slots=2
  squid_2.uzh.ch  slots=1  max-slots=1
  squid_3.uzh.ch  slots=1  max-slots=1
  triops.uzh.ch   slots=8 max-slots=8

Actually all squid_X have 4 cores, but i managed to reduce the number of
processes needed for failure by making the above settings.

So with all available squid cores and 3 triops cores it works,
but with 4 triops cores it hangs.

On the other hand, if i use all 16 squid cores (but no triops cores)
it works, too.

If i start the application not from triops, but from another workstation,
i have a similar pattern of Intercomm_create failures.

Note that with the above hostfile a simple HelloMPI works also with 14
or more processes.

The frustrating thing is that this exact same code has worked before!

Does anybody have an explanation?
Thank You

I managed to simplify the application:

#include <stdio.h>
#include "mpi.h"

int main(int iArgC, char *apArgV[]) {
    int iResult = 0;
    int iNumProcs = 0;
    int iID = -1;

    MPI_Init(&iArgC, &apArgV);

    MPI_Comm_size(MPI_COMM_WORLD, &iNumProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &iID);

    /* rank 0 is the master (group 0), all other ranks are workers (group 1) */
    int iKey;
    if (iID == 0) {
        iKey = 0;
    } else {
        iKey = 1;
    }

    MPI_Comm  commInter1;
    MPI_Comm  commInter2;
    MPI_Comm  commIntra;

    MPI_Comm_split(MPI_COMM_WORLD, iKey, iID, &commIntra);

    int iRankM;
    MPI_Comm_rank(commIntra, &iRankM);
    printf("Local rank: %d\n", iRankM);

    switch (iKey) {
    case 0:
        /* master side: remote leader is world rank 1 */
        printf("Creating intercomm 1 for Master (%d)\n", iID);
        MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 1, 01, &commInter2);
        break;
    case 1:
        /* worker side: remote leader is world rank 0 */
        printf("Creating intercomm 1 for FH (%d)\n", iID);
        MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 0, 01, &commInter1);
        break;
    }

    printf("finalizing\n");
    MPI_Finalize();

    printf("exiting with %d\n", iResult);
    return iResult;
}

Re: [OMPI users] Passwordless ssh

2011-12-21 Thread jody

Hi

You also must make sure that all slaves can
connect via ssh to each other and to the master
node without a password.

Jody


On Wed, Dec 21, 2011 at 3:57 AM, Shaandar Nyamtulga  wrote:
> Can you clarify your answer please.
> I have one master node and other slave nodes. I created rsa key on my master
> node and copied it to all slaves.
> /home/mpiuser directory of all nodes are shared through NFS. The strange
> thing is that it requires a password after I mount a slave and ssh to the
> slave.
> When I dismount I can ssh without password.
>
>
> 
> Date: Tue, 20 Dec 2011 10:45:12 +0100
> From: mathieu.westp...@obs.ujf-grenoble.fr
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Passwordless ssh
>
>
> Hello
>
> You have to append nodeX's public key to the end of nodeY's authorized_keys.
>
>
> Mathieu
> On 20/12/2011 05:03, Shaandar Nyamtulga wrote:
>
> Hi
> I built a Beowulf cluster using OpenMPI, following this link:
> http://techtinkering.com/2009/12/02/setting-up-a-beowulf-cluster-using-open-mpi-on-linux/
> I can access my nodes without a password before mounting my slaves.
> But when I mount my slaves and run a program, it asks for passwords again.
>
> $ eval `ssh-agent`
>
> $ ssh-add ~/.ssh/id_dsa
>
> The above is not working. Terminal gives the reply "Could not open a
> connection to your authentication agent."
>
> Help is needed urgently.
>
> Thank you

Re: [OMPI users] Error using hostfile

2011-07-09 Thread jody

Hi
Successful interactive runs on the individual nodes are not sufficient
evidence: PATH and LD_LIBRARY_PATH must also be set correctly for
non-interactive logins, which is how mpirun starts things on remote machines.

Check this FAQ
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path

To see if your variables are set correctly for
non-interactive sessions on your nodes,
you can execute
  mpirun --hostfile hostfile -np 4 printenv
and scan the output for PATH and LD_LIBRARY_PATH.

Hope this helps
  Jody
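
If the full printenv output is too noisy, a small helper along the
following lines prints only the two relevant variables, prefixed with
the host name. This is just a sketch: the function names are made up,
and it deliberately makes no MPI calls, so it still starts on a node
where LD_LIBRARY_PATH is wrong.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write "host: NAME=value" into buf.
 * Returns whatever snprintf returns. */
int format_env_line(char *buf, size_t len, const char *name) {
    char host[256];
    assert(buf != NULL && name != NULL);
    if (gethostname(host, sizeof host) != 0) {
        strcpy(host, "unknown-host");
    }
    host[sizeof host - 1] = '\0'; /* gethostname may not terminate on truncation */
    const char *value = getenv(name);
    return snprintf(buf, len, "%s: %s=%s", host, name,
                    value != NULL ? value : "(not set)");
}

/* Print the two variables that matter for finding the Open MPI
 * binaries and libraries on this node. */
void print_env_report(void) {
    static const char *names[] = { "PATH", "LD_LIBRARY_PATH" };
    char line[8192];
    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i) {
        format_env_line(line, sizeof line, names[i]);
        puts(line);
    }
}
```

With a main() that only calls print_env_report(), and the binary
available on all nodes under a name such as envcheck (hypothetical), it
would be started like the printenv call above:
  mpirun --hostfile hostfile -np 4 ./envcheck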


On Sat, Jul 9, 2011 at 12:25 AM, Mohan, Ashwin  wrote:
> Thanks Ralph.
>
>
>
> I have emailed the network admin on the firewall issue.
>
>
>
> About the PATH and LIBRARY PATH issue, is it sufficient evidence that the
> paths are set correctly if I am able to compile and run successfully on
> the individual nodes mentioned in the machine file?
>
>
>
> Thanks,
> Ashwin.
>
>
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Friday, July 08, 2011 1:58 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Error using hostfile
>
>
>
> Is there a firewall in the way? The error indicates that daemons were
> launched on the remote machines, but failed to communicate back.
>
>
>
> Also, check that your remote PATH and LD_LIBRARY_PATH are being set to the
> right place to pickup this version of OMPI. Lots of systems deploy with
> default versions that may not be compatible, so if you wind up running a
> daemon on the remote node that comes from another installation, things won't
> work.
>
>
>
>
>
> On Jul 8, 2011, at 10:52 AM, Mohan, Ashwin wrote:
>
> Hi,
>
> I am following up on a previous error posted. Based on the previous
> recommendation, I did set up a passwordless SSH login.
>
>
>
> I created a hostfile comprising 4 nodes (w/ each node having 4 slots). I
> tried to run my job on 4 slots but get no output. Hence, I end up killing
> the job. I am trying to run a simple MPI program on 4 nodes and trying to
> figure out what could be the issue.  What could I check to ensure that I can
> run jobs on 4 nodes (each node has 4 slots)?
>
>
>
> Here is the simple MPI program I am trying to execute on 4 nodes
>
> **
>
> if (my_rank != 0)
> {
>     sprintf(message, "Greetings from the process %d!", my_rank);
>     dest = 0;
>     MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
> }
> else
> {
>     for (source = 1; source < p; source++)
>     {
>         MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
>         printf("%s\n", message);
>     }
> }
>
>
>
> 
>
> My hostfile looks like this:
>
>
>
> [amohan@myocyte48 ~]$ cat hostfile
>
> myocyte46
>
> myocyte47
>
> myocyte48
>
> myocyte49
>
> ***
>
>
>
> I use the following run command: mpirun --hostfile hostfile -np 4 new46
>
> And receive a blank screen. Hence, I have to kill the job.
>
>
>
> OUTPUT ON KILLING JOB:
>
> mpirun: killing job...
>
> --
>
> mpirun noticed that the job aborted, but has no info as to the process
>
> that caused that situation.
>
> --
>
> --
>
> mpirun was unable to cleanly terminate the daemons on the nodes shown
>
> below. Additional manual cleanup may be required - please refer to
>
> the "orte-clean" tool for assistance.
>
> --
>
>     myocyte46 - daemon did not report back when launched
>
>     myocyte47 - daemon did not report back when launched
>
>     myocyte49 - daemon did not report back when launched
>
>
>
> Thanks,
>
> Ashwin.
>
>
>
>
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Wednesday, July 06, 2011 6:46 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Error using hostfile
>
>
>
> Please see http://www.open-mpi.org/faq/?category=rsh#ssh-keys
>
>
>
>
>
> On Jul 6, 2011, at 5:09 PM, Mohan, Ashwin wrote:
>
>
> Hi,
>
>
>
> I use the following command (mpirun --prefix /usr/local/openmpi1.4.3 -np 4
> hello) to successfully execute a simple hello world command on a single
> node.  Each node has 4 slots.  Following the successful execution on one
>

Re: [OMPI users] a question about mpirun

2011-07-07 Thread jody

Hi

It seems that you have mixed an "old" LAM-MPI installation with OpenMPI.


To make sure your OpenMPI installation is ok you could try to use the
complete path to mpirun:
  /data1/cluster/openmpi/bin/mpirun -np 1  /tmp/openmpi-1.4.3/examples/ring_c

You should also make sure that the compile-command is the one of
OpenMPI and not of LAM MPI.
( /data1/cluster/openmpi/bin/mpiCC or something like that)

Check your PATH environment variable to make sure it doesn't contain
any of  the LAM MPI directories,
and make sure you set the LD_LIBRARY_PATH variable correctly (see
http://www.open-mpi.org/faq/?category=running#run-prereqs)

Hope this helps
  Jody


On Thu, Jul 7, 2011 at 8:44 AM, zhuangchao  wrote:
> hello all :
>
>    I  installed  the openmpi-1.4.3  on redhat as the following step :
>
>    1.  ./configure  --prefix=/data1/cluster/openmpi
>
>    2.  make
>
>    3.  make  install
>
>    And  I   compiled  the  examples  of  openmpi-1.4.3  as the following
> step :
>
>    1. make
>
>     Then  I   run   the example :
>
>     ./mpirun  -np 1  /tmp/openmpi-1.4.3/examples/ring_c
>
>     I  get  the following  error :
>
> -
> It seems that there is no lamd running on the host node1.
> This indicates that the LAM/MPI runtime environment is not operating.
> The LAM/MPI runtime environment is necessary for MPI programs to run
> (the MPI program tired to invoke the "MPI_Init" function).
>
> Please run the "lamboot" command the start the LAM/MPI runtime
> environment.  See the LAM/MPI documentation for how to invoke
> "lamboot" across multiple machines.
> -
>
>    I   run openmpi , but  I  get  the error from lam-mpi .  why ?
> Can  you  help me ?
>
>    Thank you !

Re: [OMPI users] data types and alignment to word boundary

2011-06-30 Thread jody

OOOps - i did not intend to cause any heart attacks =:)

Perhaps my reaction was a bit exaggerated, but i spent quite some time
figuring out why i didn't receive the same numbers i sent off.
And, after reading section 3.1 of the MPI complete reference, i must say
that i would have been warned if i had read that chapter more carefully...

Fortunately, i don't have to send around a lot of these structs,
so i will do the padding (using the offsetof macro Dave recommended).

Thanks again
  Jody
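
For the archive, a minimal sketch of the offsetof approach mentioned
above, using the struct from the original post (SHORT_INPUT=64,
NUM_GPARAMS=4). The helper names are made up for illustration. The idea
is to ask the compiler where each member really lives instead of
computing displacements by hand as iShortSize and iShortSize+SHORT_INPUT,
which ignores the padding in front of the double array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SHORT_INPUT 64
#define NUM_GPARAMS 4

typedef struct {
    short  iSpeciesID;
    char   sCapacityFile[SHORT_INPUT];
    double adGParams[NUM_GPARAMS];
} tVStruct;

/* Hypothetical helper: store the compiler's actual member offsets in
 * disps[0..2] and return sizeof(tVStruct), trailing padding included. */
size_t vstruct_layout(size_t disps[3]) {
    assert(disps != NULL);
    disps[0] = offsetof(tVStruct, iSpeciesID);
    disps[1] = offsetof(tVStruct, sCapacityFile);
    disps[2] = offsetof(tVStruct, adGParams);
    return sizeof(tVStruct);
}

/* Print the real layout next to the "packed" size that ignores padding. */
void print_vstruct_layout(void) {
    size_t disps[3];
    size_t total  = vstruct_layout(disps);
    size_t packed = sizeof(short) + SHORT_INPUT + NUM_GPARAMS * sizeof(double);
    printf("offs iSpeciesID:    %zu\n", disps[0]);
    printf("offs sCapacityFile: %zu\n", disps[1]);
    printf("offs adGParams:     %zu\n", disps[2]);
    printf("total size:         %zu (packed: %zu)\n", total, packed);
}
```

The three offsets are what would go into the MPI_Aint displacement array
passed to MPI_Type_create_struct. The exact numbers depend on machine and
compiler, as Gus notes below. If arrays of the struct are sent, the
extent of the MPI type may additionally have to be set to sizeof(tVStruct),
for example with MPI_Type_create_resized.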


On Wed, Jun 29, 2011 at 9:52 PM, Gus Correa  wrote:
> Hi Jody
>
> jody wrote:
>>
>> Guys - Thank You for your replies!
>> (wow : that was a rhyme! :) )
>>
>> I checked my structure with the offsetof macro on my laptop at home
>> and found the following offsets:
>> offs iSpeciesID:  0
>> offs sCapacityFile:  2
>> offs adGParams:  68
>> total size             100
>> so there seems to be a 2 byte gap before the double array;
>> and this machine seems to  prefer multiples of 4.
>
> A 32-bit laptop perhaps?
> I would guess the offsets are machine and compiler dependent,
> and optimization flags may matter.
>
>>
>> But is this alignment problem not also a danger for heterogeneous clusters
>> using OpenMPI?
>
> Do you mean danger or excitement?  :)
> If the doubles and shorts and long longs have different sizes on
> each of two heterogeneous nodes, what could MPI do about them anyway?
>
>> I guess the only portable solution is to forget about MPI Data types and
>>  somehow pack or serialize the data before sending and unpack/deserialize
>> after receiving it.
>>
>
> Jody:
> Jeff may have a heart attack when he reads what you just wrote about
> the usefulness of MPI data types vs. packing/unpacking.  :)
>
> Guessing away, I would think you are focusing on memory/space savings,
> rather than on performance.
> Maybe memory/space savings is part of your code requirements.
>
> However, have you tried instead to explicitly pad your structure,
> say, to a multiple of the size of your largest intrinsic type,
> which is double in your case, or perhaps to a multiple of the natural
> memory alignment boundary that your computer/compiler likes (which may
> be 8 bytes, 16 bytes, 128 bytes, whatever)?
> I never did this comparison, but I would guess the padded version
> of the code would run faster (if compiled with '-align' type of flag
> and friends).
>
> Anyway, C is a foreign language here, I must say.
>
> Just my unwarranted guesses.
>
> Gus Correa
>
>>
>>
>> On Wed, Jun 29, 2011 at 6:18 PM, Gus Correa  wrote:
>>>
>>> jody wrote:
>>>>
>>>> Hi
>>>>
>>>> I have noticed on my machine that a struct which i have defined as
>>>>
>>>> typedef struct {
>>>>   short  iSpeciesID;
>>>>   char   sCapacityFile[SHORT_INPUT];
>>>>   double adGParams[NUM_GPARAMS];
>>>> } tVStruct;
>>>>
>>>> (where SHORT_INPUT=64 and NUM_GPARAMS=4)
>>>>
>>>> has size 104 (instead of 98) whereas the corresponding MPI Datatype i
>>>> created
>>>>
>>>>   int aiLengthsT5[3]       = {1, SHORT_INPUT, NUM_GPARAMS};
>>>>   MPI_Aint aiDispsT5[3]    = {0, iShortSize, iShortSize+SHORT_INPUT};
>>>>   MPI_Datatype aTypesT5[3] = {MPI_UNSIGNED_SHORT, MPI_CHAR, MPI_DOUBLE};
>>>>   MPI_Type_create_struct(3, aiLengthsT5, aiDispsT5, aTypesT5,
>>>> &m_dtVegetationData3);
>>>>   MPI_Type_commit(&m_dtVegetationData3);
>>>>
>>>> only has length 98 (as expected). The size differences resulted in an
>>>> error when doing
>>>>
>>>>   tVegetationData3 VD;
>>>>   MPI_Send(&VD, 1, m_dtVegetationData3, 1, TAG_STEP_CMD,
>>>> MPI_COMM_WORLD);
>>>>
>>>> and the corresponding
>>>>
>>>>   tVegetationData3 VD;
>>>>   MPI_Recv(&VD, 1, m_dtVegetationData3, MPI_ANY_SOURCE,
>>>> TAG_STEP_CMD, MPI_COMM_WORLD, &st);
>>>>
>>>> (in fact, the last double in my array was not transmitted correctly)
>>>>
>>>> It seems that on my machine the struct was padded to a multiple of 8.
>>>> By manually adding some padding bytes to my MPI Datatype in order
>>>> to fill it up to the next multiple of 8 i could work around this
>>>> problem.
>>>> (not very nice, and very probably not portable)
>>>>
>>>>
>>>> My question: is there a way to tell MPI to automatically use the

Re: [OMPI users] data types and alignment to word boundary

2011-06-29 Thread jody

Guys - Thank You for your replies!
(wow : that was a rhyme! :) )

I checked my structure with the offsetof macro on my laptop at home
and found the following offsets:
offs iSpeciesID:  0
offs sCapacityFile:  2
offs adGParams:  68
total size 100
so there seems to be a 2 byte gap before the double array;
and this machine seems to  prefer multiples of 4.

But is this alignment problem not also a danger for heterogeneous clusters
using OpenMPI?
I guess the only portable solution is to forget about MPI Data types and
 somehow pack or serialize the data before sending and unpack/deserialize
after receiving it.

Jody


On Wed, Jun 29, 2011 at 6:18 PM, Gus Correa  wrote:
> jody wrote:
>>
>> Hi
>>
>> I have noticed on my machine that a struct which i have defined as
>>
>> typedef struct {
>>    short  iSpeciesID;
>>    char   sCapacityFile[SHORT_INPUT];
>>    double adGParams[NUM_GPARAMS];
>> } tVStruct;
>>
>> (where SHORT_INPUT=64 and NUM_GPARAMS=4)
>>
>> has size 104 (instead of 98) whereas the corresponding MPI Datatype i
>> created
>>
>>    int aiLengthsT5[3]       = {1, SHORT_INPUT, NUM_GPARAMS};
>>    MPI_Aint aiDispsT5[3]    = {0, iShortSize, iShortSize+SHORT_INPUT};
>>    MPI_Datatype aTypesT5[3] = {MPI_UNSIGNED_SHORT, MPI_CHAR, MPI_DOUBLE};
>>    MPI_Type_create_struct(3, aiLengthsT5, aiDispsT5, aTypesT5,
>> &m_dtVegetationData3);
>>    MPI_Type_commit(&m_dtVegetationData3);
>>
>> only has length 98 (as expected). The size differences resulted in an
>> error when doing
>>
>>    tVegetationData3 VD;
>>    MPI_Send(&VD, 1, m_dtVegetationData3, 1, TAG_STEP_CMD, MPI_COMM_WORLD);
>>
>> and the corresponding
>>
>>    tVegetationData3 VD;
>>    MPI_Recv(&VD, 1, m_dtVegetationData3, MPI_ANY_SOURCE,
>> TAG_STEP_CMD, MPI_COMM_WORLD, &st);
>>
>> (in fact, the last double in my array was not transmitted correctly)
>>
>> It seems that on my machine the struct was padded to a multiple of 8.
>> By manually adding some padding bytes to my MPI Datatype in order
>> to fill it up to the next multiple of 8 i could work around this problem.
>> (not very nice, and very probably not portable)
>>
>>
>> My question: is there a way to tell MPI to automatically use the
>> required padding?
>>
>>
>> Thank You
>>  Jody
>
>
> Hi Jody
>
> My naive guesses:
>
> I think when you create the MPI structure you can pass the
> byte displacement of each structure component.
> You would need to modify your aiDispsT5[3], to match the
> actual memory alignment, I guess.
> Yes, indeed portability may be sacrificed.
>
> There is some clarification in "MPI, The Complete Reference, Vol 1,
> 2nd Ed, Marc Snir et al.".
> Section 3.2 and 3.3 (general on type map & type signature).
> Section 3.4.8 MPI_Type_create_struct (examples, specially 3.13).
> Section 3.10, on portability, doesn't seem to guarantee portability of
> MPI_Type_Struct.
>
> I hope this helps,
> Gus Correa

[OMPI users] data types and alignment to word boundary

2011-06-29 Thread jody

Hi

I have noticed on my machine that a struct which i have defined as

typedef struct {
short  iSpeciesID;
char   sCapacityFile[SHORT_INPUT];
double adGParams[NUM_GPARAMS];
} tVStruct;

(where SHORT_INPUT=64 and NUM_GPARAMS=4)

has size 104 (instead of 98) whereas the corresponding MPI Datatype i created

int aiLengthsT5[3]   = {1, SHORT_INPUT, NUM_GPARAMS};
MPI_Aint aiDispsT5[3]= {0, iShortSize, iShortSize+SHORT_INPUT};
MPI_Datatype aTypesT5[3] = {MPI_UNSIGNED_SHORT, MPI_CHAR, MPI_DOUBLE};
MPI_Type_create_struct(3, aiLengthsT5, aiDispsT5, aTypesT5,
&m_dtVegetationData3);
MPI_Type_commit(&m_dtVegetationData3);

only has length 98 (as expected). The size differences resulted in an
error when doing

tVegetationData3 VD;
MPI_Send(&VD, 1, m_dtVegetationData3, 1, TAG_STEP_CMD, MPI_COMM_WORLD);

and the corresponding

tVegetationData3 VD;
MPI_Recv(&VD, 1, m_dtVegetationData3, MPI_ANY_SOURCE,
TAG_STEP_CMD, MPI_COMM_WORLD, &st);

(in fact, the last double in my array was not transmitted correctly)

It seems that on my machine the struct was padded to a multiple of 8.
By manually adding some padding bytes to my MPI Datatype in order
to fill it up to the next multiple of 8 i could work around this problem.
(not very nice, and very probably not portable)


My question: is there a way to tell MPI to automatically use the
required padding?


Thank You
  Jody

Re: [OMPI users] problems with the -xterm option

2011-05-03 Thread jody

Launching xterm by mpirun onto a remote platform without a command
simply opens an xterm window which sits there until you type exit into it
or close it by clicking the frame's close button
(of course only if the display is forwarded to the local machine).



On Mon, May 2, 2011 at 4:30 PM, Ralph Castain  wrote:
>
> On May 2, 2011, at 8:21 AM, jody wrote:
>
>> Hi
>> Well, the difference is that one time i call the application
>> 'HelloMPI' with the '--xterm' option,
>> whereas in my previous mail i am calling the application 'xterm'
>> (without the '--xterm' option)
>
> Ah, well that might explain it. I don't know how xterm would react to just 
> being launched by mpirun onto a remote platform without any command to run. I 
> can't explain what the plm verbosity has to do with anything, though.
>
>> Jody
>>
>> On Mon, May 2, 2011 at 4:08 PM, Ralph Castain  wrote:
>>>
>>> On May 2, 2011, at 7:56 AM, jody wrote:
>>>
>>>> Hi Ralph
>>>>
>>>> Thank You for doing the fix.
>>>>
>>>> Do you perhaps also have an idea what is going on when i try to start
>>>> xterm (or probably another X application) on a remote host?
>>>> In this case it is not enough to specify the '--leave-session-attached' 
>>>> option.
>>>>
>>>> These calls won't open any xterms
>>>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>>>> plm_base_verbose 1 xterm
>>>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"
>>>> --leave-session-attached xterm
>>>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>>>> odls_base_verbose 5 xterm
>>>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>>>> odls_base_verbose 5 --leave-session-attached xterm
>>>>
>>>> But this will open the xterms:
>>>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>>>> plm_base_verbose 1  --leave-session-attached xterm
>>>>
>>>> Any verbosity level > 0 will open xterms, but with ' -mca
>>>> plm_base_verbose 0' there are again no xterms.
>>>>
>>>
>>> No earthly idea...this seems to contradict what you had below. You said you 
>>> were seeing the xterms with this cmd line:
>>>
>>>>>> I just found that everything works as expected if i use the
>>>>>> '--leave-session-attached' option (without the debug options):
>>>>>>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>>>>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>>>>>> ./HelloMPI
>>>>>> The xterms are also opened if i do not use the '!' hold option.
>>>>>
>>>
>>> Did I miss something?
>>>
>>>
>>>> Thank You
>>>>  Jody
>>>>
>>>> On Mon, May 2, 2011 at 2:29 PM, Ralph Castain  wrote:
>>>>>
>>>>> On May 2, 2011, at 2:34 AM, jody wrote:
>>>>>
>>>>>> Hi Ralph
>>>>>>
>>>>>> I rebuilt open MPI 1.4.2 with the debug option on both chefli and 
>>>>>> squid_0.
>>>>>> The results are interesting!
>>>>>>
>>>>>> I wrote a small HelloMPI app which basically calls usleep for a pause
>>>>>> of 5 seconds.
>>>>>>
>>>>>> Now calling it as i did before, no MPI errors appear anymore, only the
>>>>>> display problems:
>>>>>>  jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
>>>>>> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:10.0
>>>>>>
>>>>>> When i do the same call *with* the debug option, the xterm appears and
>>>>>> shows the output of HelloMPI!
>>>>>> I attach the output in ompidbg_1.txt (It also works if i call with
>>>>>> '-np 4' and '--xterm 0,1,2,3').
>>>>>
>>>>> Good!
>>>>>
>>>>>>
>>>>>> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
>>>>>>
>>>>>> If i use the hold-option, the xterm appears with the outpu

Re: [OMPI users] problems with the -xterm option

2011-05-02 Thread jody

Hi
Well, the difference is that one time i call the application
'HelloMPI' with the '--xterm' option,
whereas in my previous mail i am calling the application 'xterm'
(without the '--xterm' option)

Jody

On Mon, May 2, 2011 at 4:08 PM, Ralph Castain  wrote:
>
> On May 2, 2011, at 7:56 AM, jody wrote:
>
>> Hi Ralph
>>
>> Thank You for doing the fix.
>>
>> Do you perhaps also have an idea what is going on when i try to start
>> xterm (or probably another X application) on a remote host?
>> In this case it is not enough to specify the '--leave-session-attached' 
>> option.
>>
>> These calls won't open any xterms
>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>> plm_base_verbose 1 xterm
>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"
>> --leave-session-attached xterm
>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>> odls_base_verbose 5 xterm
>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>> odls_base_verbose 5 --leave-session-attached xterm
>>
>> But this will open the xterms:
>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>> plm_base_verbose 1  --leave-session-attached xterm
>>
>> Any verbosity level > 0 will open xterms, but with ' -mca
>> plm_base_verbose 0' there are again no xterms.
>>
>
> No earthly idea...this seems to contradict what you had below. You said you 
> were seeing the xterms with this cmd line:
>
>>>> I just found that everything works as expected if i use the
>>>> '--leave-session-attached' option (without the debug options):
>>>>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>>>> ./HelloMPI
>>>> The xterms are also opened if i do not use the '!' hold option.
>>>
>
> Did I miss something?
>
>
>> Thank You
>>  Jody
>>
>> On Mon, May 2, 2011 at 2:29 PM, Ralph Castain  wrote:
>>>
>>> On May 2, 2011, at 2:34 AM, jody wrote:
>>>
>>>> Hi Ralph
>>>>
>>>> I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0.
>>>> The results are interesting!
>>>>
>>>> I wrote a small HelloMPI app which basically calls usleep for a pause
>>>> of 5 seconds.
>>>>
>>>> Now calling it as i did before, no MPI errors appear anymore, only the
>>>> display problems:
>>>>  jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
>>>> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:10.0
>>>>
>>>> When i do the same call *with* the debug option, the xterm appears and
>>>> shows the output of HelloMPI!
>>>> I attach the output in ompidbg_1.txt (It also works if i call with
>>>> '-np 4' and '--xterm 0,1,2,3').
>>>
>>> Good!
>>>
>>>>
>>>> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
>>>>
>>>> If i use the hold-option, the xterm appears with the output of
>>>> 'hostname' (cf. ompidbg_3.txt).
>>>> The xterm opens after the line "launch complete for job..." has been
>>>> written (line 59)
>>>
>>> Okay, that's also expected. Like I said, without the "hold", the output is 
>>> generated so quickly that the window just flashes at best. I've had similar 
>>> experiences - hence the "hold" option.
>>>
>>>>
>>>> I just found that everything works as expected if i use the
>>>> '--leave-session-attached' option (without the debug options):
>>>>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>>>> ./HelloMPI
>>>> The xterms are also opened if i do not use the '!' hold option.
>>>
>>> Okay, I can understand why. The --leave-session-attached option just tells 
>>> mpirun to not daemonize the backend daemons - thus leaving the ssh session 
>>> alive. The debug options do the same thing, but turn on all the debug 
>>> output.
>>>
>>> The problem is that if you don't leave the ssh sessi

Re: [OMPI users] problems with the -xterm option

2011-05-02 Thread jody

Hi Ralph

Thank You for doing the fix.

Do you perhaps also have an idea what is going on when i try to start
xterm (or probably another X application) on a remote host?
In this case it is not enough to specify the '--leave-session-attached' option.

These calls won't open any xterms
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
plm_base_verbose 1 xterm
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"
--leave-session-attached xterm
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
odls_base_verbose 5 xterm
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
odls_base_verbose 5 --leave-session-attached xterm

But this will open the xterms:
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
plm_base_verbose 1  --leave-session-attached xterm

Any verbosity level > 0 will open xterms, but with ' -mca
plm_base_verbose 0' there are again no xterms.

Thank You
  Jody

On Mon, May 2, 2011 at 2:29 PM, Ralph Castain  wrote:
>
> On May 2, 2011, at 2:34 AM, jody wrote:
>
>> Hi Ralph
>>
>> I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0.
>> The results are interesting!
>>
>> I wrote a small HelloMPI app which basically calls usleep for a pause
>> of 5 seconds.
>>
>> Now calling it as i did before, no MPI errors appear anymore, only the
>> display problems:
>>  jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
>> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
>>  /usr/bin/xterm Xt error: Can't open display: localhost:10.0
>>
>> When i do the same call *with* the debug option, the xterm appears and
>> shows the output of HelloMPI!
>> I attach the output in ompidbg_1.txt (It also works if i call with
>> '-np 4' and '--xterm 0,1,2,3').
>
> Good!
>
>>
>> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
>>
>> If i use the hold-option, the xterm appears with the output of
>> 'hostname' (cf. ompidbg_3.txt).
>> The xterm opens after the line "launch complete for job..." has been
>> written (line 59)
>
> Okay, that's also expected. Like I said, without the "hold", the output is 
> generated so quickly that the window just flashes at best. I've had similar 
> experiences - hence the "hold" option.
>
>>
>> I just found that everything works as expected if i use the
>> '--leave-session-attached' option (without the debug options):
>>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>> ./HelloMPI
>> The xterms are also opened if i do not use the '!' hold option.
>
> Okay, I can understand why. The --leave-session-attached option just tells 
> mpirun to not daemonize the backend daemons - thus leaving the ssh session 
> alive. The debug options do the same thing, but turn on all the debug output.
>
> The problem is that if you don't leave the ssh session alive, then the xterm 
> has no way back to your screen. By daemonizing, we sever that connection.
>
> What I should do (and maybe used to do, but it got removed) is automatically 
> turn "on" the leave-session-attached option if you give --xterm. I can enter 
> that patch.
>
> Note that this does limit the size of the launch to the number of ssh 
> sessions the system allows you to have open at the same time. We default to a 
> limit of 128 nodes, which is likely adequate for an xterm-based debugging 
> session. However, you can increase it using an mca param (see ompi_info) to 
> as high as the system allows.
>
> Thanks for helping debug this! I'll add you to the patch list so you can 
> track it.
>
>>
>> What does *not* work is
>>  jody@aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
>> plm_rsh_agent "ssh -Y"  --leave-session-attached  xterm
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>
>> But then again, this call works (i.e. an xterm is opened) if all the
>> debug-options are used (ompidbg_4.txt).
>> Here the '--leave-session-attached' is necessary - without it, no xterm.
>>
>> From these results i would say that there is no basic mishandling of
>> 'ssh', though i have no idea
>> what internal differences the use of the '--leave-session-attached'
>> option or the debug options make.
>>
>> I hope these

Re: [OMPI users] problems with the -xterm option

2011-05-02 Thread jody

Hi Ralph

I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0.
The results are interesting!

I wrote a small HelloMPI app which basically calls usleep for a pause
of 5 seconds.

Now calling it as i did before, no MPI errors appear anymore, only the
display problems:
  jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
  /usr/bin/xterm Xt error: Can't open display: localhost:10.0

When i do the same call *with* the debug option, the xterm appears and
shows the output of HelloMPI!
I attach the output in ompidbg_1.txt (it also works if i call with
'-np 4' and '--xterm 0,1,2,3').

Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).

If i use the hold-option, the xterm appears with the output of
'hostname' (cf. ompidbg_3.txt).
The xterm opens after the line "launch complete for job..." has been
written (line 59)

I just found that everything works as expected if i use the
'--leave-session-attached' option (without the debug options):
  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
./HelloMPI
The xterms are also opened if i do not use the '!' hold option.
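
To keep the working combination of options in one place, the call could be
wrapped as in the sketch below. The helper name `xterm_cmd` is made up for
illustration, and it only prints the command line instead of running it; all
options are the ones used in the call above.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the mpirun call that worked above.
# $1 = number of processes, $2 = host, $3 = rank list for --xterm, $4 = program
xterm_cmd() {
    printf 'mpirun -np %s -host %s -mca plm_rsh_agent "ssh -Y" --leave-session-attached --xterm %s %s\n' \
        "$1" "$2" "$3" "$4"
}

# Example: four ranks on squid_0, with the '!' hold option as in the call above
xterm_cmd 4 squid_0 '0,1,2,3!' ./HelloMPI
```

The rank list is single-quoted so that an interactive shell does not treat
the '!' as a history expansion.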

What does *not* work is
  jody@aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
plm_rsh_agent "ssh -Y"  --leave-session-attached  xterm
  xterm Xt error: Can't open display:
  xterm:  DISPLAY is not set
  xterm Xt error: Can't open display:
  xterm:  DISPLAY is not set

But then again, this call works (i.e. an xterm is opened) if all the
debug-options are used (ompidbg_4.txt).
Here the '--leave-session-attached' is necessary - without it, no xterm.

From these results i would say that there is no basic mishandling of
'ssh', though i have no idea
what internal differences the use of the '--leave-session-attached'
option or the debug options make.

I hope these observations are helpful
  Jody


On Fri, Apr 29, 2011 at 12:08 AM, jody  wrote:
> Hi Ralph
>
> Thank you for your suggestions.
> I'll be happy to help  you.
> I'm not sure if i'll get around to this tomorrow,
> but i certainly will do so on Monday.
>
> Thanks
>  Jody
>
> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain  wrote:
>> Hi Jody
>>
>> I'm not sure when I'll get a chance to work on this - got a deadline to 
>> meet. I do have a couple of suggestions, if you wouldn't mind helping debug 
>> the problem?
>>
>> It looks to me like the problem is that mpirun is crashing or terminating 
>> early for some reason - hence the failures to send msgs to it, and the 
>> "lifeline lost" error that leads to the termination of the daemon. If you 
>> build a debug version of the code (i.e., --enable-debug on configure), you 
>> can get a lot of debug info that traces the behavior.
>>
>> If you could then run your program with
>>
>>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>
>> and send it to me, we'll see what ORTE thinks it is doing.
>>
>> You could also take a look at the code for implementing the xterm option. 
>> You'll find it in
>>
>> orte/mca/odls/base/odls_base_default_fns.c
>>
>> around line 1115. The xterm command syntax is defined in
>>
>> orte/mca/odls/base/odls_base_open.c
>>
>> around line 233 and following. Note that we use "xterm -T" as the cmd. 
>> Perhaps you can spot an error in the way we treat xterm?
>>
>> Also, remember that you have to specify that you want us to "hold" the xterm 
>> window open even after the process terminates. If you don't specify it, the 
>> window automatically closes upon completion of the process. So a 
>> fast-running cmd like "hostname" might disappear so quickly that it causes a 
>> race condition problem.
>>
>> You might want to try a spinner application - i.e., output something and 
>> then sit in a loop or sleep for some period of time. Or, use the "hold" 
>> option to keep the window open - you designate "hold" by putting a '!' 
>> before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>>
>>
>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>
>>> Hi
>>>
>>> Unfortunately this does not solve my problem.
>>> While i can do
>>>  ssh -Y squid_0 xterm
>>> and this will open an xterm on my machine (chefli),
>>> i run into problems with the -xterm option of openmpi:
>>>
>>>  jody@chefli ~/share/neander $ mp

Re: [OMPI users] problems with the -xterm option

2011-04-28 Thread jody

Hi Ralph

Thank you for your suggestions.
I'll be happy to help  you.
I'm not sure if i'll get around to this tomorrow,
but i certainly will do so on Monday.

Thanks
  Jody

On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain  wrote:
> Hi Jody
>
> I'm not sure when I'll get a chance to work on this - got a deadline to meet. 
> I do have a couple of suggestions, if you wouldn't mind helping debug the 
> problem?
>
> It looks to me like the problem is that mpirun is crashing or terminating 
> early for some reason - hence the failures to send msgs to it, and the 
> "lifeline lost" error that leads to the termination of the daemon. If you 
> build a debug version of the code (i.e., --enable-debug on configure), you 
> can get a lot of debug info that traces the behavior.
>
> If you could then run your program with
>
>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>
> and send it to me, we'll see what ORTE thinks it is doing.
>
> You could also take a look at the code for implementing the xterm option. 
> You'll find it in
>
> orte/mca/odls/base/odls_base_default_fns.c
>
> around line 1115. The xterm command syntax is defined in
>
> orte/mca/odls/base/odls_base_open.c
>
> around line 233 and following. Note that we use "xterm -T" as the cmd. 
> Perhaps you can spot an error in the way we treat xterm?
>
> Also, remember that you have to specify that you want us to "hold" the xterm 
> window open even after the process terminates. If you don't specify it, the 
> window automatically closes upon completion of the process. So a fast-running 
> cmd like "hostname" might disappear so quickly that it causes a race 
> condition problem.
>
> You might want to try a spinner application - i.e., output something and 
> then sit in a loop or sleep for some period of time. Or, use the "hold" 
> option to keep the window open - you designate "hold" by putting a '!' before 
> the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>
>
> On Apr 28, 2011, at 8:38 AM, jody wrote:
>
>> Hi
>>
>> Unfortunately this does not solve my problem.
>> While i can do
>>  ssh -Y squid_0 xterm
>> and this will open an xterm on my machine (chefli),
>> i run into problems with the -xterm option of openmpi:
>>
>>  jody@chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
>> -Y" -host squid_0 --xterm 1 hostname
>>  squid_0
>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>> lifeline [[35219,0],0] lost
>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>> lifeline [[35219,0],0] lost
>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>
>> By the way when i look at the DISPLAY variable in the xterm window
>> opened via squid_0,
>> i also have the display variable "localhost:11.0"
>>
>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>> the lines with the warnings about "xauth" and "untrusted X" do not
>> appear:
>>
>>  jody@chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>  squid_0
>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>> lifeline [[34926,0],0] lost
>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>> lifeline [[34926,0],0] lost
>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>
>>
>> I have doubts that the "-Y" is passed correctly:
>>   jody@triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
>> -Y" -host squid_0 xterm
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>
>>
>> ---> as a matte

Re: [OMPI users] problems with the -xterm option

2011-04-28 Thread jody

Hi

Unfortunately this does not solve my problem.
While i can do
  ssh -Y squid_0 xterm
and this will open an xterm on my machine (chefli),
i run into problems with the -xterm option of openmpi:

  jody@chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
-Y" -host squid_0 --xterm 1 hostname
  squid_0
  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
[sd = 8]
  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
lifeline [[35219,0],0] lost
  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
[sd = 8]
  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
lifeline [[35219,0],0] lost
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0

By the way when i look at the DISPLAY variable in the xterm window
opened via squid_0,
i also have the display variable "localhost:11.0"

Actually, the difference with using the "-mca plm_rsh_agent" is that
the lines with the warnings about "xauth" and "untrusted X" do not
appear:

  jody@chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  squid_0
  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
[sd = 8]
  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
lifeline [[34926,0],0] lost
  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
[sd = 8]
  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
lifeline [[34926,0],0] lost
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0


I have doubts that the "-Y" is passed correctly:
   jody@triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
-Y" -host squid_0 xterm
  xterm Xt error: Can't open display:
  xterm:  DISPLAY is not set
  xterm Xt error: Can't open display:
  xterm:  DISPLAY is not set


---> as a matter of fact i noticed that the xterm option doesn't work locally:
  mpirun -np 4 -xterm 1 /usr/bin/printenv
prints everything onto the console.

Do you have any other suggestions i could try?

Thank You
 Jody

On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain  wrote:
> Should be able to just set
>
> -mca plm_rsh_agent "ssh -Y"
>
> on your cmd line, I believe
>
> On Apr 28, 2011, at 12:53 AM, jody wrote:
>
>> Hi Ralph
>>
>> Is there an easy way i could modify the OpenMPI code so that it would use
>> the -Y option for ssh when connecting to remote machines?
>>
>> Thank You
>>   Jody
>>
>> On Thu, Apr 7, 2011 at 4:01 PM, jody  wrote:
>>> Hi Ralph
>>> thank you for your suggestions. After some fiddling, i found that after my
>>> last update (gentoo) my sshd_config had been overwritten
>>> (X11Forwarding was set to 'no').
>>>
>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>> and with 'ssh -X'
>>> (but with '-X' i still get those xauth warnings)
>>>
>>> But the xterm option still doesn't work:
>>>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>> printenv | grep WORLD_RANK
>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not 
>>> generated
>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>  OMPI_COMM_WORLD_RANK=0
>>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>> lifeline [[54132,0],0] lost
>>>
>>> So it looks like the two processes from squid_0 can't open the display this 
>>> way,
>>> but one of them writes the output to the console...
>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' 
>>> the
>>> DISPLAY variable is set to 'localhost:10.0'
>>>
>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>
>>> Thank You
>>>  Jody
>>>
>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain  wrote:
>>>> Here's a little more info - it's for Cygwin, but I don't see a

Re: [OMPI users] problems with the -xterm option

2011-04-28 Thread jody

Hi Ralph

Is there an easy way i could modify the OpenMPI code so that it would use
the -Y option for ssh when connecting to remote machines?

Thank You
   Jody

On Thu, Apr 7, 2011 at 4:01 PM, jody  wrote:
> Hi Ralph
> thank you for your suggestions. After some fiddling, i found that after my
> last update (gentoo) my sshd_config had been overwritten
> (X11Forwarding was set to 'no').
>
> After correcting that, i can now open remote terminals with 'ssh -Y'
> and with 'ssh -X'
> (but with '-X' i still get those xauth warnings)
>
> But the xterm option still doesn't work:
>  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
> printenv | grep WORLD_RANK
>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>  OMPI_COMM_WORLD_RANK=0
>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
> lifeline [[54132,0],0] lost
>
> So it looks like the two processes from squid_0 can't open the display this 
> way,
> but one of them writes the output to the console...
> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
> DISPLAY variable is set to 'localhost:10.0'
>
> So in what way would OMPI have to be adapted, so -xterm would work?
>
> Thank You
>  Jody
>
> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain  wrote:
>> Here's a little more info - it's for Cygwin, but I don't see anything
>> Cygwin-specific in the answers:
>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>
>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>
>> Sorry Jody - I should have read your note more carefully to see that you
>> already tried -Y. :-(
>> Not sure what to suggest...
>>
>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>
>> Like I said, I'm not expert. However, a quick "google" of revealed this
>> result:
>>
>> When trying to set up x11 forwarding over an ssh session to a remote server
>> with the -X switch, I was getting an error like Warning: No xauth
>> data; using fake authentication data for X11 forwarding.
>>
>> When doing something like:
>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>> got an error message like:
>>
>>
>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>> [root@RHEL ~]#
>> and any X programs I ran would not display on my local system..
>>
>> Turns out the solution is to use the -Y switch instead.
>>
>> ssh -Yl root 10.1.1.9
>>
>> and that worked fine.
>>
>> See if that works for you - if it does, we may have to modify OMPI to
>> accommodate.
>>
>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>
>> Hi Ralph
>> No, after the above error message mpirun has exited.
>>
>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>
>>  jody@chefli ~/share/neander $ ssh -Y squid_0
>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>  jody@squid_0 ~ $ xterm
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>  jody@squid_0 ~ $ xterm
>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>  jody@squid_0 ~ $ xterm
>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>  jody@squid_0 ~ $ exit
>>  logout
>>
>> same thing with ssh -X, but here i get the same warning/error message
>> as with mpirun:
>>
>>  jody@chefli ~/share/neander $ ssh -X squid_0
>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>> generated
>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>
>> So perhaps the whole problem is linked to that xauth-thing.
>> Do you have a suggestion how this can be solved?
>>
>>

Re: [OMPI users] problems with the -xterm option

2011-04-07 Thread jody

Hi Ralph
thank you for your suggestions. After some fiddling, i found that after my
last update (gentoo) my sshd_config had been overwritten
(X11Forwarding was set to 'no').
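
Since a system update can silently replace sshd_config, a quick check of the
effective setting may save some fiddling. The sketch below assumes the usual
sshd_config semantics (keywords are case-insensitive, the first value found
wins, and X11Forwarding defaults to 'no' when absent); the function name
`x11_forwarding` and the path in the usage comment are assumptions.

```shell
#!/bin/sh
# Hypothetical check: print the X11Forwarding value from an sshd_config file.
# $1 = path to the config file
x11_forwarding() {
    # Commented-out lines start with '#', so their first field never matches.
    awk 'tolower($1) == "x11forwarding" { print tolower($2); found = 1; exit }
         END { if (!found) print "no" }' "$1"
}

# Usage (path is the common location, adjust as needed):
#   x11_forwarding /etc/ssh/sshd_config
```

sshd has to be restarted after the file is changed for the new value to take
effect.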

After correcting that, i can now open remote terminals with 'ssh -Y'
and with 'ssh -X'
(but with '-X' i still get those xauth warnings)

But the xterm option still doesn't work:
  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
printenv | grep WORLD_RANK
  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
  OMPI_COMM_WORLD_RANK=0
  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
[sd = 8]
  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
lifeline [[54132,0],0] lost

So it looks like the two processes from squid_0 can't open the display this way,
but one of them writes the output to the console...
Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
DISPLAY variable is set to 'localhost:10.0'
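
A plausible reading of the two numbers: sshd hands out one proxy display per
forwarded session, counting up from its X11DisplayOffset (10 by default), so
'localhost:10.0' and 'localhost:11.0' would simply belong to two different
ssh sessions. The sketch below splits a DISPLAY value and prints the TCP port
that goes with it (6000 plus the display number); the function name
`display_port` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map a DISPLAY value to its X11 TCP port.
# $1 = DISPLAY value such as "localhost:11.0" or ":0.0"
display_port() {
    d=${1#*:}      # drop the host part up to the colon -> "11.0"
    n=${d%%.*}     # drop the screen suffix             -> "11"
    echo $((6000 + n))
}

# Usage:
#   display_port "$DISPLAY"
```

An xterm started by the daemon can only use 'localhost:11.0' as long as the
ssh session that created this display is still open.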

So in what way would OMPI have to be adapted, so -xterm would work?

Thank You
  Jody

On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain  wrote:
> Here's a little more info - it's for Cygwin, but I don't see anything
> Cygwin-specific in the answers:
> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>
> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>
> Sorry Jody - I should have read your note more carefully to see that you
> already tried -Y. :-(
> Not sure what to suggest...
>
> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>
> Like I said, I'm not expert. However, a quick "google" of revealed this
> result:
>
> When trying to set up x11 forwarding over an ssh session to a remote server
> with the -X switch, I was getting an error like Warning: No xauth
> data; using fake authentication data for X11 forwarding.
>
> When doing something like:
> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
> got an error message like:
>
>
> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
> [root@RHEL ~]#
> and any X programs I ran would not display on my local system..
>
> Turns out the solution is to use the -Y switch instead.
>
> ssh -Yl root 10.1.1.9
>
> and that worked fine.
>
> See if that works for you - if it does, we may have to modify OMPI to
> accommodate.
>
> On Apr 6, 2011, at 9:19 AM, jody wrote:
>
> Hi Ralph
> No, after the above error message mpirun has exited.
>
> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>
>  jody@chefli ~/share/neander $ ssh -Y squid_0
>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>  jody@squid_0 ~ $ xterm
>  xterm Xt error: Can't open display:
>  xterm:  DISPLAY is not set
>  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>  jody@squid_0 ~ $ xterm
>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>  jody@squid_0 ~ $ xterm
>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>  jody@squid_0 ~ $ exit
>  logout
>
> same thing with ssh -X, but here i get the same warning/error message
> as with mpirun:
>
>  jody@chefli ~/share/neander $ ssh -X squid_0
>  Warning: untrusted X11 forwarding setup failed: xauth key data not
> generated
>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>
> So perhaps the whole problem is linked to that xauth-thing.
> Do you have a suggestion how this can be solved?
>
> Thank You
>  Jody
>
> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain  wrote:
>
> If I read your error messages correctly, it looks like mpirun is crashing -
> the daemon is complaining that it lost the socket connection back to mpirun,
> and hence will abort.
>
> Are you seeing mpirun still alive?
>
>
> On Apr 5, 2011, at 4:46 AM, jody wrote:
>
> Hi
>
> On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
> it works in "text-mode":
>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>  OMPI_COMM_WORLD_RANK=0
>  OMPI_COMM_WOR

Re: [OMPI users] problems with the -xterm option

2011-04-06 Thread jody

Hi Ralph
No, after the above error message mpirun has exited.

But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:

  jody@chefli ~/share/neander $ ssh -Y squid_0
  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
  jody@squid_0 ~ $ xterm
  xterm Xt error: Can't open display:
  xterm:  DISPLAY is not set
  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
  jody@squid_0 ~ $ xterm
  xterm Xt error: Can't open display: 130.60.126.74:0.0
  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
  jody@squid_0 ~ $ xterm
  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
  jody@squid_0 ~ $ exit
  logout

same thing with ssh -X, but here i get the same warning/error message
as with mpirun:

  jody@chefli ~/share/neander $ ssh -X squid_0
  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh

So perhaps the whole problem is linked to that xauth-thing.
Do you have a suggestion how this can be solved?

Thank You
  Jody

On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain  wrote:
> If I read your error messages correctly, it looks like mpirun is crashing - 
> the daemon is complaining that it lost the socket connection back to mpirun, 
> and hence will abort.
>
> Are you seeing mpirun still alive?
>
>
> On Apr 5, 2011, at 4:46 AM, jody wrote:
>
>> Hi
>>
>> On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
>> it works in "text-mode":
>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>  OMPI_COMM_WORLD_RANK=0
>>  OMPI_COMM_WORLD_RANK=1
>>  OMPI_COMM_WORLD_RANK=2
>>  OMPI_COMM_WORLD_RANK=3
>>
>> but when i use  the -xterm option to mpirun, it doesn't work
>>
>> $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep 
>> WORLD_RANK
>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>  OMPI_COMM_WORLD_RANK=0
>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
>> lifeline [[55607,0],0] lost
>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>
>> (strange: somebody wrote his message to the console)
>>
>> No matter whether i set the DISPLAY variable to the full hostname of
>> the workstation,
>> to the IP address of the workstation or simply to ":0.0", it doesn't work
>>
>> But i do have xauth data (as far as i know):
>> On the remote (squid_0):
>>  jody@squid_0 ~ $ xauth list
>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>
>> on the workstation:
>>  $ xauth list
>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1
>> 146c7f438fab79deb8a8a7df242b6f4b
>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>
>> In sshd_config on the workstation i have 'X11Forwarding yes'
>> I have also done
>>   xhost + squid_0
>> on the workstation.
>>
>>
>> How can i get the -xterm option running?
>>
>> Thank You
>>  Jody
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

[OMPI users] problems with the -xterm option

2011-04-05 Thread jody

Hi

On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
it works in "text-mode":
  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
  OMPI_COMM_WORLD_RANK=0
  OMPI_COMM_WORLD_RANK=1
  OMPI_COMM_WORLD_RANK=2
  OMPI_COMM_WORLD_RANK=3

but when i use  the -xterm option to mpirun, it doesn't work

 $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep WORLD_RANK
  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  OMPI_COMM_WORLD_RANK=0
  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
[sd = 8]
  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
lifeline [[55607,0],0] lost
  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0

(strange: somebody wrote his message to the console)

No matter whether i set the DISPLAY variable to the full hostname of
the workstation,
to the IP address of the workstation or simply to ":0.0", it doesn't work

But i do have xauth data (as far as i know):
On the remote (squid_0):
  jody@squid_0 ~ $ xauth list
  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b

on the workstation:
  $ xauth list
  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1
146c7f438fab79deb8a8a7df242b6f4b
  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
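
Comparing the two listings by eye is error-prone, so here is a small sketch
that pulls the cookie for one display name out of 'xauth list' output, which
makes the two sides easy to compare. The function name `cookie_for` is made
up for illustration; it only reads the three-column format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read 'xauth list' output on stdin and print the
# cookie (third column) of the entry whose display name equals $1.
cookie_for() {
    awk -v d="$1" '$1 == d { print $3; exit }'
}

# Usage, run on both machines and compare the output:
#   xauth list | cookie_for chefli/unix:10
```

If the function prints nothing, there is no entry for that display name.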

In sshd_config on the workstation i have 'X11Forwarding yes'
I have also done
   xhost + squid_0
on the workstation.


How can i get the -xterm option running?

Thank You
  Jody

Re: [OMPI users] WRF Problem running in Parallel

2011-02-22 Thread jody

Hi
At a first glance i would say this is not an OpenMPI problem,
but a wrf problem (though i must admit i have no knowledge whatsoever of wrf)

Have you tried running a single instance of wrf.exe?
Have you tried to run a simple application (like a "hello world") on your nodes?

Jody


On Tue, Feb 22, 2011 at 7:37 AM, Ahsan Ali  wrote:
> Hello,
>  I am stuck on a problem with running the Weather Research
> and Forecasting Model (WRFV 3.2.1). I get the following error while running
> with mpirun. Any help would be highly appreciated.
>
> [pmdtest@pmd02 em_real]$ mpirun -np 4 wrf.exe
> starting wrf task 0 of 4
> starting wrf task 1 of 4
> starting wrf task 3 of 4
> starting wrf task 2 of 4
> --
> mpirun noticed that process rank 3 with PID 6044 on node pmd02.pakmet.com
> exited on signal 11 (Segmentation fault).
>
>
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off  +92518358714
> Cell # +923155145014
>
>

Re: [OMPI users] calling a customized MPI_Allreduce with MPI_PACKED datatype

2011-02-06 Thread jody

Hi Massimo

Just to make sure: usually the MPI_ERR_TRUNCATE error is caused by
buffer sizes that are too small.
Can  you verify that the buffers you are using are large enough to
hold the data they should receive?

Jody


On Sat, Feb 5, 2011 at 6:37 PM, Massimo Cafaro
 wrote:
> Dear all,
>
> in one of my C codes developed using Open MPI v1.4.3 I need to call 
> MPI_Allreduce() passing as sendbuf and recvbuf arguments two MPI_PACKED 
> arrays. The reduction requires my own MPI_User_function that needs to  
> MPI_Unpack() its first and second argument, process them and finally 
> MPI_Pack() the result in the second argument.
>
> I need to use MPI_Pack/MPI_Unpack because I am not able to create a derived 
> datatype, since many data I need to send are dynamically allocated.
> However, the code fails at runtime with the following message:
>
> An error occurred in MPI_Unpack
> on communicator MPI_COMM_WORLD
> MPI_ERR_TRUNCATE: message truncated
> MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>
> I have verified that, after unpacking the data in my own reduction function, 
> all of the data are wrong.
> Is this possible in MPI? I did not find anything on the "MPI reference Volume 
> 1" and "Using MPI"  that prevents this. This should just require using as 
> datatype MPI_PACKED in MPI_Allreduce() . However, searching on the web I did 
> not find any examples.
>
> Thank you in advance for any clue/suggestions/source code examples.
> This is driving me crazy now ;-(
>
> Massimo Cafaro
>
>
> -
>
> ***
>
>  Massimo Cafaro, Ph.D.
>  Assistant Professor
>  Dept. of Engineering for Innovation
>  University of Salento, Lecce, Italy
>  Via per Monteroni
>  73100 Lecce, Italy
>  Voice/Fax  +39 0832 297371
>  Web     http://sara.unisalento.it/~cafaro
>
>  Additional affiliations:
>  Euro-Mediterranean Centre for Climate Change
>  SPACI Consortium
>
>  E-mail massimo.caf...@unisalento.it
>         massimo.caf...@cmcc.it
>         caf...@ieee.org
>         caf...@acm.org
>
> ***
>

Re: [OMPI users] heterogenous cluster

2011-02-02 Thread jody

Thanks all

I did the simple copying of the 32Bit applications and now it works.
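
To double-check which flavour a copied binary actually is, the ELF header can
be inspected directly: the fifth byte (EI_CLASS) is 1 for 32-bit and 2 for
64-bit files. The sketch below assumes the file is an ELF binary and does not
verify the magic bytes; the function name `elf_bits` is made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: print 32 or 64 depending on the ELF class byte.
# $1 = path to a binary (assumed to be an ELF file, magic is not checked)
elf_bits() {
    # Skip 4 bytes, read 1 byte, print it as an unsigned decimal number.
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' ')
    case "$class" in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown ;;
    esac
}

# Usage:
#   elf_bits "$(command -v mpirun)"
```

The 'file' command gives the same information in a more readable form where
it is installed.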

Thanks
  Jody

On Wed, Feb 2, 2011 at 5:47 PM, David Mathog  wrote:
> jody  wrote:
>
>> How can i force OpenMPI to be built as a 32Bit application on a 64Bit
> machine?
>
> The easiest way is not to - just copy over a build from a 32 bit
> machine, it will run on your 64 bit machine if the proper 32 bit
> libraries have been installed there.  Otherwise,  you need to put -m32
> on the gcc command lines.  Generally one does that by something like:
>
>  export CFLAGS=-m32
>
> before running configure to generate Makefiles.
>
> Regards,
>
>
> David Mathog
> mat...@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

Re: [OMPI users] heterogenous cluster

2011-02-02 Thread jody

Thanks for your reply.

If i try your suggestion, every process fails with the following message:

*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[aim-triops:15460] Abort before MPI_INIT completed successfully; not
able to guarantee that all other processes were killed!

I think this is caused by the fact that on the 64Bit machine Open MPI
is also built as a 64 bit application.
How can i force OpenMPI to be built as a 32Bit application on a 64Bit machine?

Thank You
Jody

On Tue, Feb 1, 2011 at 9:00 PM, David Mathog  wrote:
>
>> I have so far used a homogeneous 32-bit cluster.
>> Now i have added a new machine which is 64 bit
>>
>> This means i have to reconfigure open MPI with
> `--enable-heterogeneous`, right?
>
> Not necessarily.  If you don't need the 64bit capabilities you could run
> 32 bit binaries along with a 32 bit version of OpenMPI.  At least that
> approach has worked so far for me.
>
> Regards,
>
> David Mathog
> mat...@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

[OMPI users] heterogenous cluster

2011-02-01 Thread jody

Hi

I have so far used a homogeneous 32-bit cluster.
Now i have added a new machine which is 64 bit

This means i have to reconfigure open MPI with `--enable-heterogeneous`, right?
Do i have to do this on every machine?
I don't remember all the options i had chosen when i first did the
configure - is there a way to find this out?

Thank You
  Jody

Re: [OMPI users] [SPAM:### 83%] problem when compiling ompenmpi V1.5.1

2010-12-16 Thread jody

Hi

if i remember correctly, "omp.h" is a header file for OpenMP, which is
not the same as Open MPI.
So it looks like you have to install OpenMP.
Then you can compile it with the compiler option -fopenmp (in gcc).

Jody


On Thu, Dec 16, 2010 at 11:56 AM, Bernard Secher - SFME/LGLS
 wrote:
> I get the following error message when I compile openmpi V1.5.1:
>
>   CXX    otfprofile-otfprofile.o
> ../../../../../../../../../openmpi-1.5.1-src/ompi/contrib/vt/vt/extlib/otf/tools/otfprofile/otfprofile.cpp:11:18:
> error: omp.h: No such file or directory
> ../../../../../../../../../openmpi-1.5.1-src/ompi/contrib/vt/vt/extlib/otf/tools/otfprofile/otfprofile.cpp:
> In function ‘int main(int, const char**)’:
> ../../../../../../../../../openmpi-1.5.1-src/ompi/contrib/vt/vt/extlib/otf/tools/otfprofile/otfprofile.cpp:325:
> error: ‘omp_set_num_threads’ was not declared in this scope
> ../../../../../../../../../openmpi-1.5.1-src/ompi/contrib/vt/vt/extlib/otf/tools/otfprofile/otfprofile.cpp:460:
> error: ‘omp_get_thread_num’ was not declared in this scope
> ../../../../../../../../../openmpi-1.5.1-src/ompi/contrib/vt/vt/extlib/otf/tools/otfprofile/otfprofile.cpp:471:
> error: ‘omp_get_num_threads’ was not declared in this scope
>
> The compiler doesn't find the omp.h file.
> What happens ?
>
> Best
> Bernard
>

Re: [OMPI users] Guaranteed run rank 0 on a given machine?

2010-12-12 Thread jody

In a similar situation i wrote a simple shell script "rankcreate.sh"
which creates a rank file assigning the various ranks to the correct
processors/slots when given a number of processes. In addition, this
script returns the name of this created rank file. I then use it like
this:

mpirun -np 5 --rankfile `rankcreate.sh 5` myApplication
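
The script itself is not shown in this thread. Purely as an illustration
of what such a generator has to produce, here is a minimal sketch in C
that fills a buffer with rankfile lines of the form "rank N=host slot=S",
handing out ranks to the hosts in round-robin order. The function name
write_rankfile, the host names and the one-slot-per-rank layout are
assumptions made for this example; they are not what rankcreate.sh does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write `np` rankfile lines into `buf`.
 * Rank r is placed on hosts[r % nhosts], slot r / nhosts.
 * Returns the number of characters written, or -1 if buf is too small. */
int write_rankfile(char *buf, size_t len, int np,
                   const char *const *hosts, int nhosts)
{
    assert(buf != NULL && hosts != NULL && nhosts > 0 && np >= 0);

    size_t used = 0;
    if (len == 0)
        return -1;
    buf[0] = '\0';

    for (int r = 0; r < np; r++) {
        int n = snprintf(buf + used, len - used, "rank %d=%s slot=%d\n",
                         r, hosts[r % nhosts], r / nhosts);
        if (n < 0 || (size_t)n >= len - used)
            return -1;          /* this line did not fit */
        used += (size_t)n;
    }
    return (int)used;
}
```

The resulting text could then be written to a temporary file whose name
is passed to mpirun via --rankfile, as in the command above.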

Maybe this is of use to you.

jody

On Fri, Dec 10, 2010 at 11:50 PM, Eugene Loh  wrote:
> David Mathog wrote:
>
>> Also, in my limited testing --host and -hostfile seem to be mutually
>> exclusive.
>>
> No.  You can use both together.  Indeed, the mpirun man page even has
> examples of this (though personally, I don't see having a use for this).  I
> think the idea was you might use a hostfile to define the nodes in your
> cluster and an mpirun command line that uses --host to select specific nodes
> from the file.
>
>> That is reasonable, but it isn't clear that it is intended.
>> Example, with a hostfile containing one entry for "monkey02.cluster
>> slots=1":
>>
>> mpirun  --host monkey01   --mca plm_rsh_agent rsh  hostname
>> monkey01.cluster
>>
>
> Okay.
>
>> mpirun  --host monkey02   --mca plm_rsh_agent rsh  hostname
>> monkey02.cluster
>>
>
> Okay.
>
>> mpirun  -hostfile /usr/common/etc/openmpi.machines.test1 \
>>  --mca plm_rsh_agent rsh  hostname
>> monkey02.cluster
>>
>
> Okay.
>
>> mpirun  --host monkey01  \
>>  -hostfile /usr/commom/etc/openmpi.machines.test1 \
>>  --mca plm_rsh_agent rsh  hostname
>> --
>> There are no allocated resources for the application  hostname
>> that match the requested mapping:
>>
>> Verify that you have mapped the allocated resources properly using the
>> --host or --hostfile specification.
>> --
>>
>
> Right.  Your hostfile has monkey02.  On the command line, you specify
> monkey01, but that's not in your hostfile.  That's a problem.  Just like on
> the mpirun man page.

[OMPI users] using totalview

2010-12-08 Thread jody

Hi
I am currently testing a demo version of totalview.
I am putting this question here, because the totalview
manual is very sparse on information about OpenMPI.

The first question is how to start totalview with mpirun.
I saw that mpirun has some inbuilt totalview capability.

  For debugging:

   -debug, --debug
  Invoke the user-level debugger indicated by the
  orte_base_user_debugger MCA parameter.

   -debugger, --debugger
  Sequence of debuggers to search for when --debug is  used  (i.e.
  a synonym for orte_base_user_debugger MCA parameter).

   -tv, --tv
  Launch processes under the TotalView debugger.  Deprecated back-
  wards compatibility flag. Synonym for --debug.

I tried 'mpirun -np 4 -tv HelloMPI' but that seemed to be debugging mpirun,
and i wasn't able to open the source window for HelloMPI.cpp.

I don't understand how the '--debug' option must be used;
in particular, i don't understand "user-level debugger indicated by
the orte_base_user_debugger MCA parameter."

Another question (which might be solved if i can correctly start up
totalview) concerns
the hostfile and rankfile parameters of mpirun: how can i start an
open mpi application with
totalview so that my application starts the processes on the correct
processors as
defined in hostfile and rankfile?

Thank You

Jody

Re: [OMPI users] message truncated error

2010-11-02 Thread jody

Hi Jack

> the buffersize is the same in two iterations.

this doesn't help if the message which is sent is larger than
buffersize in the second iteration.
But as David says, without the details of the message sending and
potential changes to the
receive buffer one can't make any precise diagnosis.

jody



On Mon, Nov 1, 2010 at 6:41 PM, Jack Bryan  wrote:
> thanks
> I use
> double* recvArray  = new double[buffersize];
> The receive buffer size
> MPI::COMM_WORLD.Recv(&(recvDataArray[0]), xVSize, MPI_DOUBLE, 0, mytaskTag);
> delete [] recvArray  ;
> In first iteration, the receiver works well.
> But, in second iteration ,
> I got the
> MPI_ERR_TRUNCATE: message truncated
> the buffersize is the same in two iterations.
>
> Any help is appreciated.
> thanks
> Nov. 1 2010
>
>> Date: Mon, 1 Nov 2010 08:08:08 +0100
>> From: jody@gmail.com
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] message truncated error
>>
>> Hi Jack
>>
>> Usually MPI_ERR_TRUNCATE means that the buffer you use in MPI_Recv
>> (or MPI::COMM_WORLD.Recv) is too small to hold the message coming in.
>> Check your code to make sure you assign enough memory to your buffers.
>>
>> regards
>> Jody
>>
>>
>> On Mon, Nov 1, 2010 at 7:26 AM, Jack Bryan  wrote:
>> > HI,
>> > In my MPI program, the master sends many messages to another worker with the
>> > same
>> > tag.
>> > The worker uses
>> > s
>> > MPI::COMM_WORLD.Recv(&message_para_to_one_worker, 1,
>> > message_para_to_workers_type, 0, downStreamTaskTag);
>> > to receive the messages
>> > I got error:
>> >
>> > [n36:94880] *** An error occurred in MPI_Recv
>> > [n36:94880] *** on communicator MPI_COMM_WORLD
>> > [n36:94880] *** MPI_ERR_TRUNCATE: message truncated
>> > [n36:94880] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> > [n36:94880] *** Process received signal ***
>> > [n36:94880] Signal: Segmentation fault (11)
>> > [n36:94880] Signal code: Address not mapped (1)
>> >
>> > Is this (the same tag) the reason for the errors?
>> > Any help is appreciated.
>> > thanks
>> > Jack
>> > Oct. 31 2010

Re: [OMPI users] link problem on 64bit platform

2010-11-02 Thread jody

HI
@trent
  no, i didn't use the other calls, because i think they are all the
same (on my installation they are all soft links to opal_wrapper)

@tim
  gentoo on 64 bit does have lib and lib64 directories for the
respective architectures (at / and at /usr)
  but in my 64-bit installation of openMPI there  is no lib64
directory, only a lib.
  I thought the naming of the internal directory structure of openMPI
would be determined by the installation
  (i.e. the `make install`) and not by the operating system.

@jeff
  i don't remember the particular CFLAGS or CXXFLAGS i had used, but i
have now rebuilt openMPI with
   ./configure CFLAGS=-m64 CXXFLAGS=-m64 \
     --prefix=/opt/openmpi-1.4.2-64 --with-threads --disable-mpi-f77 \
     --disable-mpi-f90
  and now the problem has been solved.

After something similar then happened when i tried to do 32bit
compilations, i think i found out what the original problem was:
I had first done a 64 bit installation of OpenMPI installed under
/opt/openmpi-1.4.2.
I later renamed this directory to /opt/openmpi-1.4.2-64, and installed
a 32bit version of OpenMPI in /opt/openmpi-1.4.2

Apparently, when i then tried to do a 64bit compilation, the linker
looked into the lib-directory with the *original* name
/opt/openmpi-1.4.2
instead of /opt/openmpi-1.4.2-64, so of course it only found the 32bit
libs of the newer installation.
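
A quick way to confirm this kind of mix-up is to check which word size a
library was built for. The `file` command does this by reading the ELF
header, where the byte at offset 4 (EI_CLASS) is 1 for 32-bit and 2 for
64-bit objects. The following is only a minimal sketch of that check in C;
the function name elf_word_size is invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 32 or 64 for an ELF file of that class,
 * 0 if the file is not ELF (or has an unknown class), -1 if unreadable. */
int elf_word_size(const char *path)
{
    assert(path != NULL);

    unsigned char ident[5];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(ident, 1, sizeof ident, f);
    fclose(f);

    /* ELF files start with the magic bytes 0x7f 'E' 'L' 'F'. */
    if (got != sizeof ident || memcmp(ident, "\x7f" "ELF", 4) != 0)
        return 0;

    switch (ident[4]) {         /* EI_CLASS */
    case 1:  return 32;         /* ELFCLASS32 */
    case 2:  return 64;         /* ELFCLASS64 */
    default: return 0;
    }
}
```

In the situation described above, calling it on
/opt/openmpi-1.4.2/lib/libmpi_cxx.so would be expected to return 32,
which is why the 64-bit linker skipped that library.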

To test this assumption i now renamed the 64-bit installation, set my
/opt/openmpi link to the new directory and tried to compile:
jody@aim-squid_0 ~/progs $ mpiCC -g -o HelloMPI HelloMPI.cpp
Cannot open configuration file
/opt/openmpi-1.4.2-64/share/openmpi/mpiCC-wrapper-data.txt
Error parsing data file mpiCC: Not found

So again, it looked into the original installation directory of the
64-bit installation for some files

So i guess the basic question is:
  is it permitted to rename openMPI installations, and if yes, how is
this properly done (since a simple mv doesn't work)?

Sorry about the imprecise question. Indeed, if i had looked exactly at
the original output, i should have noticed that
the linker was looking in the wrong directory.

Thank You
  Jody

Thanks anyway
  Jody

On Mon, Nov 1, 2010 at 1:52 PM, Tim Prince  wrote:
> On 11/1/2010 5:24 AM, Jeff Squyres wrote:
>>
>> On Nov 1, 2010, at 5:20 AM, jody wrote:
>>
>>> jody@aim-squid_0 ~/progs $ mpiCC -g -o HelloMPI HelloMPI.cpp
>>>
>>> /usr/lib/gcc/x86_64-pc-linux-gnu/4.4.4/../../../../x86_64-pc-linux-gnu/bin/ld:
>>> skipping incompatible /opt/openmpi-1.4.2/lib/libmpi_cxx.so when
>>> searching for -lmpi_cxx
>>
>> This is the key message -- it found libmpi_cxx.so, but the linker deemed
>> it incompatible, so it skipped it.
>
> Typically, it means that the cited library is a 32-bit one, to which the
> 64-bit ld will react in this way.  You could have verified this by
> file /opt/openmpi-1.4.2/lib/*
> By normal linux conventions a directory named /lib/ as opposed to /lib64/
> would contain only 32-bit libraries.  If gentoo doesn't conform with those
> conventions, maybe you should do your learning on a distro which does.
>
> --
> Tim Prince
>

[OMPI users] link problem on 64bit platform

2010-11-01 Thread jody

Hi
On a newly installed 64bit linux (2.6.32-gentoo-r7) with gcc version 4.4.4
i can't compile even simple Open-MPI applications (OpenMPI 1.4.2).

The message is:
jody@aim-squid_0 ~/progs $ mpiCC -g -o HelloMPI HelloMPI.cpp
/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.4/../../../../x86_64-pc-linux-gnu/bin/ld:
skipping incompatible /opt/openmpi-1.4.2/lib/libmpi_cxx.so when
searching for -lmpi_cxx
/usr/lib/gcc/x86_64-pc-linux-gnu/4.4.4/../../../../x86_64-pc-linux-gnu/bin/ld:
cannot find -lmpi_cxx
collect2: ld returned 1 exit status

I am using the 64bit mpiCC:
jody@aim-squid_0 ~/progs $ which mpiCC
/opt/openmpi/bin/mpiCC
jody@aim-squid_0 ~/progs $ ls -l /opt/openmpi
lrwxrwxrwx 1 root root 22 Nov  1 09:56 /opt/openmpi -> /opt/openmpi-1.4.2-64/

The mpi_cxx should be found in the lib subdirectory:
jody@aim-squid_0 ~/progs $ ls -l /opt/openmpi/lib/libmpi_cxx*
-rwxr-xr-x 1 root root   1073 Jun 24 15:50 /opt/openmpi/lib/libmpi_cxx.la
lrwxrwxrwx 1 root root 19 Jun 24 15:50
/opt/openmpi/lib/libmpi_cxx.so -> libmpi_cxx.so.0.0.1
lrwxrwxrwx 1 root root 19 Jun 24 15:50
/opt/openmpi/lib/libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
-rwxr-xr-x 1 root root 137442 Jun 24 15:50 /opt/openmpi/lib/libmpi_cxx.so.0.0.1

PATH and LD_LIBRARY_PATH contain the correct paths:
jody@aim-squid_0 ~/progs $ echo $PATH
/opt/openmpi/bin:/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/x86_64-pc-linux-gnu/gcc-bin/4.4.4
jody@aim-squid_0 ~/progs $ echo $LD_LIBRARY_PATH
/opt/openmpi/lib:

Am i missing something?

Thank You
  jody

Re: [OMPI users] message truncated error

2010-11-01 Thread jody

Hi Jack

Usually MPI_ERR_TRUNCATE means that the buffer you use in MPI_Recv
(or MPI::COMM_WORLD.Recv) is too small to hold the message coming in.
Check your code to make sure you assign enough memory to your buffers.

regards
Jody


On Mon, Nov 1, 2010 at 7:26 AM, Jack Bryan  wrote:
> HI,
> In my MPI program, the master sends many messages to another worker with the same
> tag.
> The worker uses
> s
> MPI::COMM_WORLD.Recv(&message_para_to_one_worker, 1,
> message_para_to_workers_type, 0, downStreamTaskTag);
> to receive the messages
> I got error:
>
> [n36:94880] *** An error occurred in MPI_Recv
> [n36:94880] *** on communicator MPI_COMM_WORLD
> [n36:94880] *** MPI_ERR_TRUNCATE: message truncated
> [n36:94880] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [n36:94880] *** Process received signal ***
> [n36:94880] Signal: Segmentation fault (11)
> [n36:94880] Signal code: Address not mapped (1)
>
> Is this (the same tag) the reason for the errors?
> Any help is appreciated.
> thanks
> Jack
> Oct. 31 2010

[OMPI users] Help with a strange error

2010-10-28 Thread jody

Hi
I have a (rather complex) OpenMPI application which works nicely.
In the main file i have the function main(), in which MPI_Comm_size()
and MPI_Comm_rank() are being called.

However, when i add a function check() to the main file, process 0 will
crash in PMPI_Comm_size(), even when the function check() is not called!
All other processes hang inside PMPI_Init().
The crash also occurs when the function check() is written after the
function main

The gdb stack trace for process 0:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1208715568 (LWP 10072)]
0x0016cb16 in PMPI_Comm_size () from /opt/openmpi/lib/libmpi.so.0
Current language:  auto; currently c
(gdb) where
#0  0x0016cb16 in PMPI_Comm_size () from /opt/openmpi/lib/libmpi.so.0
#1  0x080c379d in main (iArgC=14, apArgV=0xbfc60bc4) at TDMain.cpp:22
Missing separate debuginfos, use: debuginfo-install gcc.i386 zlib.i386
(gdb)

I am using OpenMPI 1.4.2

Has anybody got an idea how i could find the problem?
Thank You
  Jody

Re: [OMPI users] Using hostfile with default hostfile

2010-10-27 Thread jody

Where is the option 'default-hostfile' described?
It does not appear in mpirun's man page (for v. 1.4.2)
and i couldn't find anything like that with googling.

Jody

On Wed, Oct 27, 2010 at 4:02 PM, Ralph Castain  wrote:
> Specify your hostfile as the default one:
>
> mpirun --default-hostfile ./Cluster.hosts
>
> Otherwise, we take the default hostfile and then apply the hostfile as a 
> filter to select hosts from within it. Sounds strange, I suppose, but the 
> idea is that the default hostfile can contain configuration info (#sockets, 
> #cores/socket, etc.) that you might not want to have to put in every hostfile.
>
>
> On Oct 27, 2010, at 7:51 AM, Stefan Kuhne wrote:
>
>> Hello,
>>
>> my Cluster has a configured default hostfile.
>>
>> When i use another hostfile for one job i get:
>>
>> cluster-admin@Head:~/Cluster/hello$ mpirun --hostfile ../Cluster.hosts
>> ./hello
>> --
>> There are no allocated resources for the application
>>  ./hello
>> that match the requested mapping:
>>  ../Cluster.hosts
>>
>> Verify that you have mapped the allocated resources properly using the
>> --host or --hostfile specification.
>>
>> ...
>>
>> Any ideas for it?
>>
>> Regards,
>> Stefan Kuhne
>>

Re: [OMPI users] Running simple MPI program

2010-10-23 Thread jody

Hi Brandon
Does it work if you try this:
  mpirun -np 2 --hostfile hosts.txt ilk

(see http://www.open-mpi.org/faq/?category=running#simple-spmd-run)

jody

On Sat, Oct 23, 2010 at 4:07 PM, Brandon Fulcher  wrote:
> Thank you for the response!
>
> The code runs on my own machine as well.  Both machines, in fact.  And I did
> not build MPI but installed the package from the ubuntu repositories.
>
> The problem occurs when I try to run a job using two machines or simply try
> to run it on a slave from the master.
>
> the actual command I have run along with the output is below:
>
> mpirun -hostfile hosts.txt ilk
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
>
> where hosts.txt contains:
> 192.168.0.2 cpu=2
> 192.168.0.6 cpu=1
>
>
> If it matters the same output is given if I define a remote host in the
> command such as (if I am on 192.168.0.2)
> mpirun  -host 192.168.0.6 ilk
>
> Now if I run it locally, the job succeeds.  This works from either cpu.
> mpirun  ilk
>
>
> Thanks in advance.
>
> On Fri, Oct 22, 2010 at 11:59 PM, David Zhang  wrote:
>>
>> since you said you're new to MPI, what command did you use to run the 2
>> processes?
>>
>> On Fri, Oct 22, 2010 at 9:58 PM, David Zhang 
>> wrote:
>>>
>>> Your code works on my machine. It could be the way you built MPI.
>>>
>>> On Fri, Oct 22, 2010 at 7:26 PM, Brandon Fulcher 
>>> wrote:
>>>>
>>>> Hi, I am completely new to MPI and am having trouble running a job
>>>> between two  cpus.
>>>>
>>>> The same thing happens no matter what MPI job I try to run, but here is
>>>> a simple 'hello world' style program I am trying to run.
>>>>
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>   int *buf, i, rank, nints, len;
>>>>   char hostname[256];
>>>>
>>>>   MPI_Init(&argc,&argv);
>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>   gethostname(hostname,255);
>>>>   printf("Hello world!  I am process number: %d on host %s\n", rank,
>>>> hostname);
>>>>   MPI_Finalize();
>>>>   return 0;
>>>> }
>>>>
>>>>
>>>> On either CPU, I can successfully compile and run, but when trying to
>>>> run the program using two CPUS it fails with this output:
>>>>
>>>>
>>>> --
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>>
>>>> --
>>>>
>>>>
>>>> With no additional information or errors,  What can I do to go about
>>>> finding out what is wrong?
>>>>
>>>>
>>>>
>>>> I have read the FAQ and followed the instructions.  I can ssh into the
>>>> slave without entering a password and have the libraries installed on both
>>>> machines.
>>>>
>>>> The only thing pertinent I could find is this faq
>>>> http://www.open-mpi.org/faq/?category=running#missing-prereqs  but I do not
>>>> know if it applies since I have installed open mpi from the Ubuntu
>>>> repositories and assume the libraries are correctly set.
>>>>
>>>
>>>
>>>
>>> --
>>> David Zhang
>>> University of California, San Diego
>>
>>
>>
>> --
>> David Zhang
>> University of California, San Diego
>>

Re: [OMPI users] Question about MPI_Barrier

2010-10-21 Thread jody

Hi

I don't know the reason for the strange behaviour, but anyway,
to measure time in an MPI application you should use MPI_Wtime(), not clock()
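
The background to this advice: clock() reports the CPU time consumed by
the process, not elapsed wall-clock time, so time during which the process
is idle is not counted, whereas MPI_Wtime() returns wall-clock seconds.
The following is only a minimal sketch in plain C that shows the
difference. It uses the POSIX call clock_gettime(CLOCK_MONOTONIC) as a
stand-in for MPI_Wtime() so that it builds without MPI, and the helper
name measure_sleep is made up for this illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical helper: sleep for `ms` milliseconds and report how much
 * CPU time (clock()) and wall-clock time (clock_gettime()) went by. */
void measure_sleep(long ms, double *cpu_s, double *wall_s)
{
    assert(ms >= 0 && cpu_s != NULL && wall_s != NULL);

    struct timespec t0, t1;
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };

    clock_t c0 = clock();
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* The process is idle here: wall time advances, CPU time barely does. */
    while (nanosleep(&req, &req) != 0) {
        /* interrupted by a signal: sleep for the remaining time */
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    clock_t c1 = clock();

    *cpu_s  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    *wall_s = (double)(t1.tv_sec - t0.tv_sec)
            + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling measure_sleep(200, &cpu, &wall) should leave wall close to 0.2
while cpu stays near zero. In an MPI program the equivalent would be two
calls to MPI_Wtime() around the region being timed.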

regards
  jody

On Wed, Oct 20, 2010 at 11:51 PM, Storm Zhang  wrote:
> Dear all,
>
> I got confused with my recent C++ MPI program's behavior. I have an MPI
> program in which I use clock() to measure the time spent between two
> MPI_Barrier calls, just like this:
>
> MPI::COMM_WORLD.Barrier();
> if (rank == master) t1 = clock();
> "code A";
> MPI::COMM_WORLD.Barrier();
> if (rank == master) t2 = clock();
> "code B";
>
> I need to measure t2-t1 to see the time spent on code A between these
> two MPI_Barriers. I notice that if I comment out code B, the time seems much
> less than the original time (almost half). How does this happen? What is a possible
> reason for it? I have no idea.
>
> Thanks for your help.
>
> Linbao

Re: [OMPI users] my leak or OpenMPI's leak?

2010-10-18 Thread jody

But shouldn't something like this show up in the other processes as well?
I only see that in the master process, but the slave processes also
send data to each other and to the master.


On Mon, Oct 18, 2010 at 2:48 PM, Ralph Castain  wrote:
>
> On Oct 18, 2010, at 1:41 AM, jody wrote:
>
>> I had this leak with OpenMPI 1.4.2
>>
>> But in my case, there is no accumulation - when i repeat the same call,
>> no additional leak is reported for the second call
>
> That's because it grabs a larger-than-required chunk of memory just in case 
> you call again. This helps performance by reducing the number of malloc's in 
> your application.
>
>
>>
>> Jody
>>
>> On Mon, Oct 18, 2010 at 1:57 AM, Ralph Castain  wrote:
>>> There is no OMPI 2.5 - do you mean 1.5?
>>>
>>> On Oct 17, 2010, at 4:11 PM, Brian Budge wrote:
>>>
>>>> Hi Jody -
>>>>
>>>> I noticed this exact same thing the other day when I used OpenMPI v
>>>> 2.5 built with valgrind support.  I actually ran out of memory due to
>>>> this.  When I went back to v 2.43, my program worked fine.
>>>>
>>>> Are you also using 2.5?
>>>>
>>>>  Brian
>>>>
>>>> On Wed, Oct 6, 2010 at 4:32 AM, jody  wrote:
>>>>> Hi
>>>>> I regularly use valgrind to check for leaks, but i ignore the leaks
>>>>> clearly created by OpenMPI,
>>>>> because i think most of them happen because of efficiency (lose no
>>>>> time cleaning up unimportant leaks).
>>>>> But i want to make sure no leaks come from my own apps.
>>>>> In most of the cases, leaks i am responsible for have the name of one
>>>>> of my files at the bottom of the stack printed by valgrind,
>>>>> and no internal OpenMPI-calls above, whereas leaks clearly caused by
>>>>> OpenMPI have something like
>>>>> ompi_mpi_init, mca_pml_base_open, PMPI_Init etc at or very near the 
>>>>> bottom.
>>>>>
>>>>> Now i have an application where i am completely unsure where the
>>>>> responsibility for a particular leak lies. valgrind  shows (among
>>>>> others) this report
>>>>>
>>>>> ==2756== 9,704 (8,348 direct, 1,356 indirect) bytes in 1 blocks are
>>>>> definitely lost in loss record 2,033 of 2,036
>>>>> ==2756==    at 0x4005943: malloc (vg_replace_malloc.c:195)
>>>>> ==2756==    by 0x4049387: ompi_free_list_grow (in
>>>>> /opt/openmpi-1.4.2.p/lib/libmpi.so.0.0.2)
>>>>> ==2756==    by 0x41CA613: ???
>>>>> ==2756==    by 0x41BDD91: ???
>>>>> ==2756==    by 0x41B0C3D: ???
>>>>> ==2756==    by 0x408AC9C: PMPI_Send (in
>>>>> /opt/openmpi-1.4.2.p/lib/libmpi.so.0.0.2)
>>>>> ==2756==    by 0x8123377: ConnectorBase::send(CollectionBase*,
>>>>> std::pair,
>>>>> std::pair >&) (ConnectorBase.cpp:39)
>>>>> ==2756==    by 0x8123CEE: TileConnector::sendTile() (TileConnector.cpp:36)
>>>>> ==2756==    by 0x80C6839: TDMaster::init(int, char**) (TDMaster.cpp:226)
>>>>> ==2756==    by 0x80C167B: main (TDMain.cpp:24)
>>>>> ==2756==
>>>>>
>>>>> At first glance it looks like an OpenMPI-internal leak,
>>>>> because it happens inside PMPI_Send,
>>>>> but then i am using the function ConnectorBase::send()
>>>>> several times from other callers than TileConnector,
>>>>> but these don't show up in valgrind's output.
>>>>>
>>>>> Does anybody have an idea what is happening here?
>>>>>
>>>>> Thank You
>>>>> jody

Re: [OMPI users] my leak or OpenMPI's leak?

2010-10-18 Thread jody

I had this leak with OpenMPI 1.4.2

But in my case, there is no accumulation - when i repeat the same call,
no additional leak is reported for the second call

Jody

On Mon, Oct 18, 2010 at 1:57 AM, Ralph Castain  wrote:
> There is no OMPI 2.5 - do you mean 1.5?
>
> On Oct 17, 2010, at 4:11 PM, Brian Budge wrote:
>
>> Hi Jody -
>>
>> I noticed this exact same thing the other day when I used OpenMPI v
>> 2.5 built with valgrind support.  I actually ran out of memory due to
>> this.  When I went back to v 2.43, my program worked fine.
>>
>> Are you also using 2.5?
>>
>>  Brian
>>
>> On Wed, Oct 6, 2010 at 4:32 AM, jody  wrote:
>>> Hi
>>> I regularly use valgrind to check for leaks, but i ignore the leaks
>>> clearly created by OpenMPI,
>>> because i think most of them happen because of efficiency (lose no
>>> time cleaning up unimportant leaks).
>>> But i want to make sure no leaks come from my own apps.
>>> In most of the cases, leaks i am responsible for have the name of one
>>> of my files at the bottom of the stack printed by valgrind,
>>> and no internal OpenMPI-calls above, whereas leaks clearly caused by
>>> OpenMPI have something like
>>> ompi_mpi_init, mca_pml_base_open, PMPI_Init etc at or very near the bottom.
>>>
>>> Now i have an application where i am completely unsure where the
>>> responsibility for a particular leak lies. valgrind  shows (among
>>> others) this report
>>>
>>> ==2756== 9,704 (8,348 direct, 1,356 indirect) bytes in 1 blocks are
>>> definitely lost in loss record 2,033 of 2,036
>>> ==2756==    at 0x4005943: malloc (vg_replace_malloc.c:195)
>>> ==2756==    by 0x4049387: ompi_free_list_grow (in
>>> /opt/openmpi-1.4.2.p/lib/libmpi.so.0.0.2)
>>> ==2756==    by 0x41CA613: ???
>>> ==2756==    by 0x41BDD91: ???
>>> ==2756==    by 0x41B0C3D: ???
>>> ==2756==    by 0x408AC9C: PMPI_Send (in
>>> /opt/openmpi-1.4.2.p/lib/libmpi.so.0.0.2)
>>> ==2756==    by 0x8123377: ConnectorBase::send(CollectionBase*,
>>> std::pair,
>>> std::pair >&) (ConnectorBase.cpp:39)
>>> ==2756==    by 0x8123CEE: TileConnector::sendTile() (TileConnector.cpp:36)
>>> ==2756==    by 0x80C6839: TDMaster::init(int, char**) (TDMaster.cpp:226)
>>> ==2756==    by 0x80C167B: main (TDMain.cpp:24)
>>> ==2756==
>>>
>>> At first glance it looks like an OpenMPI-internal leak,
>>> because it happens inside PMPI_Send,
>>> but then i am using the function ConnectorBase::send()
>>> several times from other callers than TileConnector,
>>> but these don't show up in valgrind's output.
>>>
>>> Does anybody have an idea what is happening here?
>>>
>>> Thank You
>>> jody

Re: [OMPI users] connecting to MPI from outside

2010-10-12 Thread jody

Hi Mahesh

At least in simple cases you can use normal socket functions for this.

I used this in order to change the run-time behaviour of a
master-worker MPI application. I implemented a simple TCP server
which runs in a separate thread on the master processor; connecting to
this server i could then send commands which changed the state of the
master.

Jody

On Tue, Oct 12, 2010 at 6:14 AM, Mahesh Salunkhe
 wrote:
>
> Hello,
>   Could you please tell me how to connect a client (not in any mpi group) to a
>    process in a mpi group.
>    (i.e.  just like we do in socket programming by using connect( ) call).
>   Does mpi provide any call for accepting connections from outside
>    processes?
>
> --
> Regards
> Mahesh
>

[OMPI users] my leak or OpenMPI's leak?

2010-10-06 Thread jody

Hi
I regularly use valgrind to check for leaks, but i ignore the leaks
clearly created by OpenMPI,
because i think most of them happen because of efficiency (lose no
time cleaning up unimportant leaks).
But i want to make sure no leaks come from my own apps.
In most of the cases, leaks i am responsible for have the name of one
of my files at the bottom of the stack printed by valgrind,
and no internal OpenMPI-calls above, whereas leaks clearly caused by
OpenMPI have something like
ompi_mpi_init, mca_pml_base_open, PMPI_Init etc at or very near the bottom.

Now i have an application where i am completely unsure where the
responsibility for a particular leak lies. valgrind  shows (among
others) this report

==2756== 9,704 (8,348 direct, 1,356 indirect) bytes in 1 blocks are
definitely lost in loss record 2,033 of 2,036
==2756==    at 0x4005943: malloc (vg_replace_malloc.c:195)
==2756==    by 0x4049387: ompi_free_list_grow (in
/opt/openmpi-1.4.2.p/lib/libmpi.so.0.0.2)
==2756==    by 0x41CA613: ???
==2756==    by 0x41BDD91: ???
==2756==    by 0x41B0C3D: ???
==2756==    by 0x408AC9C: PMPI_Send (in
/opt/openmpi-1.4.2.p/lib/libmpi.so.0.0.2)
==2756==    by 0x8123377: ConnectorBase::send(CollectionBase*,
std::pair,
std::pair >&) (ConnectorBase.cpp:39)
==2756==    by 0x8123CEE: TileConnector::sendTile() (TileConnector.cpp:36)
==2756==    by 0x80C6839: TDMaster::init(int, char**) (TDMaster.cpp:226)
==2756==    by 0x80C167B: main (TDMain.cpp:24)
==2756==

At first glance it looks like an OpenMPI-internal leak,
because it happens inside PMPI_Send,
but then i am using the function ConnectorBase::send()
several times from other callers than TileConnector,
but these don't show up in valgrind's output.

Does anybody have an idea what is happening here?

Thank You
jody

Re: [OMPI users] a question about [MPI]IO on systems without network filesystem

2010-09-29 Thread jody

Hi Paul

> Is it possible to configure/run OpenMPI in a such way, that only _one_
> process (e.g. master) performs real disk I/O, and other processes sends the
> data to the master which works as an agent?

It is possible to run OpenMPI this way, but that is not a matter of
configuration: you have to implement it in your application.

> Of course this would impacts the performance, because all data must be send
> over network, and the master may became a bottleneck. But is such scenario -
> IO of all processes bundled to one  process - practicable at all?

I think this question can only be answered by trying, because it
depends strongly
on the volume of your messages and the quality of your hardware
(network and disk speed)

Jody

Re: [OMPI users] Thread as MPI process

2010-09-21 Thread jody

Hi
I don't know if i correctly understand what you need, but have you
already tried  MPI_Comm_spawn?

Jody

On Mon, Sep 20, 2010 at 11:24 PM, Mikael Lavoie  wrote:
> Hi,
>
> I wanna know if it exist a implementation that permit to run a single host
> process on the master of the cluster, that will then spawn 1 process per -np
> X defined thread at the host specified in the host list. The host will then
> act as a syncronized sender/collecter of the work done.
>
> It would really be the saint-graal of the MPI implementation to me, for the
> use i wanna make of it.
>
> So i wait your answer, hoping that this exist,
>
> Mikael Lavoie

Re: [OMPI users] MPI_Reduce performance

2010-09-09 Thread jody

Hi
@Ashley:
What are the exact semantics of an asynchronous barrier,
and is it part of the MPI specs?

Thanks
  Jody

On Thu, Sep 9, 2010 at 9:34 PM, Ashley Pittman  wrote:
>
> On 9 Sep 2010, at 17:00, Gus Correa wrote:
>
>> Hello All
>>
>> Gabrielle's question, Ashley's recipe, and Dick Treutmann's cautionary 
>> words, may be part of a larger context of load balance, or not?
>>
>> Would Ashley's recipe of sporadic barriers be a silver bullet to
>> improve load imbalance problems, regardless of which collectives or
>> even point-to-point calls are in use?
>
> No, it only holds where there is no data dependency between some of the 
> ranks, in particular if there are any non-rooted collectives in an iteration 
> of your code then it cannot make any difference at all, likewise if you have 
> a reduce followed by a barrier using the same root for example then you 
> already have global synchronisation each iteration and it won't help.  My 
> feeling is that it applies to a significant minority of problems, certainly 
> the phrase "adding barriers can make codes faster" should be textbook stuff 
> if it isn't already.
>
>> Would sporadic barriers in the flux coupler "shake up" these delays?
>
> I don't fully understand your description but it sounds like it might set the
> program back to a clean slate which would give you per-iteration delays only
> rather than cumulative or worse delays.
>
>> Ashley:  How did you get to the magic number of 25 iterations for the
>> sporadic barriers?
>
> Experience and finger in the air.  The major factors in picking this number
> are the likelihood of a positive feedback cycle of delays happening, the
> delays these delays add and the cost of a barrier itself.  Having too low a
> value will slightly reduce performance, having too high a value can
> drastically reduce performance.
>
> As a further item (because I like them) the asynchronous barrier is even 
> better again if used properly, in the good case it doesn't cause any process 
> to block ever so the cost is only that of the CPU cycles the code takes 
> itself, in the bad case where it has to delay a rank then this tends to have 
> a positive impact on performance.
>
>> Would it be application/communicator pattern dependent?
>
> Absolutely.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk

Re: [OMPI users] OpenMPI Segmentation fault (11)

2010-07-26 Thread jody

Hi Jack

Yes to both questions. Best to download it directly from their page:
  http://www.valgrind.org/downloads/current.html
then you are sure to get the newest version.

Another way to manage your output is to use the '-output-filename' option of
mpirun (or mpiexec),
which will redirect the outputs (stdout, stderr and stddiag) of your processes
into separate text files - check the man pages for 'mpirun'.

If you don't need to see the output of all your processes, but still want
to use xterminals,  you can use the '-xterm' option of
mpirun, where you can select which ranks should open an xterm.
(Again check the man pages of mpirun)

Jody


On Mon, Jul 26, 2010 at 8:55 AM, Jack Bryan  wrote:
> Thanks
> It can be installed on linux and work with gcc ?
> If I have many processes, such as 30, I have to open 30 terminal windows ?
> thanks
> Jack
>
>> Date: Mon, 26 Jul 2010 08:23:57 +0200
>> From: jody@gmail.com
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] OpenMPI Segmentation fault (11)
>>
>> Hi Jack
>>
>> Have you tried to run your application under valgrind?
>> Even though applications generally run slower under valgrind,
>> it may detect memory errors before the actual crash happens.
>>
>> The best would be to start a terminal window for each of your processes
>> so you can see valgrind's output for each process separately.
>>
>> Jody
>>
>> On Mon, Jul 26, 2010 at 4:08 AM, Jack Bryan 
>> wrote:
>> > Dear All,
>> > I run a 6 parallel processes on OpenMPI.
>> > When the run-time of the program is short, it works well.
>> > But, if the run-time is long, I got errors:
>> > [n124:45521] *** Process received signal ***
>> > [n124:45521] Signal: Segmentation fault (11)
>> > [n124:45521] Signal code: Address not mapped (1)
>> > [n124:45521] Failing at address: 0x44
>> > [n124:45521] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0]
>> > [n124:45521] [ 1] /lib64/libc.so.6(strlen+0x10) [0x3c50278d60]
>> > [n124:45521] [ 2] /lib64/libc.so.6(_IO_vfprintf+0x4479) [0x3c50246b19]
>> > [n124:45521] [ 3] /lib64/libc.so.6(_IO_printf+0x9a) [0x3c5024d3aa]
>> > [n124:45521] [ 4] /home/path/exec [0x40ec9a]
>> > [n124:45521] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4)
>> > [0x3c5021d974]
>> > [n124:45521] [ 6] /home/path/exec [0x401139]
>> > [n124:45521] *** End of error message ***
>> > It seems that there may be some problems about memory management.
>> > But, I cannot find the reason.
>> > My program needs to write results to some files.
>> > If I open the files too many without closing them, I may get the above
>> > errors.
>> > But, I have removed the writing files from my program.
>> > The problem appears again when the program runs longer time.
>> > Any help is appreciated.
>> > Jack
>> > July 25  2010

Re: [OMPI users] OpenMPI Segmentation fault (11)

2010-07-26 Thread jody

Hi Jack

Have you tried to run your application under valgrind?
Even though applications generally run slower under valgrind,
it may detect memory errors before the actual crash happens.

The best would be to start a terminal window for each of your processes
so you can see valgrind's output for each process separately.

Jody

On Mon, Jul 26, 2010 at 4:08 AM, Jack Bryan  wrote:
> Dear All,
> I run a 6 parallel processes on OpenMPI.
> When the run-time of the program is short, it works well.
> But, if the run-time is long, I got errors:
> [n124:45521] *** Process received signal ***
> [n124:45521] Signal: Segmentation fault (11)
> [n124:45521] Signal code: Address not mapped (1)
> [n124:45521] Failing at address: 0x44
> [n124:45521] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0]
> [n124:45521] [ 1] /lib64/libc.so.6(strlen+0x10) [0x3c50278d60]
> [n124:45521] [ 2] /lib64/libc.so.6(_IO_vfprintf+0x4479) [0x3c50246b19]
> [n124:45521] [ 3] /lib64/libc.so.6(_IO_printf+0x9a) [0x3c5024d3aa]
> [n124:45521] [ 4] /home/path/exec [0x40ec9a]
> [n124:45521] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c5021d974]
> [n124:45521] [ 6] /home/path/exec [0x401139]
> [n124:45521] *** End of error message ***
> It seems that there may be some problems about memory management.
> But, I cannot find the reason.
> My program needs to write results to some files.
> If I open the files too many without closing them, I may get the above
> errors.
> But, I have removed the writing files from my program.
> The problem appears again when the program runs longer time.
> Any help is appreciated.
> Jack
> July 25  2010

Re: [OMPI users] hpw to log output of spawned processes

2010-07-13 Thread jody

Thanks for the patch - it works fine!

Jody

On Mon, Jul 12, 2010 at 11:38 PM, Ralph Castain  wrote:
> Just so you don't have to wait for 1.4.3 to be released, here is the patch.
> Ralph
>
>
>
>
> On Jul 12, 2010, at 2:44 AM, jody wrote:
>
>> yes, i'm using 1.4.2
>>
>> Thanks
>>  Jody
>>
>> On Mon, Jul 12, 2010 at 10:38 AM, Ralph Castain  wrote:
>>>
>>> On Jul 12, 2010, at 2:17 AM, jody wrote:
>>>
>>>> Hi
>>>>
>>>> I have a master process which spawns a number of workers of which i'd
>>>> like to  save the output in separate files.
>>>>
>>>> Usually i use the '-output-filename' option in such a situation.
>>>> However, if i do
>>>>  mpirun -np 1 -output-filename work_out master arg1 arg2
>>>> all the files work_out.1, work_out.2, ... are ok,
>>>> but work_out.0 contains both outputs of the master process(process 0
>>>> in COMM_WORLD) and
>>>> of the first worker (process 0 in the communicator of the spawned 
>>>> processes).
>>>
>>> Crud - that's a bug.
>>>
>>>>
>>>> I also tried the '-tag-output' option, but this involves several
>>>> additional steps,
>>>> because i have to separate the combined outputs
>>>>  mpirun -np 1 -tag-output  master arg1 arg2 > total.out
>>>>  grep "\[1,0\]" total.out | sed 's/\[1,0\]://' > master.out
>>>>  grep "\[2,0\]" total.out | sed 's/\[2,0\]://' > worker_0.out
>>>>  grep "\[2,1\]" total.out | sed 's/\[2,1\]://' > worker_1.out
>>>>  ...
>>>> Of course, this could be wrapped in a script,  but it is a bit cumbersome
>>>> (and i am not sure if the job-ids are always "1" and "2") ...
>>>>
>>>> Is there some simpler way to separate the output of the two streams?
>>>
>>> Not really.
>>>
>>>>
>>>> If not, would it be possible to extend the -output-filename option in
>>>> such a way that it
>>>> would also combine job-id and rank with the output file name:
>>>>  work_out.1.0
>>>> for the master's output, and
>>>>  work_out.2.0
>>>>  work_out.2.1
>>>>  work_out.2.2
>>>>  ...
>>>> for the worker's output?
>>>
>>> Yeah, I can do that - will put something together. Are you doing this in 
>>> the 1.4 series?
>>>
>>>>
>>>> Thank You
>>>>  Jody

Re: [OMPI users] Dynamic process tutorials?

2010-07-12 Thread jody

Hi Brian

Generally it is possible to create new communicators from existing ones
(see for instance the various MPI_GROUP_* functions and MPI_COMM_CREATE)

> Also, how can you specify with MPI_Comm_spawn/multiple() how do you
> specify IP addresses on which to start the processes?
I haven't tried it yet with spawning, but i'd think this would also be
done by a rankfile

> I would prefer not to use any of the MPI command-line utilities
> (mpirun/mpiexec) if that's possible.
If you don't like command-line utilities, you can write some graphic tool
which will call mpirun or mpiexec. But somewhere you have to tell OpenMPI
what to run on how many processors etc.

I'd suggest you take a look at the "MPI-The Complete Reference" Vol I and II

Jody

On Mon, Jul 12, 2010 at 5:07 PM, Brian Budge  wrote:
> Hi Jody -
>
> Thanks for the reply.  is there a way of "fusing" intercommunicators?
> Let's say I have a higher level node scheduler, and it makes a new
> node available to a COMM that is already running.  So the master
> spawns another process for that node.  How can the new process
> communicate with the other already started processes?
>
> Also, how can you specify with MPI_Comm_spawn/multiple() how do you
> specify IP addresses on which to start the processes?
>
> If my higher level node scheduler needs to take away a process from my
> COMM, is it good/bad for that node to call MPI_Finalize as it exits?
>
> I would prefer not to use any of the MPI command-line utilities
> (mpirun/mpiexec) if that's possible.
>
> Thanks,
>  Brian
>
> On Sat, Jul 10, 2010 at 11:53 PM, jody  wrote:
>> Hi Brian
>> When you spawn processes with MPI_Comm_spawn(), one of the arguments
>> will be set to an intercommunicator of the spawner and the spawnees.
>> You can use this intercommunicator as the communicator argument
>> in the MPI functions.
>>
>> Jody
>> On Fri, Jul 9, 2010 at 5:56 PM, Brian Budge  wrote:
>>> Hi all -
>>>
>>> I've been looking at the dynamic process features of mpi-2.  I have managed
>>> to actually launch processes using spawn, but haven't seen examples for
>>> actually communicating once these processes are launched.  I am additionally
>>> interested in how processes created through multiple spawn calls can
>>> communicate.
>>>
>>> Does anyone know of resources that describe these topics?  My google-fu must
>>> not be up to par :)
>>>
>>> Thanks,
>>>   Brian

Re: [OMPI users] OpenMPI how large its buffer size ?

2010-07-12 Thread jody

Hi
> mpi_irecv(workerNodeID, messageTag, bufferVector[row][column])
OpenMPI contains no function of this form.
There is MPI_Irecv, but it takes a different number of arguments.

Or is this a boost method?
If yes, i guess you have to make sure that the
bufferVector[row][column] is large enough...
Perhaps there is a boost forum you can check out if the problem persists

Jody


On Sun, Jul 11, 2010 at 10:13 AM, Jack Bryan  wrote:
> thanks for your reply.
> The message size is 72 bytes.
> The master sends out the message package to each 51 nodes.
> Then, after doing their local work, the worker node send back the same-size
> message to the master.
> Master use vector.push_back(new messageType) to receive each message from
> workers.
> Master use the
> mpi_irecv(workerNodeID, messageTag, bufferVector[row][column])
> to receive the worker message.
> the row is the rankID of each worker, the column is index for  message from
> worker.
> Each worker may send multiple messages to master.
> when the worker node size is large, i got MPI_ERR_TRUNCATE error.
> Any help is appreciated.
> JACK
> July 10  2010
>
> 
> Date: Sat, 10 Jul 2010 23:12:49 -0700
> From: eugene@oracle.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OpenMPI how large its buffer size ?
>
> Jack Bryan wrote:
>
> The master node can receive message ( the same size)  from 50 worker nodes.
> But, it cannot receive message from 51 nodes. It caused "truncate error".
>
> How big was the buffer that the program specified in the receive call?  How
> big was the message that was sent?
>
> MPI_ERR_TRUNCATE means that you posted a receive with an application buffer
> that turned out to be too small to hold the message that was received.  It's
> a user application error that has nothing to do with MPI's internal
> buffers.  MPI's internal buffers don't need to be big enough to hold that
> message.  MPI could require the sender and receiver to coordinate so that
> only part of the message is moved at a time.
>
> I used the same buffer to get the message in 50 node case.
> About the "rendezvous" protocol, what is the meaning of "the sender sends a
> short portion"?
> What is the "short portion", is it a small part of the message of the sender
> ?
>
> It's at least the message header (communicator, tag, etc.) so that the
> receiver can figure out if this is the expected message or not.  In
> practice, there is probably also some data in there as well.  The amount of
> that portion depends on the MPI implementation and, in practice, the
> interconnect the message traveled over, MPI-implementation-dependent
> environment variables set by the user, etc.  E.g., with OMPI over shared
> memory by default it's about 4Kbytes (if I remember correctly).
>
> This "rendezvous" protocol" can work automatically in background without
> programmer
> indicates in his program ?
>
> Right.  MPI actually allows you to force such synchronization with
> MPI_Ssend, but typically MPI implementations use it automatically for
> "plain" long sends as well even if the user did not use MPI_Ssend.
>
> The "acknowledgement " can be generated by the receiver only when the
> corresponding mpi_irecv is posted by the receiver ?
>
> Right.

Re: [OMPI users] hpw to log output of spawned processes

2010-07-12 Thread jody

yes, i'm using 1.4.2

Thanks
  Jody

On Mon, Jul 12, 2010 at 10:38 AM, Ralph Castain  wrote:
>
> On Jul 12, 2010, at 2:17 AM, jody wrote:
>
>> Hi
>>
>> I have a master process which spawns a number of workers of which i'd
>> like to  save the output in separate files.
>>
>> Usually i use the '-output-filename' option in such a situation.
>> However, if i do
>>  mpirun -np 1 -output-filename work_out master arg1 arg2
>> all the files work_out.1, work_out.2, ... are ok,
>> but work_out.0 contains both outputs of the master process(process 0
>> in COMM_WORLD) and
>> of the first worker (process 0 in the communicator of the spawned processes).
>
> Crud - that's a bug.
>
>>
>> I also tried the '-tag-output' option, but this involves several
>> additional steps,
>> because i have to separate the combined outputs
>>  mpirun -np 1 -tag-output  master arg1 arg2 > total.out
>>  grep "\[1,0\]" total.out | sed 's/\[1,0\]://' > master.out
>>  grep "\[2,0\]" total.out | sed 's/\[2,0\]://' > worker_0.out
>>  grep "\[2,1\]" total.out | sed 's/\[2,1\]://' > worker_1.out
>>  ...
>> Of course, this could be wrapped in a script,  but it is a bit cumbersome
>> (and i am not sure if the job-ids are always "1" and "2") ...
>>
>> Is there some simpler way to separate the output of the two streams?
>
> Not really.
>
>>
>> If not, would it be possible to extend the -output-filename option in
>> such a way that it
>> would also combine job-id and rank with the output file name:
>>  work_out.1.0
>> for the master's output, and
>>  work_out.2.0
>>  work_out.2.1
>>  work_out.2.2
>>  ...
>> for the worker's output?
>
> Yeah, I can do that - will put something together. Are you doing this in the 
> 1.4 series?
>
>>
>> Thank You
>>  Jody

[OMPI users] hpw to log output of spawned processes

2010-07-12 Thread jody

Hi

I have a master process which spawns a number of workers of which i'd
like to  save the output in separate files.

Usually i use the '-output-filename' option in such a situation.
However, if i do
  mpirun -np 1 -output-filename work_out master arg1 arg2
all the files work_out.1, work_out.2, ... are ok,
but work_out.0 contains both outputs of the master process(process 0
in COMM_WORLD) and
of the first worker (process 0 in the communicator of the spawned processes).

I also tried the '-tag-output' option, but this involves several
additional steps,
because i have to separate the combined outputs
  mpirun -np 1 -tag-output  master arg1 arg2 > total.out
  grep "\[1,0\]" total.out | sed 's/\[1,0\]://' > master.out
  grep "\[2,0\]" total.out | sed 's/\[2,0\]://' > worker_0.out
  grep "\[2,1\]" total.out | sed 's/\[2,1\]://' > worker_1.out
  ...
Of course, this could be wrapped in a script,  but it is a bit cumbersome
(and i am not sure if the job-ids are always "1" and "2") ...
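Such a wrapper does not have to hard-code the job ids. As a sketch, a small C filter could read the tag from each line and derive the output file name from it. The exact tag format is an assumption here: the helper accepts "[job,rank]", optionally followed by a "<stream>" marker, then ":". The names parse_tag and output_name are illustrative and not part of OpenMPI.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse a leading "[job,rank]" tag. On success, store job and rank and return
 * a pointer to the text after the tag (an optional "<stream>" marker and a ':'
 * are skipped). Return NULL if the line does not start with such a tag; in
 * that case *job and *rank may have been overwritten. */
const char *parse_tag(const char *line, int *job, int *rank) {
    int consumed = 0;
    assert(line != NULL && job != NULL && rank != NULL);
    /* %n is only reached (and consumed only set) if the closing ']' matched */
    if (sscanf(line, "[%d,%d]%n", job, rank, &consumed) != 2 || consumed == 0) {
        return NULL;
    }
    const char *p = line + consumed;
    if (*p == '<') {
        const char *close = strchr(p, '>');
        if (close != NULL) {
            p = close + 1;
        }
    }
    if (*p == ':') {
        p++;
    }
    return p;
}

/* Build a per-process file name of the form "<base>.<job>.<rank>". */
int output_name(char *buf, size_t len, const char *base, int job, int rank) {
    return snprintf(buf, len, "%s.%d.%d", base, job, rank);
}
```

A filter built on this would read total.out line by line with fgets, call parse_tag on each line, and append the returned text to the file named by output_name, so it works whatever job ids mpirun happens to assign.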

Is there some simpler way to separate the output of the two streams?

If not, would it be possible to extend the -output-filename option in
such a way that it
would also combine job-id and rank with the output file name:
  work_out.1.0
for the master's output, and
  work_out.2.0
  work_out.2.1
  work_out.2.2
  ...
for the worker's output?

Thank You
  Jody

Re: [OMPI users] Dynamic process tutorials?

2010-07-11 Thread jody

Hi Brian
When you spawn processes with MPI_Comm_spawn(), one of the arguments
will be set to an intercommunicator of the spawner and the spawnees.
You can use this intercommunicator as the communicator argument
in the MPI functions.

Jody
On Fri, Jul 9, 2010 at 5:56 PM, Brian Budge  wrote:
> Hi all -
>
> I've been looking at the dynamic process features of mpi-2.  I have managed
> to actually launch processes using spawn, but haven't seen examples for
> actually communicating once these processes are launched.  I am additionally
> interested in how processes created through multiple spawn calls can
> communicate.
>
> Does anyone know of resources that describe these topics?  My google-fu must
> not be up to par :)
>
> Thanks,
>   Brian

Re: [OMPI users] OpenMPI how large its buffer size ?

2010-07-10 Thread jody

Perhaps i misunderstand your question...
Generally, it is the user's job to provide the buffers both to send and receive.
If you call MPI_Recv, you must pass a buffer that is large enough to
hold the data sent by the
corresponding MPI_Send. I.e., if you know your sender will send
messages of 100kB,
then you must provide a buffer of size 100kB to the receiver.
If the message size is unknown at compile time, you may have to send
two messages:
first an integer which tells the receiver how large a buffer it has to
allocate, and then
the actual message (which then nicely fits into the freshly allocated buffer)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include "mpi.h"

#define SENDER       1
#define RECEIVER     0
#define TAG_LEN     77
#define TAG_DATA    78
#define MAX_MESSAGE 16

/* run with at least two processes, e.g. mpirun -np 2 */
int main(int argc, char *argv[]) {

    int num_procs;
    int rank;
    int *send_buf;
    int *recv_buf;
    int send_message_size;
    int recv_message_size;
    MPI_Status st;
    int i;

    /* initialize random numbers */
    srand(time(NULL));
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == RECEIVER) {
        /* the receiver */
        /* wait for message length */
        MPI_Recv(&recv_message_size, 1, MPI_INT, SENDER, TAG_LEN,
                 MPI_COMM_WORLD, &st);
        /* create a buffer of the required size */
        recv_buf = (int*) malloc(recv_message_size*sizeof(int));
        /* get data */
        MPI_Recv(recv_buf, recv_message_size, MPI_INT, SENDER,
                 TAG_DATA, MPI_COMM_WORLD, &st);

        printf("Receiver got %d integers:", recv_message_size);
        for (i = 0; i < recv_message_size; i++) {
            printf(" %d", recv_buf[i]);
        }
        printf("\n");

        /* clean up */
        free(recv_buf);

    } else if (rank == SENDER) {
        /* the sender */
        /* random message size */
        send_message_size = (int)((1.0*MAX_MESSAGE*rand())/(1.0*RAND_MAX));
        /* create a buffer of the required size */
        send_buf = (int*) malloc(send_message_size*sizeof(int));
        /* create random message */
        for (i = 0; i < send_message_size; i++) {
            send_buf[i] = rand();
        }

        printf("Sender has %d integers:", send_message_size);
        for (i = 0; i < send_message_size; i++) {
            printf(" %d", send_buf[i]);
        }
        printf("\n");

        /* send message size to receiver */
        MPI_Send(&send_message_size, 1, MPI_INT, RECEIVER, TAG_LEN,
                 MPI_COMM_WORLD);
        /* now send message */
        MPI_Send(send_buf, send_message_size, MPI_INT, RECEIVER,
                 TAG_DATA, MPI_COMM_WORLD);

        /* clean up */
        free(send_buf);

    }

    MPI_Finalize();
    return 0;
}

I hope this helps
  Jody


On Sat, Jul 10, 2010 at 7:12 AM, Jack Bryan  wrote:
> Dear All:
> How to find the buffer size of OpenMPI ?
> I need to transfer large data between nodes on a cluster with OpenMPI 1.3.4.
> Many nodes need to send data to the same node .
> Workers use mpi_isend, the receiver node use  mpi_irecv.
> because they are non-blocking, the messages are stored in buffers of
> senders.
> And then, the receiver collect messages from its buffer.
> If the receiver's buffer is too small, there will be truncate error.
> Any help is appreciated.
> Jack
> July 9  2010

Re: [OMPI users] Open MPI error MPI_ERR_TRUNCATE: message truncated

2010-07-08 Thread jody

Hi Jack

100 kbytes is not really a big message size. My applications
routinely exchange larger amounts.

The MPI_ERR_TRUNCATE error means that a buffer you provided to
MPI_Recv is too small
to hold the data to be received. Check the size of the data you send
and compare it with the size
of the buffer you passed to MPI_Recv.

As Zhang suggested: try to reduce your code to isolate the offending part.
Can you create a simple application with two processes exchanging data which has
the MPI_ERR_TRUNCATE problem?

Jody

On Thu, Jul 8, 2010 at 5:39 AM, Jack Bryan  wrote:
> thanks
> What if the master has to send and receive a large data package ?
> Does it have to be split into multiple parts ?
> This may increase communication overhead.
> I can use MPI_datatype to wrap it up as a specific datatype, which can carry
> the
> data.
> What if the data is very large? 1k bytes or 10 kbytes , 100 kbytes ?
> the master need to collect the same datatype from all workers.
> So, in this way, the master has to set up a data pool to get all data.
> The master's buffer provided by the MPI may not be large enough to do this.
> Are there some other ways to do it ?
> Any help is appreciated.
> thanks
> Jack
> july 7  2010
>
> 
> From: solarbik...@gmail.com
> Date: Wed, 7 Jul 2010 17:32:27 -0700
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI error MPI_ERR_TRUNCATE: message truncated
>
> This error typically occurs when the received message is bigger than the
> specified buffer size.  You need to narrow your code down to offending
> receive command to see if this is indeed the case.
>
> On Wed, Jul 7, 2010 at 8:42 AM, Jack Bryan  wrote:
>
> Dear All:
> I need to transfer some messages from workers master node on MPI cluster
> with Open MPI.
> The number of messages is fixed.
> When I increase the number of worker nodes, i got error:
> --
> terminate called after throwing an instance of
> 'boost::exception_detail::clone_impl
>>'
>   what():  MPI_Unpack: MPI_ERR_TRUNCATE: message truncated
> [n231:45873] *** Process received signal ***
> [n231:45873] Signal: Aborted (6)
> [n231:45873] Signal code:  (-6)
> [n231:45873] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0]
> [n231:45873] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3c50230215]
> [n231:45873] [ 2] /lib64/libc.so.6(abort+0x110) [0x3c50231cc0]
>
> --
> For 40 workers , it works well.
> But for 50 workers, it got this error.
> The largest message size is not more then 72 bytes.
> Any help is appreciated.
> thanks
> Jack
> July 7 2010
>
>
> --
> David Zhang
> University of California, San Diego

Re: [OMPI users] Open MPI, cannot get the results from workers

2010-07-06 Thread jody

Hi
I solved this problem in such a way that my master listens for
messages from everybody (MPI_ANY_SOURCE) and reacts to all tags
(MPI_ANY_TAG).
By looking at the status variable set by MPI_Recv, the master can find
out who sent the message (status.MPI_SOURCE) and what tag it has
(status.MPI_TAG)
and react accordingly

Jody

On Tue, Jul 6, 2010 at 7:41 AM, David Zhang  wrote:
> if the master receives multiple results from the same worker, how does the
> master know which result (and the associated tag) arrive first? what MPI
> commands are you using exactly?
>
> On Mon, Jul 5, 2010 at 4:25 PM, Jack Bryan  wrote:
>>
>> When the master sends out the task, it assign a distinct task number ID
>> to
>> the task.
>> When the worker receive the task, it  still use the task's assigned ID as
>> task tag to send it to master.
>> Any help is appreciated.
>> July 5 2010
>>
>>
>>
>> 
>> From: solarbik...@gmail.com
>> Date: Mon, 5 Jul 2010 13:17:27 -0700
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] Open MPI, cannot get the results from workers
>>
>> how does the master receive results from the workers? if a worker is
>> sending multiple task results, how does the master knows what the message
>> tags are ahead of time?
>>
>> On Sun, Jul 4, 2010 at 10:26 AM, Jack Bryan 
>> wrote:
>>
>> Dear All :
>> I designed a master-worker framework, in which the master can schedule
>> multiple tasks (numTaskPerWorkerNode) to each worker and then collects
>> results from workers.
>> if the numTaskPerWorkerNode = 1, it works well.
>> But, if numTaskPerWorkerNode > 1, the master cannot get the results from
>> workers.
>> But, the workers can get the tasks from master.
>> why ?
>>
>> I have used different taskTag to distinguish the tasks, but still does not
>> work.
>> Any help is appreciated.
>> Thanks,
>> Jack
>> July 4  2010
>>
>>
>> --
>> David Zhang
>> University of California, San Diego
>
>
>
> --
> David Zhang
> University of California, San Diego

Re: [OMPI users] Open MPI task scheduler

2010-06-21 Thread jody

Hi

I think your problem can be solved easily on the MPI level.
Just have your manager execute a loop in which it waits for any message.
Define different message types by their MPI tags. Once a message
has been received, decide what to do by looking at the tag.

Here i assume that a worker with no job sends a message with the tag
TAG_TASK_REQUEST and then waits to receive a message from the master
with either a new task or the command to exit.
Once a worker has finished a task it sends a message with the tag TAG_RESULT,
and then sends a message containing the result.
I also assume that new tasks can be sent from a different node by using
the tag TAG_NEW_TASK.

The main loop in the Master would be:

while (more_tasks) {
  /* wait for a message from any sender, with any tag */
  MPI_Recv(&a, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
  switch (st.MPI_TAG) {
    case TAG_TASK_REQUEST:
      sendNextTask(st.MPI_SOURCE);
      break;
    case TAG_RESULT:
      /* the result itself follows in a TAG_RESULT_CONTENT message */
      collectResult(st.MPI_SOURCE);
      break;
    case TAG_NEW_TASK:
      putNewTaskOnQueue(st.MPI_SOURCE);
      break;
  }
}


In a worker:

while (go_on) {
  /* ask the master for work */
  MPI_Send(&a, 1, MPI_INT, idMaster, TAG_TASK_REQUEST, MPI_COMM_WORLD);
  MPI_Recv(&TaskDef, 1, TaskType, idMaster, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
  if (st.MPI_TAG == TAG_STOP) {
    go_on = false;
  } else {
    result = workOnTask(TaskDef, TaskLen);
    /* announce the result, then send its content */
    MPI_Send(&a, 1, MPI_INT, idMaster, TAG_RESULT, MPI_COMM_WORLD);
    MPI_Send(&result, 1, resultType, idMaster, TAG_RESULT_CONTENT, MPI_COMM_WORLD);
  }
}

I hope this helps
  Jody

On Mon, Jun 21, 2010 at 12:17 AM, Jack Bryan  wrote:
> Hi,
> thank you very much for your help.
> What is the meaning of "must find a system so that every task can be
> serialized in the same form"? What does "serialize" mean?
> I have no experience of programming with Python and XML.
> I have studied your blog.
> Where can I find a simple example that uses the techniques you mentioned?
> For example, I have 5 tasks (print "hello world !").
> I want to use 6 processors to do them in parallel.
> One processor is the manager node, which distributes the tasks, and the
> other 5 processors do the printing jobs; when they are done, they report
> this to the manager node.
>
> Boost.Asio is a cross-platform C++ library for network and low-level I/O
> programming. I have no experience using it. Will it take a long time to
> learn how to use it?
> If the messages are transferred by SOAP+TCP, how does the manager node call
> it and push tasks into it?
> Do I need to install SOAP+TCP on my cluster so that I can use it?
>
> Any help is appreciated.
> Jack
> June 20  2010
>> Date: Sun, 20 Jun 2010 21:00:06 +0200
>> From: matthieu.bruc...@gmail.com
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] Open MPI task scheduler
>>
>> 2010/6/20 Jack Bryan :
>> > Hi, Matthieu:
>> > Thanks for your help.
>> > Most of your ideas show that what I want to do.
>> > My scheduler should be able to be called from any C++ program, which can
>> > put
>> > a list of tasks to the scheduler and then the scheduler distributes the
>> > tasks to other client nodes.
>> > It may work like in this way:
>> > while(still tasks available) {
>> > myScheduler.push(tasks);
>> > myScheduler.get(tasks results from client nodes);
>> > }
>>
>> Exactly. In your case, you want only one server, so you must find a
>> system so that every task can be serialized in the same form. The
>> easiest way to do so is to serialize your parameter set as an XML
>> fragment and add the type of task as another field.
>>
>> > My cluster has 400 nodes with Open MPI. The tasks should be transferred
>> > by MPI protocol.
>>
>> No, they should not ;) MPI can be used, but it is not the easiest way
>> to do so. You still have to serialize your ticket, and you have to use
>> some functions that are from MPI2 (so perhaps not as portable as MPI1
>> functions). Besides, it cannot be used from programs that do not know
>> of using MPI protocols.
>>
>> > I am not familiar with  RPC Protocol.
>>
>> RPC is not a protocol per se. SOAP is. RPC stands for Remote Procedure
>> Call. It is basically your scheduler that has several functions
>> clients can call:
>> - add tickets
>> - retrieve ticket
>> - ticket is done
>>
>> > If I use Boost.ASIO and some Python/GCCXML script to generate the code,
>> > it
>> > can be
>> > called from C++ program on Open MPI cluster ?
>>
>> Yes, SOAP is just an XML way of representing the fact that you call a
>> function on the server. You can use it with C++, Java, ... I use it
>> with Python to monitor

Re: [OMPI users] Allgather in inter-communicator bug,

2010-05-20 Thread jody

Hi
I am really no python expert, but it looks to me as if you were
gathering arrays filled with zeroes:
  a = array('i', [0]) * n

Shouldn't this line be
  a = array('i', [r])*n
where r is the rank of the process?

Jody


On Thu, May 20, 2010 at 12:00 AM, Battalgazi YILDIRIM
 wrote:
> Hi,
>
>
> I am trying to use intercommunicator ::Allgather between two child processes.
> I have Fortran and Python code,
> and I am using mpi4py for Python. It seems that ::Allgather is not working
> properly on my desktop.
>
> I first contacted the mpi4py developers (Lisandro Dalcin); he simplified
> my problem and provided two example files (python.py and fortran.f90,
> please see below).
>
> We tried different MPI vendors, and with them the following example worked
> correctly (that is, the final printout should be
> array('i', [1, 2, 3, 4, 5, 6, 7, 8])).
>
> However, it does not give the correct answer on my two desktops (Red Hat and
> Ubuntu), both using Open MPI.
>
> Could you look at this problem, please?
>
> If you want to follow our earlier discussion, you can go to the following
> link:
> http://groups.google.com/group/mpi4py/browse_thread/thread/c17c660ae56ff97e
>
> yildirim@memosa:~/python_intercomm$ more python.py
> from mpi4py import MPI
> from array import array
> import os
>
> progr = os.path.abspath('a.out')
> child = MPI.COMM_WORLD.Spawn(progr,[], 8)
> n = child.remote_size
> a = array('i', [0]) * n
> child.Allgather([None,MPI.INT],[a,MPI.INT])
> child.Disconnect()
> print a
>
> yildirim@memosa:~/python_intercomm$ more fortran.f90
> program main
>  use mpi
>  implicit none
>  integer :: parent, rank, val, dummy, ierr
>  call MPI_Init(ierr)
>  call MPI_Comm_get_parent(parent, ierr)
>  call MPI_Comm_rank(parent, rank, ierr)
>  val = rank + 1
>  call MPI_Allgather(val,   1, MPI_INTEGER, &
>                     dummy, 0, MPI_INTEGER, &
>                     parent, ierr)
>  call MPI_Comm_disconnect(parent, ierr)
>  call MPI_Finalize(ierr)
> end program main
>
> yildirim@memosa:~/python_intercomm$ mpif90 fortran.f90
>
> yildirim@memosa:~/python_intercomm$ python python.py
> array('i', [0, 0, 0, 0, 0, 0, 0, 0])
>
>
> --
> B. Gazi YILDIRIM
>
>

Re: [OMPI users] How to show outputs from MPI program that runs on a cluster?

2010-05-20 Thread jody

Hi
mpirun has an option for this (check the mpirun man page):

   -tag-output, --tag-output
          Tag each line of output to stdout, stderr, and stddiag with
          [jobid, rank] indicating the process jobid and rank that
          generated the output, and the channel which generated it.

Using this you can filter the entire output by grepping for the required rank.

Another possibility is to use the option
   -xterm, --xterm <ranks>
          Display the specified ranks in separate xterm windows. The ranks
          are specified as a comma-separated list of ranges, with a -1
          indicating all. A separate window will be created for each
          specified rank. Note: In some environments, xterm may require
          that the executable be in the user’s path, or be specified in
          absolute or relative terms. Thus, it may be necessary to specify
          a local executable as "./foo" instead of just "foo". If xterm
          fails to find the executable, mpirun will hang, but still respond
          correctly to a ctrl-c. If this happens, please check that the
          executable is being specified correctly and try again.

That way you can open a single terminal window for the process you are
interested in.


Jody


On Thu, May 20, 2010 at 1:28 AM, Sang Chul Choi  wrote:
> Hi,
>
> I am wondering if there is a way to run a particular process among multiple 
> processes on the console of a linux cluster.
>
> I want to see the screen output (standard output) of a particular process 
> (using a particular ID of a process) on the console screen while the MPI 
> program is running.  I think that if I run a MPI program on a linux cluster 
> using Sun Grid Engine, the particular process that prints out to standard 
> output could run on the console or computing node.   And, it would be hard to 
> see screen output of the particular process.  Is there a way to set one
> process aside and run it on the console in Sun Grid Engine?
>
> When I run the MPI program on my desktop with quad cores, I can set aside one 
> process using an ID to print information that I need.  I do not know how I 
> could do that in much larger scale like using Sun Grid Engine.  I could let 
> one process print out in a file and then I could see it.  I do not know how I 
> could let one process to print out on the console screen by setting it to run 
> on the console using Sun Grid Engine or any other similar thing such as PBS.  
> I doubt that a cluster would allow jobs to run on the console because then
> other users would have trouble submitting jobs.  If this is the
> case, there seems to be no way to print out on the console.   Then, do I have to
> have a separate (non-MPI) program that can communicate with MPI program using 
> TCP/IP by running the separate program on the master node of a cluster?  This 
> separate non-MPI program may then communicate sporadically with the MPI 
> program.  I do not know if this is a general approach or a peculiar way.
>
> I will appreciate any of input.
>
> Thank you,
>
> Sang Chul
>
>
>

Re: [OMPI users] Dynamic libraries in OpenMPI

2010-05-12 Thread jody

Just to be sure:
Is there a copy of  the shared library on the other host (hpcnode1) ?

jody

On Mon, May 10, 2010 at 5:20 PM, Prentice Bisbal  wrote:
> Are you running these jobs through a queuing system like PBS, Torque, or SGE?
>
> Prentice
>
> Miguel Ángel Vázquez wrote:
>> Hello Prentice,
>>
>> Thank you for your advice but that doesn't solve the problem.
>>
>> The non-login bash updates properly the $LD_LIBRARY_PATH value.
>>
>> Any other idea?
>>
>> Thanks,
>>
>> Miguel
>>
>> 2010/5/7 Prentice Bisbal <prent...@ias.edu>
>>
>>
>>
>>     Miguel Ángel Vázquez wrote:
>>     > Dear all,
>>     >
>>     > I am trying to run a C++ program which uses dynamic libraries under MPI.
>>     >
>>     > The compilation command looks like:
>>     >
>>     >  mpiCC `pkg-config --cflags itpp` -o montecarlo montecarlo.cpp `pkg-config --libs itpp`
>>     >
>>     > And it works if I execute it on one machine:
>>     >
>>     > mpirun -np 2 -H localhost montecarlo
>>     >
>>     > I tested this both in the "master node" and in the "compute nodes" and
>>     > it works. However, when I try to run it with two different machines:
>>     >
>>     > mpirun -np 2 -H localhost,hpcnode1 montecarlo
>>     >
>>     > The program claims that it can't find the shared libraries:
>>     >
>>     > montecarlo: error while loading shared libraries: libitpp.so.6: cannot
>>     > open shared object file: No such file or directory
>>     >
>>     > The LD_LIBRARY_PATH is set properly on every machine. Any idea where
>>     > the problem is? I attached the config.log and the output of
>>     > ompi_info --all
>>     >
>>     > Thank you in advance,
>>     >
>>     > Miguel
>>
>>     Miguel,
>>
>>     Shells behave differently depending on whether it is an interactive
>>     login shell or a non-interactive shell. For example, the bash shell uses
>>     .bash_profile in one case, but .bashrc in the other. Check the documentation
>>     for your shell and see what files it uses in each case, and make sure
>>     the non-login config file has the necessary settings for your MPI jobs.
>>      It sounds like your login shell environment is okay, but your non-login
>>     environment isn't setup correctly. This is a common problem.
>>
>>     I use bash, and to keep it simple, my .bash_profile is just a symbolic
>>     link to .bashrc. That way, both shell types have the same environment.
>>     This isn't always a good idea, but in my case it's fine.
>>
>>     --
>>     Prentice
>

Re: [OMPI users] open-mpi behaviour on Fedora, Ubuntu, Debian and CentOS

2010-04-26 Thread jody

Hi Asad

I must admit i don't know how one can find out whether extended precision
is being used or not.
I think one has to read up on the CPU's information.
I only know that most Intel 32bit-Processors use the extended precision
  http://en.wikipedia.org/wiki/X86
as does AMD Athlon

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/fpu_wp.pdf
but i think AMD Opteron does not.
But i am no expert in this area - i only found out about this when i
mentioned to someone
the differences in the results obtained from a 32Bit platform and a
64bit platform. Sorry.
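
One way to check this from C, as a minimal sketch: C99's <float.h> defines FLT_EVAL_METHOD, which reports how the compiler evaluates floating-point expressions, and LDBL_MANT_DIG, the number of significand bits in long double (64 for the x87 80-bit format). The function names below are made up for illustration, and what the macros report depends on the compiler and its flags, not only on the CPU.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Describe a FLT_EVAL_METHOD value as defined by C99 (<float.h>). */
const char *describe_eval_method(int method)
{
    switch (method) {
    case 0:  return "operations are evaluated in the precision of their own type";
    case 1:  return "float and double operations are evaluated as double";
    case 2:  return "all operations are evaluated as long double";
    default: return "evaluation method is indeterminable or implementation-defined";
    }
}

/* Print what this compiler/platform combination reports. */
void print_fp_report(void)
{
    /* long double is never less precise than double */
    assert(LDBL_MANT_DIG >= DBL_MANT_DIG);
    printf("FLT_EVAL_METHOD = %d: %s\n",
           (int)FLT_EVAL_METHOD, describe_eval_method((int)FLT_EVAL_METHOD));
    printf("significand bits: double = %d, long double = %d\n",
           DBL_MANT_DIG, LDBL_MANT_DIG);
}
```

Calling print_fp_report() from main on each machine and comparing the output is probably the quickest test. A value of 2 together with 64 significand bits for long double would point to x87 extended precision being used for intermediate results.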

Jody

On Mon, Apr 26, 2010 at 4:33 AM, Asad Ali  wrote:
> Hi Jodi,
>
>> I once got different results when running on a 64-Bit platform instead of
>> a 32 bit platform - if i remember correctly, the reason was that on the
>> 32-bit platform 80bit extended precision floats were used but on the 64bit
>> platform only 64bit floats.
>
> Could you please give me an idea as how to check this extended precision.
> Also I don't use float rather I use only double or long double.
>
> Cheers,
>
> Asad
> --
> "Statistical thinking will one day be as necessary for efficient citizenship
> as the ability to read and write." - H.G. Wells
>
>

Re: [OMPI users] open-mpi behaviour on Fedora, Ubuntu, Debian and CentOS

2010-04-25 Thread jody

I once got different results when running on a 64-Bit platform instead of
a 32 bit platform - if i remember correctly, the reason was that on the
32-bit platform 80bit extended precision floats were used but on the 64bit
platform only 64bit floats.
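
A small sketch of how the width of the intermediate results can change an answer, assuming IEEE 754 doubles: adding 1 to 1e16 is lost when the sum is rounded to a 53-bit significand, but survives when the significand has 64 bits or more. The function names are hypothetical, and the volatile qualifiers are only there to force each intermediate to be stored in the declared type.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* (1e16 + 1) - 1e16 with every intermediate rounded to double. */
double diff_double(void)
{
    volatile double big = 1e16;
    volatile double sum = big + 1.0;  /* the store rounds to 53 significand bits */
    return sum - big;
}

/* The same expression carried out in long double. */
long double diff_long_double(void)
{
    volatile long double big = 1e16L;
    volatile long double sum = big + 1.0L;
    return sum - big;
}

/* Print both results side by side. */
void print_diffs(void)
{
    assert(DBL_MANT_DIG == 53);  /* the sketch assumes IEEE 754 double */
    printf("double:      %g\n", diff_double());
    printf("long double: %Lg\n", diff_long_double());
}
```

On a platform where long double is no wider than double, both functions should return the same value, so the comparison only says something where LDBL_MANT_DIG is at least 64.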



On Sun, Apr 25, 2010 at 3:39 AM, Fabian Hänsel  wrote:
> Hi Asad,
>
>> I found that running the same source code on these OS, with the same
>> versions of gcc and open-mpi installed on them, gives different results
>> than Fedora and Ubuntu after a few hundred iterations. The first
>> few hundred iterations are exactly the same as those of Fedora and Ubuntu
>> but then it starts giving different results.
>
> Are you also using the same hardware? Different hardware platforms may
> exhibit different rounding behaviour. After some dependent iterations such
> effects might indeed add up and yield different results. The issue is called
> numeric (in)stability and is not specifically related to openmpi.
>
> Best regards
>  Fabian
>

Re: [OMPI users] Error on sending argv

2010-04-20 Thread jody

Hi
You should remove the "&" for the first parameters of your MPI_Send
and MPI_Recv:

MPI_Send(text, strlen(text) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);

MPI_Recv(buffer, 128, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
MPI_COMM_WORLD, &status);

In C/C++ the name of an array is a pointer to the start of the array,
and 'text' and 'buffer' are already pointers to the characters
(however, i can't exactly explain why it worked with the hard-coded string).
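
The difference between passing the pointer and passing its address can be sketched without MPI. In the sketch below, copy_like_send is a hypothetical stand-in that reads n bytes from its buffer argument, which is what a send routine does with the buffer it is given. A possible explanation for the hard-coded case, which the thread does not confirm, is that only the pointer value was transmitted and that the literal happens to sit at the same address in both processes because they run the same binary; a stack address such as argv[1] need not be valid on the receiving side.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a send: reads n bytes starting at src. */
void copy_like_send(const void *src, void *dst, size_t n)
{
    memcpy(dst, src, n);
}

/* Passing `text` transfers the characters. Returns 1 on success. */
int sends_characters(const char *text)
{
    char out[64] = {0};
    size_t n = strlen(text) + 1;
    assert(n <= sizeof out);
    copy_like_send(text, out, n);
    return strcmp(out, text) == 0;
}

/* Passing `&text` transfers the bytes of the pointer variable itself.
 * Returns 1 if what arrived is just the pointer value. */
int sends_pointer_value(const char *text)
{
    const char *arrived = NULL;
    copy_like_send(&text, &arrived, sizeof arrived);
    return arrived == text;
}
```

Both functions return 1 for a call such as sends_characters("ABCDEF"), which is the point: with &text the receiver ends up with an address, not with the string.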

Jody


On Mon, Apr 19, 2010 at 6:31 PM, Andrew Wiles  wrote:
> Hi all Open MPI users,
>
> I write a simple MPI program to send a text message to another process. The
> code is below.
>
> (test.c)
>
> #include "mpi.h"
>
> #include 
>
> #include 
>
> #include 
>
>
>
> int main(int argc, char* argv[]) {
>
>     int dest, noProcesses, processId;
>
>     MPI_Status status;
>
>
>
>     char* buffer;
>
>
>
>     char* text = "ABCDEF";
>
>
>
>     MPI_Init(&argc, &argv);
>
>     MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
>
>     MPI_Comm_rank(MPI_COMM_WORLD, &processId);
>
>
>
>     buffer = (char*) malloc(256 * sizeof(char));
>
>
>
>     if (processId == 0) {
>
>       fprintf(stdout, "Master: sending %s to %d\n", text, 1);
>
>       MPI_Send((void *)&text, strlen(text) + 1, MPI_CHAR, 1, 0,
> MPI_COMM_WORLD);
>
>     } else {
>
>       MPI_Recv(&buffer, 128, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
> MPI_COMM_WORLD, &status);
>
>       fprintf(stdout, "Slave: received %s from %d\n", buffer,
> status.MPI_SOURCE);
>
>     }
>
>     MPI_Finalize();
>
>     return 0;
>
> }
>
> After compiling and executing it I get the following output:
>
> [root@cluster Desktop]# mpicc -o test test.c
>
> [root@cluster Desktop]# mpirun -np 2 test
>
> Master: sending ABCDEF to 1
>
> Slave: received ABCDEF from 0
>
>
>
> In the source code above, I replace
>
> char* text = "ABCDEF";
>
> by
>
> char* text = argv[1];
>
> then compile and execute it again with the following commands:
>
> [root@cluster Desktop]# mpicc -o test test.c
>
> [root@cluster Desktop]# mpirun -np 2 test ABCDEF
>
> Then I get the following output:
>
> Master: sending ABCDEF to 1
>
> [cluster:03917] *** Process received signal ***
>
> [cluster:03917] Signal: Segmentation fault (11)
>
> [cluster:03917] Signal code: Address not mapped (1)
>
> [cluster:03917] Failing at address: 0xbfa445a2
>
> [cluster:03917] [ 0] [0x959440]
>
> [cluster:03917] [ 1] /lib/libc.so.6(_IO_fprintf+0x22) [0x76be02]
>
> [cluster:03917] [ 2] test(main+0x143) [0x80488b7]
>
> [cluster:03917] [ 3] /lib/libc.so.6(__libc_start_main+0xdc) [0x73be8c]
>
> [cluster:03917] [ 4] test [0x80486c1]
>
> [cluster:03917] *** End of error message ***
>
> --
>
> mpirun noticed that process rank 1 with PID 3917 on node cluster.hpc.org
> exited on signal 11 (Segmentation fault).
>
> --
>
> I’m very confused because the only difference between the two source codes
> is the difference between
>
> char* text = "ABCDEF";
>
> and
>
> char* text = argv[1];
>
> Can any one help me why the results are so different? How can I send argv[i]
> to another process?
>
> Thank you very much!
>
>

Re: [OMPI users] Help om Openmpi

2010-04-06 Thread jody

@Trent
> the 1024 RSA has already been cracked.
Yeah but unless you've got 3 guys spending 100 hours varying the
voltage of your processors
it is still safe... :)


On Tue, Apr 6, 2010 at 11:35 AM, Reuti  wrote:
> Hi,
>
> Am 06.04.2010 um 09:48 schrieb Terry Frankcombe:
>
>>>   1. Run the following command on the client
>>>          * -> ssh-keygen -t dsa
>>>   2. File id_dsa and id_dsa.pub will be created inside $HOME/.ssh
>>>   3. Copy id_dsa.pub to the server's .ssh directory
>>>          * -> scp $HOME/.ssh/id_dsa.pub user@server:/home/user/.ssh
>>>   4. Change to /root/.ssh and create file authorized_keys containing
>>> id_dsa content
>>>          * -> cd /home/user/.ssh
>>>          * -> cat id_dsa >> authorized_keys
>>>   5. You can try ssh to the server from the client and no password
>>> will be needed
>>>          * -> ssh user@server
>>
>> That prescription is a little messed up.  You need to create id_dsa and
>> id_dsa.pub on the client, as above.
>>
>> But it is the client's id_dsa.pub that needs to go
>> into /home/user/.ssh/authorized_keys on the server, which seems to be
>> not what the above recipe does.
>>
>> If that doesn't help, try adding -v or even -v -v to the ssh command to
>> see what the connection is trying to do w.r.t. your keys.
>
> Inside a cluster I suggest hostbased authentication: no keys for the user, a
> commonly used ssh_known_hosts file and a central place to look for errors.
>
> I just dislike passphraseless ssh keys, as they tempt the user to copy them to
> all remote locations (especially the private part) to get more comfort while
> using ssh between two remote clusters; using an ssh-agent would in this
> case be a more secure option.
>
> -- Reuti
>
>
>
>
>

Re: [OMPI users] openMPI on Xgrid

2010-03-30 Thread Jody Klymak



On Mar 30, 2010, at  11:12 AM, Cristobal Navarro wrote:


i just have some questions,
Torque requires moab, but from what i've read on the site you have  
to buy moab right?


I am pretty sure you can download torque w/o moab.  I do not use moab,  
which I think is a higher-level scheduling layer on top of pbs.   
However, there are folks here who would know far more than I do about  
these sorts of things.


Cheers,  Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/

Re: [OMPI users] openMPI on Xgrid

2010-03-29 Thread Klymak Jody



I have an environment a few trusted users could use to test.  However,
I have neither the expertise nor the time to do the debugging myself.


Cheers,  Jody

On 2010-03-29, at 1:27 PM, Jeff Squyres wrote:


On Mar 29, 2010, at 4:11 PM, Cristobal Navarro wrote:


i realized that the xcode dev tools include openMPI 1.2.x
should i keep trying??
or do you recommend abandoning xgrid completely and going for another
tool like Torque with openMPI?


FWIW, Open MPI v1.2.x is fairly ancient -- the v1.4 series includes  
a few years worth of improvements and bug fixes since the 1.2 series.


It would be great (hint hint) if someone could fix the xgrid support  
for us...  We simply no longer have anyone in the active development  
group who has the expertise or test environment to make our xgrid  
work.  :-(


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] openMPI on Xgrid

2010-03-29 Thread Jody Klymak



On Mar 29, 2010, at  12:39 PM, Ralph Castain wrote:



On Mar 29, 2010, at 1:34 PM, Cristobal Navarro wrote:


thanks for the information,

but is it possible to make it work with xgrid, or does the 1.4.1 version
just not support it?




FWIW, I've had excellent success with Torque and openmpi on OS-X 10.5  
Server.


http://www.clusterresources.com/products/torque-resource-manager.php

It doesn't have a nice dashboard, but the queue tools are more than  
adequate for my needs.


Open MPI had a funny port issue on my setup that folks helped with

From my notes:

Edited /Network/Xgrid/openmpi/etc/openmpi-mca-params.conf to make sure
that the right ports are used:


# set ports so that they are more valid than the default ones (see  
email from Ralph Castain)

btl_tcp_port_min_v4 = 36900
btl_tcp_port_range  = 32


Cheers,  Jody


--
Jody Klymak
http://web.uvic.ca/~jklymak/

Re: [OMPI users] Segmentation fault (11)

2010-03-27 Thread jody

I'm not sure if this is the cause of your problems:
You define the constant BUFFER_SIZE, but in the code you use a constant
called BUFSIZ...
Jody
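
For reference, BUFSIZ is a standard macro from <stdio.h>, so the posted code compiles even though it never uses its own BUFFER_SIZE; the two constants simply need not have the same value. The sketch below compares the two array sizes. The function names are hypothetical, and the fallback value for PIPE_BUF is an assumption for systems whose <limits.h> does not define it. The backtrace in the quoted message points into popen and libcr, so the mismatch may well be unrelated to the crash.

```c
#include <assert.h>
#include <limits.h>   /* PIPE_BUF on POSIX systems */
#include <stdio.h>    /* BUFSIZ */

#ifndef PIPE_BUF
#define PIPE_BUF 512  /* assumption: fall back to the POSIX minimum */
#endif

#define BUFFER_SIZE PIPE_BUF

/* Size of the array declared as buffer[BUFSIZ + 1] in the posted code. */
size_t posted_buffer_bytes(void)
{
    char buffer[BUFSIZ + 1];
    return sizeof buffer;
}

/* Size the array would have if the program's own constant were used. */
size_t intended_buffer_bytes(void)
{
    char buffer[BUFFER_SIZE + 1];
    return sizeof buffer;
}

/* Print both constants for comparison. */
void print_buffer_sizes(void)
{
    assert(BUFSIZ >= 256);  /* minimum required by the C standard */
    printf("BUFSIZ = %d, BUFFER_SIZE (PIPE_BUF) = %d\n",
           (int)BUFSIZ, (int)BUFFER_SIZE);
}
```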


On Fri, Mar 26, 2010 at 10:29 PM, Jean Potsam wrote:

> Dear All,
>   I am having a problem with openmpi . I have installed openmpi
> 1.4 and blcr 0.8.1
>
> I have written a small mpi application as follows below:
>
> ###
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include
> #include 
> #include 
>
> #define BUFFER_SIZE PIPE_BUF
>
> char * getprocessid()
> {
>   FILE * read_fp;
>   char buffer[BUFSIZ + 1];
>   int chars_read;
>   char * buffer_data="12345";
>   memset(buffer, '\0', sizeof(buffer));
>   read_fp = popen("uname -a", "r");
>   /*
>    ...
>   */
>   return buffer_data;
> }
>
> int main(int argc, char ** argv)
> {
>   MPI_Status status;
>   int rank;
>   int size;
>   char * thedata;
>   MPI_Init(&argc, &argv);
>   MPI_Comm_size(MPI_COMM_WORLD,&size);
>   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>   thedata=getprocessid();
>   printf(" the data is %s", thedata);
>   MPI_Finalize();
> }
> 
>
> I get the following result:
>
> ###
> jean@sunn32:~$  mpicc pipetest2.c -o pipetest2
> jean@sunn32:~$  mpirun -np 1 -am ft-enable-cr -mca btl ^openib  pipetest2
> [sun32:19211] *** Process received signal ***
> [sun32:19211] Signal: Segmentation fault (11)
> [sun32:19211] Signal code: Address not mapped (1)
> [sun32:19211] Failing at address: 0x4
> [sun32:19211] [ 0] [0xb7f3c40c]
> [sun32:19211] [ 1] /lib/libc.so.6(cfree+0x3b) [0xb796868b]
> [sun32:19211] [ 2] /usr/local/blcr/lib/libcr.so.0(cri_info_free+0x2a)
> [0xb7a5925a]
> [sun32:19211] [ 3] /usr/local/blcr/lib/libcr.so.0 [0xb7a5ac72]
> [sun32:19211] [ 4] /lib/libc.so.6(__libc_fork+0x186) [0xb7991266]
> [sun32:19211] [ 5] /lib/libc.so.6(_IO_proc_open+0x7e) [0xb7958b6e]
> [sun32:19211] [ 6] /lib/libc.so.6(popen+0x6c) [0xb7958dfc]
> [sun32:19211] [ 7] pipetest2(getprocessid+0x42) [0x8048836]
> [sun32:19211] [ 8] pipetest2(main+0x4d) [0x8048897]
> [sun32:19211] [ 9] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7912455]
> [sun32:19211] [10] pipetest2 [0x8048761]
> [sun32:19211] *** End of error message ***
> #
>
>
> However, If I compile the application using gcc, it works fine. The problem
> arises with:
>   read_fp = popen("uname -a", "r");
>
> Does anyone have an idea how to resolve this problem?
>
> Many thanks
>
> Jean
>
>
>

Re: [OMPI users] problems on parallel writing

2010-02-25 Thread jody

Hi
Just wanted to let you know:

I translated your program to C and ran it, and it crashed at MPI_FILE_SET_VIEW
in much the same way as yours did.
Then i added an if-clause to prevent the call of MPI_FILE_WRITE with
the undefined value:
if (myid == 0) {
  MPI_File_write(fh, temp, count, MPI_DOUBLE, &status);
}
After this it ran without a crash.
However, the output was not what you expected:
the number 2122010.0 was not there - probably overwritten by the
MPI_FILE_WRITE_ALL.
This was fixed by replacing the line
  disp=0
by
  disp=8
and removing the
  if (single_no .gt. 0) map = map + 1
statement.
So here's what all looks like:
===
program test_MPI_write_adv2

  !-- Template for any mpi program

  implicit none

  !--Include the mpi header file
  include 'mpif.h'  ! --> Required statement

  !--Declare all variables and arrays.
  integer :: fh, ierr, myid, numprocs, itag, etype, filetype, info
  integer :: status(MPI_STATUS_SIZE)
  integer :: irc, ip
  integer(kind=mpi_offset_kind) :: offset, disp
  integer :: i, j, k

  integer :: num

  character(len=64) :: filename

  real(8), pointer :: q(:), temp(:)
  integer, pointer :: map(:)
  integer :: single_no, count

  !--Initialize MPI
  call MPI_INIT( ierr ) ! --> Required statement

  !--Who am I? --- get my rank=myid
  call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )

  !--How many processes in the global group?
  call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

  if ( myid == 0 ) then
 single_no = 4
  elseif ( myid == 1 ) then
 single_no = 2
  elseif ( myid == 2 ) then
 single_no = 2
  elseif ( myid == 3 ) then
 single_no = 3
  else
 single_no = 0
  end if

  if (single_no .gt. 0) allocate(map(single_no))

  if ( myid == 0 ) then
 map = (/ 0, 2, 5, 6 /)
  elseif ( myid == 1 ) then
 map = (/ 1, 4 /)
  elseif ( myid == 2 ) then
 map = (/ 3, 9 /)
  elseif ( myid == 3 ) then
 map = (/ 7, 8, 10 /)
  end if

  if (single_no .gt. 0) allocate(q(single_no))

  if (single_no .gt. 0) then
     do i = 1,single_no
        q(i) = dble(myid+1)*100.0d0 + dble(map(i)+1)
     end do
  end if

  if ( myid == 0 ) then
 count = 1
  else
 count = 0
  end if

  if (count .gt. 0) then
 allocate(temp(count))
 temp(1) = 2122010.0d0
  end if

  write(filename,'(a)') 'test_write.bin'

  call MPI_FILE_OPEN(MPI_COMM_WORLD, filename, &
       MPI_MODE_RDWR+MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)

  if (myid == 0) then
     call MPI_FILE_WRITE(fh, temp, count, MPI_REAL8, status, ierr)
  endif

  call MPI_TYPE_CREATE_INDEXED_BLOCK(single_no, 1, map, &
       MPI_DOUBLE_PRECISION, filetype, ierr)
  call MPI_TYPE_COMMIT(filetype, ierr)
  disp = 8  ! ---> size of MPI_REAL8 (number written when myid = 0)
  call MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION, filetype, &
       'native', MPI_INFO_NULL, ierr)
  call MPI_FILE_WRITE_ALL(fh, q, single_no, MPI_DOUBLE_PRECISION, status, ierr)
  call MPI_FILE_CLOSE(fh, ierr)
  call MPI_FILE_CLOSE(fh, ierr)

  if (single_no .gt. 0) deallocate(map)

  if (single_no .gt. 0) deallocate(q)

  if (count .gt. 0) deallocate(temp)

  !--Finalize MPI
  call MPI_FINALIZE(irc)! ---> Required statement

  stop

end program test_MPI_write_adv2
===

Regards
  jody

On Thu, Feb 25, 2010 at 2:47 AM, Terry Frankcombe  wrote:
> On Wed, 2010-02-24 at 13:40 -0500, w k wrote:
>> Hi Jordy,
>>
>> I don't think this part caused the problem. For fortran, it doesn't
>> matter if the pointer is NULL as long as the count requested from the
>> processor is 0. Actually I tested the code and it passed this part
>> without problem. I believe it aborted at MPI_FILE_SET_VIEW part.
>>
>
> For the record:  A pointer is not NULL unless you've nullified it.
> IIRC, the Fortran standard says that any non-assigning reference to an
> unassigned, unnullified pointer is undefined (or maybe illegal... check
> the standard).
>
>
>

Re: [OMPI users] problems on parallel writing

2010-02-24 Thread jody

Hi
I can't answer your question about the array q offhand,
but i will try to translate your program to C and see if
it fails the same way.

Jody


On Wed, Feb 24, 2010 at 7:40 PM, w k  wrote:
> Hi Jordy,
>
> I don't think this part caused the problem. For fortran, it doesn't matter
> if the pointer is NULL as long as the count requested from the processor is
> 0. Actually I tested the code and it passed this part without problem. I
> believe it aborted at MPI_FILE_SET_VIEW part.
>
> Just curious, how does C handle the case that we need to collect data in
> array q but only part of the processors has q with a length greater than 0?
>
> Thanks for your reply,
> Kan
>
>
>
>
> On Wed, Feb 24, 2010 at 2:29 AM, jody  wrote:
>>
>> Hi
>> I know nearly nothing about fortran
>> but it looks to me as  the pointer 'temp' in
>>
>> > call MPI_FILE_WRITE(FH, temp, COUNT, MPI_REAL8, STATUS, IERR)
>>
>> is not defined (or perhaps NULL?) for all processors except processor 0 :
>>
>> > if ( myid == 0 ) then
>> >     count = 1
>> >  else
>> >     count = 0
>> >  end if
>> >
>> > if (count .gt. 0) then
>> >     allocate(temp(count))
>> >     temp(1) = 2122010.0d0
>> >  end if
>>
>> In C/C++ something like this would almost certainly lead to a crash,
>> but i don't know if this would be the case in Fortran...
>> jody
>>
>>
>> On Wed, Feb 24, 2010 at 4:38 AM, w k  wrote:
>> > Hello everyone,
>> >
>> >
>> > I'm trying to implement some functions in my code using parallel writing.
>> > Each processor has an array, say q, whose length is single_no (could be zero
>> > on some processors). I want to write q down to a common file, but the
>> > elements of q would be scattered to their locations in this file. The
>> > locations of the elements are described by a map. I wrote my testing code
>> > according to an example in a MPI-2 tutorial which can be found here:
>> > www.npaci.edu/ahm2002/ahm_ppt/Parallel_IO_MPI_2.ppt. This way of writing is
>> > called "Accessing Irregularly Distributed Arrays" in this tutorial and the
>> > example is given in page 42.
>> >
>> > I tested my code with mvapich and got the result as expected. But when I
>> > tested it with openmpi, it didn't work. I tried the version 1.2.8 and 1.4
>> > and both didn't work. I tried two clusters. Both of them are intel chips
>> > (woodcrest and nehalem), DDR infiniband with Linux system. I got some error
>> > message like
>> >
>> > +++
>> > [n0883:08251] *** Process received signal ***
>> > [n0883:08249] *** Process received signal ***
>> > [n0883:08249] Signal: Segmentation fault (11)
>> > [n0883:08249] Signal code: Address not mapped (1)
>> > [n0883:08249] Failing at address: (nil)
>> > [n0883:08251] Signal: Segmentation fault (11)
>> > [n0883:08251] Signal code: Address not mapped (1)
>> > [n0883:08251] Failing at address: (nil)
>> > [n0883:08248] *** Process received signal ***
>> > [n0883:08250] *** Process received signal ***
>> > [n0883:08248] Signal: Segmentation fault (11)
>> > [n0883:08248] Signal code: Address not mapped (1)
>> > [n0883:08248] Failing at address: (nil)
>> > [n0883:08250] Signal: Segmentation fault (11)
>> > [n0883:08250] Signal code: Address not mapped (1)
>> > [n0883:08250] Failing at address: (nil)
>> > [n0883:08251] [ 0] /lib64/libpthread.so.0 [0x2b4f0a2f0d60]
>> > +++
>> >
>> >
>> >
>> > My testing code is here:
>> >
>> >
>> > ===
>> > program test_MPI_write_adv2
>> >
>> >
>> >   !-- Template for any mpi program
>> >
>> >   implicit none
>> >
>> >   !--Include the mpi header file
>> >   include 'mpif.h'  ! --> Required statement
>> >
>> >   !--Declare all variables and arrays.
>> >   integer :: fh, ierr, myid, numprocs, itag, etype, filetype, info
>> >   integer :: status(MPI_STATUS_SIZE)
>> >   integer :: irc, ip
>> >   integer(kind=mpi_offset_kind) :: offset, disp
>> >   integer :: i, j, k
>

Re: [OMPI users] MPi Abort verbosity

2010-02-24 Thread jody

Hi Gabriele
you could always pipe your output through grep

my_app | grep "MPI_ABORT was invoked"
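
A note on the pipe above: as far as i know mpirun writes this banner to standard error, not standard output, in which case a plain pipe would not catch it and stderr has to be merged in first. The banner also wraps onto a second line, so one line of trailing context is needed to see the error code. A minimal sketch of the idea follows; fake_mpirun is a made-up stand-in so the snippet runs without an MPI installation, and on a real system one would call mpirun with the actual application instead.

```shell
# Hypothetical stand-in for "mpirun -np 2 ./my_app": one line on stdout,
# the abort banner on stderr (assumed behaviour of the real mpirun).
fake_mpirun() {
  echo "I am 0 and we are in 2"
  {
    echo "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD"
    echo "with errorcode 4."
    echo ""
    echo "NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes."
  } >&2
}

# 2>&1 merges stderr into stdout before the pipe, so grep sees both streams.
# -A 1 also prints the line after the match (the "with errorcode" line).
filtered=$(fake_mpirun 2>&1 | grep -A 1 "MPI_ABORT was invoked")
echo "$filtered"
```

If the banner really does arrive on stderr, dropping the 2>&1 leaves grep with nothing to match.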

jody

On Wed, Feb 24, 2010 at 11:28 AM, Gabriele Fatigati
 wrote:
> Hi Nadia,
>
> thanks for quick reply.
>
> But i suppose that parameter is 0 by default. Suppose i have the following
> output:
>
> - --
> 
> - --> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> with errorcode 4. <--
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> - --
> Inside my_mpi_err_handler
> Inside my_mpi_err_handler
> I am 0 and we are in 2
> I am 1 and we are in 2
> - --
> mpirun has exited due to process rank 0 with PID 3773 on
> node nb-user exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> - --
> - --
>
> I would like to see only this:
>
> - --> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> with errorcode 4. <--
>
> And nothing else. Is it possible?
>
> I can upgrade my OpenMPI if necessary.
>
> Thanks.
>
>
> 2010/2/24 Nadia Derbey 
>>
>> On Wed, 2010-02-24 at 09:55 +0100, Gabriele Fatigati wrote:
>> >
>> > Dear Openmpi users and developers,
>> >
>> > i have a question about the MPI_Abort error message. I have a program
>> > written in C++. Is there a way to decrease the verbosity of this error?
>> > When this function is called, openmpi prints a lot of information, like the
>> > stack trace, the rank of the process that called MPI_Abort, etc. But i'm
>> > interested only in the calling rank. Is it possible?
>>
>> Hi,
>>
>> Setting the mca parameter "mpi_abort_print_stack" to 0 keeps the stack
>> from being printed.
>> >
>> > Thanks in advance.
>> >
>> > I'm using openmpi 1.2.2
>>
>> ... well, don't know if it's available in that release...
>>
>>
>> Regards,
>> Nadia
>> > --
>> > Ing. Gabriele Fatigati
>> >
>> > Parallel programmer
>> >
>> > CINECA Systems & Tecnologies Department
>> >
>> > Supercomputing Group
>> >
>> > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>> >
>> > www.cineca.it                    Tel:   +39 051 6171722
>> >
>> > g.fatigati [AT] cineca.it
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> --
>> Nadia Derbey 
>>
>>
>
>
>
> --
> Ing. Gabriele Fatigati
>
> Parallel programmer
>
> CINECA Systems & Tecnologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.it                    Tel:   +39 051 6171722
>
> g.fatigati [AT] cineca.it
>
>

Re: [OMPI users] problems on parallel writing

2010-02-24 Thread jody

Hi
I know next to nothing about Fortran,
but it looks to me as if the pointer 'temp' in

> call MPI_FILE_WRITE(FH, temp, COUNT, MPI_REAL8, STATUS, IERR)

is not allocated (or perhaps NULL?) on any processor except processor 0:

> if ( myid == 0 ) then
> count = 1
>  else
> count = 0
>  end if
>
> if (count .gt. 0) then
> allocate(temp(count))
> temp(1) = 2122010.0d0
>  end if

In C/C++ something like this would almost certainly lead to a crash,
but i don't know if this would be the case in Fortran...
jody


On Wed, Feb 24, 2010 at 4:38 AM, w k  wrote:
> Hello everyone,
>
>
> I'm trying to implement some functions in my code using parallel writing.
> Each processor has an array, say q, whose length is single_no(could be zero
> on some processors). I want to write q down to a common file, but the
> elements of q would be scattered to their locations in this file. The
> locations of the elements are described by a map. I wrote my testing code
> according to an example in a MPI-2 tutorial which can be found here:
> www.npaci.edu/ahm2002/ahm_ppt/Parallel_IO_MPI_2.ppt. This way of writing is
> called "Accessing Irregularly Distributed Arrays" in this tutorial and the
> example is given in page 42.
>
> I tested my code with mvapich and got the result as expected. But when I
> tested it with openmpi, it didn't work. I tried the version 1.2.8 and 1.4
> and both didn't work. I tried two clusters. Both of them are intel chips
> (woodcrest and nehalem), DDR infiniband with Linux system. I got some error
> message like
>
> +++
> [n0883:08251] *** Process received signal ***
> [n0883:08249] *** Process received signal ***
> [n0883:08249] Signal: Segmentation fault (11)
> [n0883:08249] Signal code: Address not mapped (1)
> [n0883:08249] Failing at address: (nil)
> [n0883:08251] Signal: Segmentation fault (11)
> [n0883:08251] Signal code: Address not mapped (1)
> [n0883:08251] Failing at address: (nil)
> [n0883:08248] *** Process received signal ***
> [n0883:08250] *** Process received signal ***
> [n0883:08248] Signal: Segmentation fault (11)
> [n0883:08248] Signal code: Address not mapped (1)
> [n0883:08248] Failing at address: (nil)
> [n0883:08250] Signal: Segmentation fault (11)
> [n0883:08250] Signal code: Address not mapped (1)
> [n0883:08250] Failing at address: (nil)
> [n0883:08251] [ 0] /lib64/libpthread.so.0 [0x2b4f0a2f0d60]
> +++
>
>
>
> My testing code is here:
>
> ===
> program test_MPI_write_adv2
>
>
>   !-- Template for any mpi program
>
>   implicit none
>
>   !--Include the mpi header file
>   include 'mpif.h'  ! --> Required statement
>
>   !--Declare all variables and arrays.
>   integer :: fh, ierr, myid, numprocs, itag, etype, filetype, info
>   integer :: status(MPI_STATUS_SIZE)
>   integer :: irc, ip
>   integer(kind=mpi_offset_kind) :: offset, disp
>   integer :: i, j, k
>
>   integer :: num
>
>   character(len=64) :: filename
>
>   real(8), pointer :: q(:), temp(:)
>   integer, pointer :: map(:)
>   integer :: single_no, count
>
>
>   !--Initialize MPI
>   call MPI_INIT( ierr ) ! --> Required statement
>
>   !--Who am I? --- get my rank=myid
>   call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
>
>   !--How many processes in the global group?
>   call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
>
>   if ( myid == 0 ) then
>  single_no = 4
>   elseif ( myid == 1 ) then
>  single_no = 2
>   elseif ( myid == 2 ) then
>  single_no = 2
>   elseif ( myid == 3 ) then
>  single_no = 3
>   else
>  single_no = 0
>   end if
>
>   if (single_no .gt. 0) allocate(map(single_no))
>
>   if ( myid == 0 ) then
>  map = (/ 0, 2, 5, 6 /)
>   elseif ( myid == 1 ) then
>  map = (/ 1, 4 /)
>   elseif ( myid == 2 ) then
>  map = (/ 3, 9 /)
>   elseif ( myid == 3 ) then
>  map = (/ 7, 8, 10 /)
>   end if
>
>   if (single_no .gt. 0) allocate(q(single_no))
>
>   if (single_no .gt. 0) then
>  do i = 1,single_no
>     q(i) = dble(myid+1)*100.0d0 + dble(map(i)+1)
>  end do
>   end if
>
>   if (single_no .gt. 0) map = map + 1
>
>   if ( myid == 0 ) then
>  count = 1
>   else
>  count = 0
>   end if
>
>   if (count .gt. 0) then
>  allocate(temp(count))
>  temp(1) = 2122010.0d0
>   end if
>
>   write(filename,'(a)') 'test_write.bin'
>
>   call MPI_FI

Re: [OMPI users] Non-homogeneous Cluster Implementation

2010-01-28 Thread jody

Hi
I'm not sure i completely understood.
Is it the case that an application compiled on the dell will not work
on the PS3 and vice versa?

If this is the case, you could try this:
  shell$ mpirun -np 1 --host a app_ps3 : -np 1 --host b app_dell
where app_ps3 is your application compiled on the PS3 and a is your PS3 host,
and app_dell is your application compiled on the dell, and b is your dell host.

Check the Open MPI FAQ:
  http://www.open-mpi.org/faq/?category=running#mpmd-run
  http://www.open-mpi.org/faq/?category=running#mpirun-host

Hope this helps
  Jody

On Thu, Jan 28, 2010 at 3:08 AM, Lee Manko  wrote:
> OK, so please stop me if you have heard this before, but I couldn’t find
> anything in the archives that addressed my situation.
>
>
>
> I have a Beowulf cluster where ALL the nodes are PS3s running Yellow Dog
> Linux 6.2 and a host (server) that is a Dell i686 Quad-core running Fedora
> Core 12.  After a failed attempt at letting yum install openmpi, I
> downloaded v1.4.1, compiled and installed on all machines (PS3s and
> Dell).  I have an NFS shared directory on the host where the application
> resides after building.  All nodes have access to the shared volume and they
> can see any files in the shared volume.
>
>
>
> I wrote a very simple master/slave application where the slave does a simple
> computation and gets the processor name.  The slave returns both pieces of
> information to the master who then simply displays it in the terminal
> window.  After the slaves work on 1024 such tasks, the master exits.
>
>
>
> When I run on the host, without distributing to the nodes, I use the
> command:
>
>
>
> "mpirun -np 4 ./MPI_Example"
>
>
>
> Compiling and running the application on the native hardware works perfectly
> (ie: compiled and run on the PS3 or compiled and run on the Dell).
>
>
>
> However, when I went to scatter the tasks to the nodes, using the following
> command,
>
>
>
> "mpirun -np 4 -hostfile mpi-hostfile ./MPI_Example"
>
>
>
> the application fails.  I’m surmising that the issue is with running code
> that was compiled for the Dell on the PS3 since the MPI_Init will launch the
> application from the shared volume.
>
>
>
> So, I took the source code and compiled it on both the Dell and the PS3 and
> placed the executables in /shared_volume/Dell and /shared_volume/PS3 and
> added the paths to the environment variable PATH.  I tried to run the
> application from the host again using the following command,
>
>
>
> "mpirun -np 4 -hostfile mpi-hostfile -wdir
> /shared_volume/PS3 ./MPI_Example"
>
>
>
> Hoping that the wdir would set the working directory at the time of the call
> to MPI_Init() so that MPI_Init will launch the PS3 version of the
> executable.
>
>
>
> I get the error:
>
> Could not execute the executable “./MPI_Example” : Exec format error
>
> This could mean that your PATH or executable name is wrong, or that you do
> not
>
> have the necessary permissions.  Please ensure that the executable is able
> to be
>
> found and executed.
>
>
>
> Now, I know I’m gonna get some heat for this, but all of these machines use
> only the root account with full root privileges, so it’s not a permission
> issue.
>
>
>
>
>
> I am sure there is a simple solution to my problem.  Replacing the host with a
> PS3 is not an option. Does anyone have any suggestions?
>
>
>
> Thanks.
>
>
>
> PS: When I get to programming the Cell BE, then I’ll use the IBM Cell SDK
> with its cross-compiler toolchain.
>
>

Re: [OMPI users] man-files not installed

2009-12-21 Thread jody

Thanks, that did it!
BTW, in the man page for mpirun you should perhaps mention the "!"
option in xterm - the one that keeps the xterms open after the
application exits.

Thanks
  Jody
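
For the archive, a minimal sketch of the MANPATH fix. The directory /opt/openmpi/share/man is what i assume follows from the --prefix and soft link described in the original post; the function name prepend_manpath and the starting value of MANPATH are only for illustration.

```shell
# Put a directory at the front of MANPATH so its pages are found before
# older copies. If MANPATH was empty the result ends in ":", which the man
# implementations i know treat as "then search the default path" (assumption).
prepend_manpath() {
  MANPATH="$1:${MANPATH:-}"
  export MANPATH
}

MANPATH="/usr/share/man"                 # example starting value
prepend_manpath /opt/openmpi/share/man   # share/man below the soft link
echo "$MANPATH"
```

Afterwards "man -w mpirun" should print the path of the page that man actually picks, which makes it easy to see whether the new one is found first.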


On Mon, Dec 21, 2009 at 3:25 PM, Ralph Castain  wrote:
> Is your MANPATH set to point to /opt/openmpi/man? Check the order as well to 
> make sure that is first - could be an older install (like the system default) 
> is before it.
>
> On Dec 21, 2009, at 5:46 AM, jody wrote:
>
>> Hi
>> I just installed open-mpi version 1.4,
>> and now i noticed that the man-files are not properly installed.
>>
>> When i do
>>  man mpirun
>> i get a different output than what is in
>>  openmpi/share/man/man1/mpirun.1
>>
>> I installed with this configuration:
>>  ./configure --prefix=/opt/openmpi-1.4 --disable-mpi-f77
>> --disable-mpi-f90 --with-threads
>> and afterwards made a soft link
>>
>> ln -s /opt/openmpi-1.4 /opt/openmpi
>>
>> This is on fedora fc8, but i have the same problem on my gentoo
>> machines (2.6.29-gentoo-r5)
>> Does anybody know how to replace the old man files with the new ones?
>>
>> Thank You
>>  Jody
>
>
>

[OMPI users] man-files not installed

2009-12-21 Thread jody

Hi
I just installed open-mpi version 1.4,
and now i noticed that the man-files are not properly installed.

When i do
  man mpirun
i get a different output than what is in
  openmpi/share/man/man1/mpirun.1

I installed with this configuration:
  ./configure --prefix=/opt/openmpi-1.4 --disable-mpi-f77
--disable-mpi-f90 --with-threads
and afterwards made a soft link

ln -s /opt/openmpi-1.4 /opt/openmpi

This is on fedora fc8, but i have the same problem on my gentoo
machines (2.6.29-gentoo-r5)
Does anybody know how to replace the old man files with the new ones?

Thank You
  Jody

Re: [OMPI users] Debugging spawned processes

2009-12-21 Thread jody

Hi Ralph

I finally got around to installing version 1.4.
The xterm works fine.

And in order to get gdb going on the spawned processes, i need to add
the argument "--args" to the argument list of the spawner, so that the
parameters of the spawned processes are passed through gdb to them.
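
To illustrate what the argument list looks like in both cases, here is a small sketch. It only prints the command line that would be handed to the spawn call; the function name spawn_command and the program name ./worker are made up, and "9 sample.tlf" are the parameters from my earlier mail.

```shell
# Print the command line for a spawned process.
# $1 = "debug" or "run"; the remaining arguments are the worker and its parameters.
spawn_command() {
  mode="$1"
  shift
  if [ "$mode" = "debug" ]; then
    # Without --args, gdb would take the argument after the program name
    # as a core file or process id instead of passing it on to the program.
    echo "gdb --args $*"
  else
    echo "$*"
  fi
}

spawn_command debug ./worker 9 sample.tlf
spawn_command run   ./worker 9 sample.tlf
```

Switching between debugging and normal running then comes down to one flag instead of editing the argv by hand.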

Thanks again
  Jody


On Fri, Dec 18, 2009 at 10:46 PM, Ashley Pittman  wrote:
> On Wed, 2009-12-16 at 12:06 +0100, jody wrote:
>
>> Has anybody got some hints on how to debug spawned processes?
>
> If you can live with the processes starting normally and attaching gdb
> to them after they have started then you could use padb.
>
> Assuming you only have one job active (replace -a with the job-id if you
> don't) and watch to target the first spawned job then the following
> command will launch an xterm for each rank in the job and automatically
> attach to the process for you.
>
> padb -Oorte-job-step=2 --command -Ocommand="xterm -T %r -e 'gdb -p %p'"
> -a
>
> You'll need to use the SVN version of padb for this, the "orte-job-step"
> option tells it to attach to the first spawned job, use orte-ps to see
> the list of job steps.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
>

Re: [OMPI users] Debugging spawned processes

2009-12-17 Thread jody

yeah, now that you mention it, i remember (old brain here, as well)
But IIRC you created an OMPI version which was called 1.4a1r or something,
where i indeed could use this xterm. When i updated to 1.3.2, i sort
of forgot about it again...

Another question though:
You said "If it includes the -xterm option, then that option gets
applied to the dynamically spawned procs too"
Does this passing on also apply to the -x options?

Thanks
  Jody

On Wed, Dec 16, 2009 at 3:42 PM, Ralph Castain  wrote:
> It is in a later version - pretty sure it made 1.3.3. IIRC, I added it at 
> your request :-)
>
> On Dec 16, 2009, at 7:20 AM, jody wrote:
>
>> Thanks for your reply
>>
>> That sounds good. I have Open-MPI version 1.3.2, and mpirun seems not
>> to recognize the --xterm option.
>> [jody@plankton tileopt]$ mpirun --xterm -np 1 ./boss 9 sample.tlf
>> --
>> mpirun was unable to launch the specified application as it could not
>> find an executable:
>>
>> Executable: 1
>> Node: aim-plankton.uzh.ch
>>
>> while attempting to start process rank 0.
>> --
>> (if i reverse the --xterm and -np 1, it complains about not finding
>> executable '9')
>> Do i need to install a higher version, or is this something i'd have
>> to set as option in configure?
>>
>> Thank You
>>  Jody
>>
>> On Wed, Dec 16, 2009 at 1:00 PM, Ralph Castain  wrote:
>>> Depends on the version you are working with. If it includes the -xterm 
>>> option, then that option gets applied to the dynamically spawned procs too, 
>>> so this should be automatically taken care of...but in that case, you 
>>> wouldn't need your script to open an xterm anyway. You would just do:
>>>
>>> mpirun --xterm -np 5 gdb ./my_app
>>>
>>> or the equivalent. You would then comm_spawn an argv[0] of "gdb", with 
>>> argv[1] being your target app.
>>>
>>> I don't know how to avoid including that "gdb" in the comm_spawn argv's - I 
>>> once added an mpirun cmd line option to automatically add it, but got 
>>> loudly told to remove it.  Of course, it should be easy to pass an option 
>>> to your app itself that tells it whether or not to do so!
>>>
>>> HTH
>>> Ralph
>>>
>>>
>>> On Dec 16, 2009, at 4:06 AM, jody wrote:
>>>
>>>> Hi
>>>> Until now i always wrote applications for which the number of processes
>>>> was given on the command line with -np.
>>>> To debug these applications i wrote a script, run_gdb.sh which basically
>>>> open a xterm and starts gdb in it for my application.
>>>> This allowed me to have a window for each of the processes being debugged.
>>>>
>>>> Now, however, i write my first application in which additional processes 
>>>> are
>>>> being spawned. My question is now: how can i open xterm windows in which
>>>> gdb runs for the spawned processes?
>>>>
>>>> The only way i can think of is to pass my script run_gdb.sh into the argv
>>>> parameters of MPI_Spawn.
>>>> Would this be correct?
>>>> If yes, what about other parameters passed to the spawning process, such as
>>>> environment variables passed via -x? Are they being passed to the spawned
>>>> processes as well? In my case this would be necessary so that processes
>>>> on other machine will get the $DISPLAY environment variable in order to
>>>> display their xterms with gdb on my workstation.
>>>>
>>>> Another negative point would be the need to change the argv parameters
>>>> every time one switches between debugging and normal running.
>>>>
>>>> Has anybody got some hints on how to debug spawned processes?
>>>>
>>>> Thank You
>>>>  Jody
>>>
>>>
>>>
>>
>
>
>

Re: [OMPI users] Debugging spawned processes

2009-12-16 Thread jody

Thanks for your reply

That sounds good. I have Open-MPI version 1.3.2, and mpirun seems not
to recognize the --xterm option.
[jody@plankton tileopt]$ mpirun --xterm -np 1 ./boss 9 sample.tlf
--
mpirun was unable to launch the specified application as it could not
find an executable:

Executable: 1
Node: aim-plankton.uzh.ch

while attempting to start process rank 0.
--
(if i reverse the --xterm and -np 1, it complains about not finding
executable '9')
Do i need to install a higher version, or is this something i'd have
to set as option in configure?

Thank You
  Jody

On Wed, Dec 16, 2009 at 1:00 PM, Ralph Castain  wrote:
> Depends on the version you are working with. If it includes the -xterm 
> option, then that option gets applied to the dynamically spawned procs too, 
> so this should be automatically taken care of...but in that case, you 
> wouldn't need your script to open an xterm anyway. You would just do:
>
> mpirun --xterm -np 5 gdb ./my_app
>
> or the equivalent. You would then comm_spawn an argv[0] of "gdb", with 
> argv[1] being your target app.
>
> I don't know how to avoid including that "gdb" in the comm_spawn argv's - I 
> once added an mpirun cmd line option to automatically add it, but got loudly 
> told to remove it.  Of course, it should be easy to pass an option to your 
> app itself that tells it whether or not to do so!
>
> HTH
> Ralph
>
>
> On Dec 16, 2009, at 4:06 AM, jody wrote:
>
>> Hi
>> Until now i always wrote applications for which the number of processes
>> was given on the command line with -np.
>> To debug these applications i wrote a script, run_gdb.sh which basically
>> open a xterm and starts gdb in it for my application.
>> This allowed me to have a window for each of the processes being debugged.
>>
>> Now, however, i write my first application in which additional processes are
>> being spawned. My question is now: how can i open xterm windows in which
>> gdb runs for the spawned processes?
>>
>> The only way i can think of is to pass my script run_gdb.sh into the argv
>> parameters of MPI_Spawn.
>> Would this be correct?
>> If yes, what about other parameters passed to the spawning process, such as
>> environment variables passed via -x? Are they being passed to the spawned
>> processes as well? In my case this would be necessary so that processes
>> on other machine will get the $DISPLAY environment variable in order to
>> display their xterms with gdb on my workstation.
>>
>> Another negative point would be the need to change the argv parameters
>> every time one switches between debugging and normal running.
>>
>> Has anybody got some hints on how to debug spawned processes?
>>
>> Thank You
>>  Jody
>
>
>

[OMPI users] Debugging spawned processes

2009-12-16 Thread jody

Hi
Until now i always wrote applications for which the number of processes
was given on the command line with -np.
To debug these applications i wrote a script, run_gdb.sh, which basically
opens an xterm and starts gdb in it for my application.
This allowed me to have a window for each of the processes being debugged.

Now, however, i write my first application in which additional processes are
being spawned. My question is now: how can i open xterm windows in which
gdb runs for the spawned processes?

The only way i can think of is to pass my script run_gdb.sh into the argv
parameters of MPI_Comm_spawn.
Would this be correct?
If yes, what about other parameters passed to the spawning process, such as
environment variables passed via -x? Are they being passed to the spawned
processes as well? In my case this would be necessary so that processes
on other machines will get the $DISPLAY environment variable in order to
display their xterms with gdb on my workstation.

Another negative point would be the need to change the argv parameters
every time one switches between debugging and normal running.

Has anybody got some hints on how to debug spawned processes?

Thank You
  Jody

Re: [OMPI users] Open MPI Query

2009-11-24 Thread jody

Hi

>> 2) Does MPI_Send() and MPI_Recv() calls send message from process on
>> one machine to
>>    process on another machine ? If yes, then how can I achieve this ?
>
> Take a look at what the example codes are doing.  Read man mpirun.  Wait
> for someone here to point you to an MPI primer or tute.
>

Have a look at the Open MPI FAQ:
  http://www.open-mpi.org/faq/?category=running
It shows you how to run an Open MPI program on a single machine or on multiple machines.

Jody

Re: [OMPI users] OMPI-1.2.0 is not getting installed

2009-10-21 Thread jody

Sorry, i can't help you here.
I have no experience with either Intel compilers or IB.

Jody

On Wed, Oct 21, 2009 at 4:14 AM, Sangamesh B  wrote:
>
>
> On Tue, Oct 20, 2009 at 6:48 PM, jody  wrote:
>>
>> Hi
>> Just curious:
>> Is there a particular reason why you want version 1.2?
>
> Yes. Our cluster is installed with Intel MKL-10.0. This version of MKL
> contains a static blacs library which is compatible with OMPI-1.2 as told by
> Intel support team.
>
> http://software.intel.com/en-us/forums/intel-math-kernel-library/topic/69104/
>
> Is it possible to get it installed?
> Thanks,
> Sangamesh
>>
>> The current version is 1.3.3!
>>
>> Jody
>>
>> On Tue, Oct 20, 2009 at 2:48 PM, Sangamesh B  wrote:
>> > Hi,
>> >
>> >  Its required here to install Open MPI 1.2 on a HPC cluster with -
>> > Cent
>> > OS 5.2 Linux, Mellanox IB card, switch and OFED-1.4.
>> > But the configure is failing with:
>> >
>> > [root@master openmpi-1.2]# ./configure
>> > --prefix=/opt/mpi/openmpi/1.2/intel
>> > --with-openib=/usr
>> > ..
>> > ...
>> >
>> > --- MCA component btl:openib (m4 configuration macro)
>> > checking for MCA component btl:openib compile mode... dso
>> > checking for sysfs_open_class in -lsysfs... no
>> > configure: error: OpenIB support requested but required sysfs not found.
>> > Aborting
>> >
>> > even though the required rpms are available:
>> >
>> > # rpm -qa | grep sysfs
>> > sysfsutils-2.0.0-6
>> > libsysfs-2.0.0-6
>> > libsysfs-2.0.0-6
>> >
>> >
>> > What to do get install OMPI-1.2 specifically?
>> >
>> > Thanks
>> >
>>
>
>
>

Re: [OMPI users] OMPI-1.2.0 is not getting installed

2009-10-20 Thread jody

Hi
Just curious:
Is there a particular reason why you want version 1.2?
The current version is 1.3.3!

Jody

On Tue, Oct 20, 2009 at 2:48 PM, Sangamesh B  wrote:
> Hi,
>
>  It is required here to install Open MPI 1.2 on an HPC cluster with CentOS
> 5.2 Linux, Mellanox IB card, switch and OFED-1.4.
> But the configure is failing with:
>
> [root@master openmpi-1.2]# ./configure --prefix=/opt/mpi/openmpi/1.2/intel
> --with-openib=/usr
> ..
> ...
>
> --- MCA component btl:openib (m4 configuration macro)
> checking for MCA component btl:openib compile mode... dso
> checking for sysfs_open_class in -lsysfs... no
> configure: error: OpenIB support requested but required sysfs not found.
> Aborting
>
> even though the required rpms are available:
>
> # rpm -qa | grep sysfs
> sysfsutils-2.0.0-6
> libsysfs-2.0.0-6
> libsysfs-2.0.0-6
>
>
> What should I do to get OMPI-1.2 specifically installed?
>
> Thanks
>

Re: [OMPI users] how to set up the cluster of 5 nodes in openmpi

2009-09-30 Thread jody

Hi
All of your questions are answered in the FAQ...

If you have a TCP/IP connection between your machines so that each
machine can reach every other one,
that will be ok.

First make sure you can get access from each machine to every other
one using ssh without a password.
See the FAQ:
  http://www.open-mpi.org/faq/?category=rsh

Make sure to set PATH and LD_LIBRARY_PATH as described in the FAQ:
  http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path

Next, make sure your application is accessible by all of your
machines. I use an nfs directory shared by all my machines,
and that is where i put the application.

To start your application, follow the instructions in the FAQ:
   http://www.open-mpi.org/faq/?category=running

If you want to use host files, read about how to use them in the FAQ:
  http://www.open-mpi.org/faq/?category=running#mpirun-host
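
Regarding lamboot: Open MPI has no separate boot step, mpirun starts what it needs on the listed hosts by itself. A minimal sketch of a hostfile for two machines follows; the host names node01 and node02, the slot counts and ./my_app are placeholders, not values from this thread.

```shell
# Write a hostfile listing each machine and how many processes it may run.
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
node01 slots=2
node02 slots=2
EOF

# Add up the slots to get the number of processes the hostfile has room for.
total_slots=$(awk -F'slots=' '{ sum += $2 } END { print sum }' "$hostfile")
echo "$total_slots"

# On a real cluster the launch would then look like:
#   mpirun -np "$total_slots" --hostfile "$hostfile" ./my_app
rm -f "$hostfile"
```

With passwordless ssh and the PATH settings from the FAQ entries above in place, that single mpirun call starts the program on both machines.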

Hope that helps

Jody

On Wed, Sep 30, 2009 at 11:00 AM, ankur pachauri
 wrote:
> Dear all,
>
> I have been able to install Open MPI on two independent machines running FC
> 10. The simple hello world programs are running fine on the independent
> machines. But can anyone please help me by letting me know how to connect
> the two machines and run a common program between the two? How do we do
> the equivalent of "lamboot -v lamhosts" in Open MPI?
> How do we get Open MPI running on the two computers simultaneously and
> execute a common program on the two machines?
>
> Thanks in advance
>
>
> On Wed, Sep 30, 2009 at 12:24 PM, jody  wrote:
>>
>> Hi
>> Have look at the Open MPI FAQ:
>>
>>  http://www.open-mpi.org/faq/
>>
>> It gives you all the information you need to start working with your
>> cluster.
>>
>> Jody
>>
>>
>> On Wed, Sep 30, 2009 at 8:25 AM, ankur pachauri 
>> wrote:
>> > dear all,
>> >
>> > i am new to openmpi, all that i need is to set up the cluster of around
>> > 5
>> > nodes in my lab, i am using fedora 7 in the lab. so i'll be thankfull to
>> > you
>> > if let me know the steps or the procedure to setup the cluster(as in
>> > case of
>> > lam/mpi- passwordless ssh or nfs mount and ...).
>> >
>> > regards,
>> >
>> > --
>> > Ankur Pachauri.
>> > 09927590910
>> >
>> > Research Scholar,
>> > software engineering.
>> > Department of Mathematics
>> > Dayalbagh Educational Institute
>> > Dayalbagh,
>> > AGRA
>> >
>> >
>
>
>
> --
> Ankur Pachauri.
> 09927590910
>
> Research Scholar,
> software engineering.
> Department of Mathematics
> Dayalbagh Educational Institute
> Dayalbagh,
> AGRA
>
>

Re: [OMPI users] how to set up the cluster of 5 nodes in openmpi

2009-09-30 Thread jody

Hi
Have a look at the Open MPI FAQ:

  http://www.open-mpi.org/faq/

It gives you all the information you need to start working with your cluster.

Jody


On Wed, Sep 30, 2009 at 8:25 AM, ankur pachauri  wrote:
> dear all,
>
> i am new to openmpi, all that i need is to set up a cluster of around 5
> nodes in my lab, i am using fedora 7 in the lab. so i'll be thankful to you
> if you let me know the steps or the procedure to set up the cluster (as in
> the case of lam/mpi: passwordless ssh or nfs mount and ...).
>
> regards,
>
> --
> Ankur Pachauri.
> 09927590910
>
> Research Scholar,
> software engineering.
> Department of Mathematics
> Dayalbagh Educational Institute
> Dayalbagh,
> AGRA
>
>

Re: [OMPI users] MPI_Irecv segmentation fault

2009-09-22 Thread jody

Did you also change the "&buffer" to buffer in your MPI_Send call?

Jody

On Tue, Sep 22, 2009 at 1:38 PM, Everette Clemmer  wrote:
> Hmm, tried changing MPI_Irecv( &buffer) to MPI_Irecv( buffer...)
> and still no luck. Stack trace follows if that's helpful:
>
> prompt$ mpirun -np 2 ./display_test_debug
> Sending 'q' from node 0 to node 1
> [COMPUTER:50898] *** Process received signal ***
> [COMPUTER:50898] Signal: Segmentation fault (11)
> [COMPUTER:50898] Signal code:  (0)
> [COMPUTER:50898] Failing at address: 0x0
> [COMPUTER:50898] [ 0] 2   libSystem.B.dylib
> 0x7fff87e280aa _sigtramp + 26
> [COMPUTER:50898] [ 1] 3   ???
> 0x 0x0 + 0
> [COMPUTER:50898] [ 2] 4   GLUT
> 0x000100024a21 glutMainLoop + 261
> [COMPUTER:50898] [ 3] 5   display_test_debug
> 0x00011444 xsMainLoop + 67
> [COMPUTER:50898] [ 4] 6   display_test_debug
> 0x00011335 main + 59
> [COMPUTER:50898] [ 5] 7   display_test_debug
> 0x00010d9c start + 52
> [COMPUTER:50898] [ 6] 8   ???
> 0x0001 0x0 + 1
> [COMPUTER:50898] *** End of error message ***
> mpirun noticed that job rank 0 with PID 50897 on node COMPUTER.local
> exited on signal 15 (Terminated).
> 1 additional process aborted (not shown)
>
> Thanks,
> Everette
>
>
> On Tue, Sep 22, 2009 at 2:28 AM, Ake Sandgren  
> wrote:
>> On Mon, 2009-09-21 at 19:26 -0400, Everette Clemmer wrote:
>>> Hey all,
>>>
>>> I'm getting a segmentation fault when I attempt to receive a single
>>> character via MPI_Irecv. Code follows:
>>>
>>> void recv_func() {
>>>               if( !MASTER ) {
>>>                       char            buffer[ 1 ];
>>>                       int             flag;
>>>                       MPI_Request request;
>>>                       MPI_Status      status;
>>>
>>>                       MPI_Irecv( &buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG, 
>>> MPI_COMM_WORLD, &request);
>>
>> It should be MPI_Irecv(buffer, 1, ...)
>>
>>> The segfault disappears if I comment out the MPI_Irecv call in
>>> recv_func so I'm assuming that there's something wrong with the
>>> parameters that I'm passing to it. Thoughts?
>>
>> --
>> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
>> Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
>> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>>
>>
>
>
>
> --
> - Everette
>
>

Re: [OMPI users] Timers

2009-09-11 Thread jody

Hi
I'm not sure if i completely understand your requirements,
but have you tried MPI_Wtime?
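
Note that MPI_Wtime returns wall-clock time, so the value will still vary
with whatever else is running on the nodes. CPU time is usually less
sensitive to other load, though it is not immune to it and it ignores time
the process spends sleeping. Below is a plain C sketch comparing the two,
written without MPI so it can be tried anywhere; in an MPI program,
MPI_Wtime() would take the place of wall_seconds(). All function names are
hypothetical, timespec_get needs C11, and clock() is assumed to report CPU
time as it does on POSIX systems.

```c
#include <time.h>

/* Wall-clock seconds since the epoch (C11 timespec_get).
   MPI_Wtime() reports the same kind of quantity in an MPI program. */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* CPU seconds consumed by this process so far. */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Hypothetical stand-in for the real computation:
   sums the first n terms of the harmonic series. */
double busy_work(long n)
{
    double sum = 0.0;
    long i;

    for (i = 1; i <= n; ++i)
        sum += 1.0 / (double)i;
    return sum;
}

/* Runs busy_work(n) and stores the elapsed wall-clock and CPU seconds
   in *wall and *cpu. Returns the result of the computation. */
double time_work(long n, double *wall, double *cpu)
{
    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    double result = busy_work(n);

    *wall = wall_seconds() - w0;
    *cpu = cpu_seconds() - c0;
    return result;
}
```

On an otherwise idle machine the two numbers should be close; when other
programs compete for the processor, the wall-clock figure grows while the
CPU figure tends to stay roughly the same.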

Jody

On Fri, Sep 11, 2009 at 7:54 AM, amjad ali  wrote:
> Hi all,
> I want to get the elapsed time from start to end of my parallel program
> (Open MPI based). It should always give the same time for the same problem,
> irrespective of whether the nodes are running some other programs or are
> running only that program. How can I do this?
>
> Regards.
>
>

Re: [OMPI users] Problem with linking on OS X

2009-08-19 Thread Jody Klymak



Hi Tomek,

Did you try mpicc --showme?

I get:

gcc -D_REENTRANT -I/Network/Xgrid/openmpi/include -L/Network/Xgrid/openmpi/lib -lmpi -lopen-rte -lopen-pal -lutil


If your -L isn't correct in there, then I would guess your  
configuration found the wrong library somehow when you compiled mpicc  
and friends...


Cheers,  Jody

On Aug 19, 2009, at  15:57 PM, tomek wrote:

OK - I have fixed it by including -L/opt/openmpi/lib at the very  
beginning of mpicc ... -L/opt/openmpi/lib -o app.exe the rest ...


But something is wrong with dyld anyhow.

On 19 Aug 2009, at 21:04, Jody Klymak wrote:


Hi Tomek,

I'm using 10.5.7, and just went through a painful process that we
thought was library related (but it wasn't), so I'll give my
less-than-learned response, and if you still have difficulties hopefully
others will chime in:


What is the result of "which mpicc" (or whatever you are using for
your compiling/linking)?  I'm pretty sure that's where the library
paths get set, and if you are calling /usr/bin/mpicc you will get
the wrong library paths in the executable.


On Aug 19, 2009, at  10:57 AM, tomek wrote:




--
Jody Klymak
http://web.uvic.ca/~jklymak/

Re: [OMPI users] Problem with linking on OS X

2009-08-19 Thread Jody Klymak


Hi Tomek,

I'm using 10.5.7, and just went through a painful process that we
thought was library related (but it wasn't), so I'll give my
less-than-learned response, and if you still have difficulties hopefully
others will chime in:


What is the result of "which mpicc" (or whatever you are using for
your compiling/linking)?  I'm pretty sure that's where the library
paths get set, and if you are calling /usr/bin/mpicc you will get the
wrong library paths in the executable.


On Aug 19, 2009, at  10:57 AM, tomek wrote:


1. Using DYLD_LIBRARY_PATH
2. passing some ./configure --with-wrapper-ldflags="-L/opt/openmpi/lib"
3. passing some ./configure --with-wrapper-ldflags="-rpath /opt/openmpi/lib"

4. hand compilation with cc -L/opt/openmpi/lib -lmpi

2 and 3 did not work (ld error=22)

With 1 and 2 my code still gets linked with /usr/lib/libmpi...

Note that the /opt/openmpi/bin path is properly set and ompi_info
does output the right info.


You do not need to set DYLD_LIBRARY_PATH.  I don't have it set and my  
mpi applications run fine.


Did 4 work?

Cheers,  Jody


--
Jody Klymak
http://web.uvic.ca/~jklymak/

Re: [OMPI users] --rankfile

2009-08-19 Thread jody

Hi
I had a similar problem.
Following a suggestion from Lenny,
i removed the "max-slots" entries from
my hostfile and it worked.
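
For the neat.hostfile quoted below, that change would presumably amount to
keeping only the slots entries (a sketch derived from the poster's file,
not something i have run on that setup):

```
n64 slots=1
master slots=1
```

The rankfile and the mpirun command line would stay as they are.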

It seems that there still are some minor bugs in the rankfile mechanism.
See the post

http://www.open-mpi.org/community/lists/users/2009/08/10384.php


Jody


On Tue, Aug 18, 2009 at 10:53 PM, Nulik Nol wrote:
> Hi,
> i get this error when i use --rankfile,
> "There are not enough slots available in the system to satisfy the 2 slots"
> what could be the problem? I have tried using '*' for 'slot' param and
> many other configs without any luck. Without --rankfile everything
> works fine. Will appreciate any help.
>
> master waver # cat neat.hostfile
> n64 max-slots=1 slots=1
> master max-slots=1 slots=1
> master waver # cat neat.rankfile
> rank 0=n64 slot=0
> rank 1=master slot=0
> master waver # mpirun  --rankfile neat.rankfile --hostfile
> neat.hostfile -n 2 /tmp/neat
> --
> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application:
>    /tmp/neat
>
> Either request fewer slots for your application, or make more slots available
> for use.
>
> --
> --
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun: clean termination accomplished
>
> master waver # mpirun   --hostfile neat.hostfile -n 2 /tmp/neat
> entering master main loop
> recieved msg from 1
> unknown message 0
> ^Cmpirun: killing job...
>
> --
> mpirun noticed that process rank 1 with PID 13064 on node master
> exited on signal 0 (Unknown signal 0).
> --
> 2 total processes killed (some possibly by mpirun during cleanup)
> mpirun: clean termination accomplished
>
> master waver #
>
>
> --
> ==
> The power of zero is infinite
>

Re: [OMPI users] rank file error: Rankfile claimed...

2009-08-17 Thread jody

Hi Lenny
After removing the max-slots entries,
i could do
  mpirun -np 4 -hostfile th_02  -rf rf_02 ./HelloMPI
without any errors.

But can you explain what the meaning of the max-slots entry is?
I checked the FAQs
  http://www.open-mpi.org/faq/?category=running#simple-spmd-run
  http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
but i couldn't find any explanation. (Furthermore, the FAQ says "max-slots"
in one place, but "max_slots" in the other.)

Thank You
  Jody


On Mon, Aug 17, 2009 at 3:29 PM, Lenny
Verkhovsky wrote:
> Can you try not specifying "max-slots" in the hostfile?
> If you are the only user of the nodes, there will be no oversubscribing of
> the processors.
> This one definitely looks like a bug,
> but as Ralph said, this component is currently being discussed and worked
> on.
> Lenny.
> On Mon, Aug 17, 2009 at 2:37 PM, Ralph Castain  wrote:
>>>
>>> Is there an explanation for this?
>>
>> I believe the word is "bug". :-)
>>
>> The rank_file mapper has been substantially revised lately - we are
>> discussing now how much of that revision to bring to 1.3.4 versus the next
>> major release.
>>
>> Ralph
>>
>> On Aug 17, 2009, at 4:45 AM, jody wrote:
>>
>>> Hi Lenny
>>>
>>>> I think it has something to do with your environment,  /etc/hosts, IT
>>>> setup,
>>>> hostname function return value, etc.
>>>> I am not sure if it has something to do with Open MPI at all.
>>>
>>> OK. I just thought this was Open MPI related because i was able to use
>>> the
>>> aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in
>>> the host file...
>>>
>>> However, I encountered a new problem:
>>> if the rankfile lists all the entries which occur in the host file
>>> there is an error message.
>>> In the following example, the hostfile is
>>> [jody@plankton neander]$ cat th_02
>>> nano_00.uzh.ch  slots=2 max-slots=2
>>> nano_02.uzh.ch  slots=2 max-slots=2
>>>
>>> and the rankfile is:
>>> [jody@plankton neander]$ cat rf_02
>>> rank  0=nano_00.uzh.ch  slot=0
>>> rank  2=nano_00.uzh.ch  slot=1
>>> rank  1=nano_02.uzh.ch  slot=0
>>> rank  3=nano_02.uzh.ch  slot=1
>>>
>>> Here is the error:
>>> [jody@plankton neander]$ mpirun -np 4 -hostfile th_02  -rf rf_02
>>> ./HelloMPI
>>>
>>> --
>>> There are not enough slots available in the system to satisfy the 4 slots
>>> that were requested by the application:
>>>   ./HelloMPI
>>>
>>> Either request fewer slots for your application, or make more slots
>>> available
>>> for use.
>>>
>>>
>>> --
>>>
>>> --
>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>>> the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>>
>>> --
>>>
>>> --
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --
>>> mpirun: clean termination accomplished
>>>
>>> If i use a hostfile with one more entry
>>> [jody@aim-plankton neander]$ cat th_021
>>> aim-nano_00.uzh.ch  slots=2 max-slots=2
>>> aim-nano_02.uzh.ch  slots=2 max-slots=2
>>> aim-nano_01.uzh.ch  slots=1 max-slots=1
>>>
>>> Then this works fine:
>>> [jody@aim-plankton neander]$ mpirun -np 4 -hostfile th_021  -rf rf_02
>>> ./HelloMPI
>>>
>>> Is there an explanation for this?
>>>
>>> Thank You
>>>  Jody
>>>
>>>> Lenny.
>>>> On Mon, Aug 17, 2009 at 12:59 PM, jody  wrote:
>>

Re: [OMPI users] rank file error: Rankfile claimed...

2009-08-17 Thread jody

Hi Lenny

> I think it has something to do with your environment,  /etc/hosts, IT setup,
> hostname function return value, etc.
> I am not sure if it has something to do with Open MPI at all.

OK. I just thought this was Open MPI related because i was able to use the
 aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in
the host file...

However, I encountered a new problem:
if the rankfile lists all the entries which occur in the host file
there is an error message.
In the following example, the hostfile is
[jody@plankton neander]$ cat th_02
nano_00.uzh.ch  slots=2 max-slots=2
nano_02.uzh.ch  slots=2 max-slots=2

and the rankfile is:
[jody@plankton neander]$ cat rf_02
rank  0=nano_00.uzh.ch  slot=0
rank  2=nano_00.uzh.ch  slot=1
rank  1=nano_02.uzh.ch  slot=0
rank  3=nano_02.uzh.ch  slot=1

Here is the error:
[jody@plankton neander]$ mpirun -np 4 -hostfile th_02  -rf rf_02 ./HelloMPI
--
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
./HelloMPI

Either request fewer slots for your application, or make more slots available
for use.

--
--
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished

If i use a hostfile with one more entry
[jody@aim-plankton neander]$ cat th_021
aim-nano_00.uzh.ch  slots=2 max-slots=2
aim-nano_02.uzh.ch  slots=2 max-slots=2
aim-nano_01.uzh.ch  slots=1 max-slots=1

Then this works fine:
[jody@aim-plankton neander]$ mpirun -np 4 -hostfile th_021  -rf rf_02 ./HelloMPI

Is there an explanation for this?

Thank You
  Jody

> Lenny.
> On Mon, Aug 17, 2009 at 12:59 PM, jody  wrote:
>>
>> Hi Lenny
>>
>> Thanks - using the full names makes it work!
>> Is there a reason why the rankfile option treats
>> host names differently than the hostfile option?
>>
>> Thanks
>>  Jody
>>
>>
>>
>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny
>> Verkhovsky wrote:
>> > Hi
>> > This message means
>> > that you are trying to use host "plankton", that was not allocated via
>> > hostfile or hostlist.
>> > But according to the files and command line, everything seems fine.
>> > Can you try using "plankton.uzh.ch" hostname instead of "plankton".
>> > thanks
>> > Lenny.
>> > On Mon, Aug 17, 2009 at 10:36 AM, jody  wrote:
>> >>
>> >> Hi
>> >>
>> >> When i use a rankfile, i get an error message which i don't understand:
>> >>
>> >> [jody@plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts
>> >> ./HelloMPI
>> >>
>> >> --
>> >> Rankfile claimed host plankton that was not allocated or
>> >> oversubscribed it's slots:
>> >>
>> >>
>> >> --
>> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >> file rmaps_rank_file.c at line 108
>> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >> file base/rmaps_base_map_job.c at line 87
>> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >> file base/plm_base_launch_support.c at line 77
>> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >> file plm_rsh_module.c at line 990
>> >>
>> >> --
>> >> A daemon (pid unknown) died unexpectedly on signal 1  while attempting
>> >> to
>> >> launch so we are aborting.
>> >>
>> >> There may be more information reported by the envir
