Re: [OMPI users] mpirun gives error when option '--hostfiles' or '--hosts' is used
Actually all machines use iptables as firewall. I compared the rules triops and kraken use and found that triops had the line REJECT all -- anywhere anywhere reject-with icmp-host-prohibited which kraken did not have (otherwise they were identical). I removed that line from triops' rules, restarted iptables and now communication works in all directions! Thank You Jody On Tue, May 3, 2016 at 7:00 PM, Jeff Squyres (jsquyres) wrote: > Have you disabled firewalls between these machines? > > > On May 3, 2016, at 11:26 AM, jody wrote: > > > > ...my bad! > > > > I had set up things so that PATH and LD_LIBRARY_PATH were correct in > interactive mode, > > but they were wrong ssh was called non-interactively. > > > > Now i have a new problem: > > When i do > > mpirun -np 6 --hostfile krakenhosts hostname > > from triops, sometimes it seems to hang (i.e. no output, doesn't end) > > and at other time i get the ouput > > > > [aim-kraken:24527] [[45056,0],1] tcp_peer_send_blocking: send() to > socket 9 failed: Broken pipe (32) > > > -- > > ORTE was unable to reliably start one or more daemons. > > This usually is caused by: > > ... > > > -- > > - > > Again, i can call mpirun on triops from kraken und all squid_XX without > a problem... > > > > What could cause this problem? > > > > Thank You > > Jody > > > > > > On Tue, May 3, 2016 at 2:54 PM, Jeff Squyres (jsquyres) < > jsquy...@cisco.com> wrote: > > Have you verified that you are running the same version of Open MPI on > both servers when launched from non-interactive logins? > > > > This kind of error is somewhat typical if you accidentally mixed, for > example, Open MPI v1.6.x and v1.10.2 (i.e., v1.10.2 understands the > --hnp-topo-sig back end option, but v1.6.x does not). > > > > > > > On May 3, 2016, at 6:35 AM, jody wrote: > > > > > > Hi > > > I have installed Open MPI v 1.10.2 on two machines today using only > the prefix-option for configure, and then doing 'make all install'. 
> > > > > > On both machines i changed .bashrc to set PATH and LD_LIBRARY_PATH > correctly. > > > (I checked by running 'mpirun --version' and verifying that the output > does indeed say 1.10.2) > > > > > > Password-less ssh is enabled on both machines in both directions. > > > > > > When i start mpirun form one machine (kraken) with a hostfile > specifying the other machine ("triops slots=8 max-slots=8), > > > it works: > > > - > > > jody@kraken ~ $ mpirun -np 3 --hostfile triopshosts uptime > > > 12:24:04 up 7 days, 43 min, 17 users, load average: 0.06, 0.68, 0.65 > > > 12:24:04 up 7 days, 43 min, 17 users, load average: 0.06, 0.68, 0.65 > > > 12:24:04 up 7 days, 43 min, 17 users, load average: 0.06, 0.68, 0.65 > > > - > > > > > > But when i start mpirun form triops with a hostfile specifying kraken > ("kraken slots=8 max-slots=8"), > > > it fails: > > > - > > > jody@triops ~ $ mpirun -np 3 --hostfile krakenhosts hostname > > > [aim-kraken:21973] Error: unknown option "--hnp-topo-sig" > > > input in flex scanner failed > > > > -- > > > ORTE was unable to reliably start one or more daemons. > > > This usually is caused by: > > > > > > * not finding the required libraries and/or binaries on > > > one or more nodes. Please check your PATH and LD_LIBRARY_PATH > > > settings, or configure OMPI with --enable-orterun-prefix-by-default > > > > > > * lack of authority to execute on one or more specified nodes. > > > Please verify your allocation and authorities. > > > > > > * the inability to write startup files into /tmp > (--tmpdir/orte_tmpdir_base). > > > Please check with your sys admin to determine the correct location > to use. > > > > > > * compilation of the orted with dynamic libraries when static are > required > > > (e.g., on Cray). Please check your configure cmd line and consider > using > > > one of the contrib/platform definitions for your system type. 
> > > > > > * an inability to create a connection back to mpirun due to a > > > lack of common network interface
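The comparison described at the top of this message (looking for iptables rules that one host has and the other does not) can be sketched as below. This is a minimal sketch, not the exact procedure used in the thread: it assumes the rules of each host were dumped to a file with `iptables -S` (run as root), and the helper name `rules_only_in` and the file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the rules that appear in the first dump
# but not in the second. Both arguments are files written by `iptables -S`.
rules_only_in() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from a file
    grep -Fxv -f "$2" "$1"
}

# Usage (assumed file names):
#   ssh root@triops iptables -S > triops.rules
#   ssh root@kraken iptables -S > kraken.rules
#   rules_only_in triops.rules kraken.rules
```

A narrower alternative to deleting a REJECT rule outright would be to accept traffic from the other cluster nodes ahead of it; that is a suggestion, not something that was tried in this thread.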
Re: [OMPI users] mpirun gives error when option '--hostfiles' or '--hosts' is used
...my bad! I had set up things so that PATH and LD_LIBRARY_PATH were correct in interactive mode, but they were wrong ssh was called non-interactively. Now i have a new problem: When i do mpirun -np 6 --hostfile krakenhosts hostname from triops, sometimes it seems to hang (i.e. no output, doesn't end) and at other time i get the ouput [aim-kraken:24527] [[45056,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32) -- ORTE was unable to reliably start one or more daemons. This usually is caused by: ... -- - Again, i can call mpirun on triops from kraken und all squid_XX without a problem... What could cause this problem? Thank You Jody On Tue, May 3, 2016 at 2:54 PM, Jeff Squyres (jsquyres) wrote: > Have you verified that you are running the same version of Open MPI on > both servers when launched from non-interactive logins? > > This kind of error is somewhat typical if you accidentally mixed, for > example, Open MPI v1.6.x and v1.10.2 (i.e., v1.10.2 understands the > --hnp-topo-sig back end option, but v1.6.x does not). > > > > On May 3, 2016, at 6:35 AM, jody wrote: > > > > Hi > > I have installed Open MPI v 1.10.2 on two machines today using only the > prefix-option for configure, and then doing 'make all install'. > > > > On both machines i changed .bashrc to set PATH and LD_LIBRARY_PATH > correctly. > > (I checked by running 'mpirun --version' and verifying that the output > does indeed say 1.10.2) > > > > Password-less ssh is enabled on both machines in both directions. 
> > > > When i start mpirun form one machine (kraken) with a hostfile specifying > the other machine ("triops slots=8 max-slots=8), > > it works: > > - > > jody@kraken ~ $ mpirun -np 3 --hostfile triopshosts uptime > > 12:24:04 up 7 days, 43 min, 17 users, load average: 0.06, 0.68, 0.65 > > 12:24:04 up 7 days, 43 min, 17 users, load average: 0.06, 0.68, 0.65 > > 12:24:04 up 7 days, 43 min, 17 users, load average: 0.06, 0.68, 0.65 > > - > > > > But when i start mpirun form triops with a hostfile specifying kraken > ("kraken slots=8 max-slots=8"), > > it fails: > > - > > jody@triops ~ $ mpirun -np 3 --hostfile krakenhosts hostname > > [aim-kraken:21973] Error: unknown option "--hnp-topo-sig" > > input in flex scanner failed > > > -- > > ORTE was unable to reliably start one or more daemons. > > This usually is caused by: > > > > * not finding the required libraries and/or binaries on > > one or more nodes. Please check your PATH and LD_LIBRARY_PATH > > settings, or configure OMPI with --enable-orterun-prefix-by-default > > > > * lack of authority to execute on one or more specified nodes. > > Please verify your allocation and authorities. > > > > * the inability to write startup files into /tmp > (--tmpdir/orte_tmpdir_base). > > Please check with your sys admin to determine the correct location to > use. > > > > * compilation of the orted with dynamic libraries when static are > required > > (e.g., on Cray). Please check your configure cmd line and consider > using > > one of the contrib/platform definitions for your system type. > > > > * an inability to create a connection back to mpirun due to a > > lack of common network interfaces and/or no route found between > > them. Please check network connectivity (including firewalls > > and network routing requirements). > > > -- > > > > The same error happens when i use '--host kraken'. > > > > I verified that PATH and LD_LIBRARY_PATH are correctly set on both > machines. 
> > And on both machines /tmp is readable, writeable and executable for all. > > The connection should be okay (i can do a ssh from kraken to triops and > vice versa). > > > > Any idea what the problem is? > > > > Thank You > > Jody > > > > ___ > > users mailing list > > us...@open-mpi.org > > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users > > Link to this post: > http://www.open-mpi.org/community/lists/users/2016/05/29074.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > users mailing list > us...@open-mpi.org > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2016/05/29075.php >
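The difference between interactive and non-interactive logins that caused the first error can be checked directly, because mpirun starts its remote daemons through a non-interactive ssh command. The following is a minimal sketch; the helper name `remote_mpi_env` is made up, and it assumes password-less ssh to the host as described in the thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show what a NON-interactive ssh command sees on a host.
# The single quotes keep $PATH and $LD_LIBRARY_PATH unexpanded locally,
# so they are evaluated on the remote side.
remote_mpi_env() {
    ssh "$1" 'echo "PATH=$PATH"; echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"; command -v mpirun; mpirun --version 2>&1 | head -n 1'
}

# Usage:
#   remote_mpi_env kraken
#   remote_mpi_env triops
```

If the reported version or path differs from what an interactive login shows, one possible cause (an assumption about the setup, not confirmed in the thread) is a .bashrc that returns early for non-interactive shells before PATH and LD_LIBRARY_PATH are set.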
[OMPI users] mpirun gives error when option '--hostfiles' or '--hosts' is used
Hi

I have installed Open MPI v1.10.2 on two machines today, using only the prefix option for configure and then doing 'make all install'.

On both machines i changed .bashrc to set PATH and LD_LIBRARY_PATH correctly.
(I checked by running 'mpirun --version' and verifying that the output does indeed say 1.10.2.)

Password-less ssh is enabled on both machines in both directions.

When i start mpirun from one machine (kraken) with a hostfile specifying the other machine ("triops slots=8 max-slots=8"), it works:
-
jody@kraken ~ $ mpirun -np 3 --hostfile triopshosts uptime
 12:24:04 up 7 days, 43 min, 17 users, load average: 0.06, 0.68, 0.65
 12:24:04 up 7 days, 43 min, 17 users, load average: 0.06, 0.68, 0.65
 12:24:04 up 7 days, 43 min, 17 users, load average: 0.06, 0.68, 0.65
-

But when i start mpirun from triops with a hostfile specifying kraken ("kraken slots=8 max-slots=8"), it fails:
-
jody@triops ~ $ mpirun -np 3 --hostfile krakenhosts hostname
[aim-kraken:21973] Error: unknown option "--hnp-topo-sig"
input in flex scanner failed
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--

The same error happens when i use '--host kraken'.

I verified that PATH and LD_LIBRARY_PATH are correctly set on both machines.
And on both machines /tmp is readable, writeable and executable for all.
The connection should be okay (i can ssh from kraken to triops and vice versa).

Any idea what the problem is?

Thank You
Jody
Re: [OMPI users] run a program
Hi Raha

Yes, that is correct. You have to make sure that max-slots is less than or equal to the number of cpus in the node to avoid oversubscribing it.

Have a look at the other entries in the FAQ; they give information on many other options you can use:
  http://www.open-mpi.org/faq/?category=running

Jody

On Wed, Feb 26, 2014 at 10:38 AM, raha khalili wrote:
> Dear Jody
>
> Thank you for your reply. Based on the hostfile examples you showed me, I understand that 'slots' is the number of cpus of each node I mention in the file. Is that right?
>
> Wishes
>
> On Wed, Feb 26, 2014 at 1:02 PM, jody wrote:
>> Hi
>> I think you should use the "--host" or "--hostfile" options:
>>   http://www.open-mpi.org/faq/?category=running#simple-spmd-run
>>   http://www.open-mpi.org/faq/?category=running#mpirun-host
>> Hope this helps
>> Jody
>>
>> On Wed, Feb 26, 2014 at 8:31 AM, raha khalili wrote:
>>> Dear Users
>>>
>>> This is my first post in the open-mpi forum and I am a beginner in using mpi.
>>> I want to run a program across 4 systems, consisting of one server and three nodes, with 20 cpus. When I run:
>>>   mpirun -np 20 /home/khalili/espresso-5.0.2/bin/pw.x -in si.in | tee si.out
>>> and then check htop from a terminal, it seems the program doesn't use the cpus of the three other nodes and just uses the cpus of the server. Could you please tell me how I can use all my cpus?
>>>
>>> Regards
>>> --
>>> Khadije Khalili
>>> Ph.D Student of Solid-State Physics
>>> Department of Physics
>>> University of Mazandaran
>>> Babolsar, Iran
>>> kh.khal...@stu.umz.ac.ir
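To make the relation between 'slots' and the -np value concrete, here is a small sketch that sums the slots= entries of a hostfile so that -np can be chosen without exceeding them. The helper name `total_slots`, the host names and the slot counts are assumptions for illustration; only the hostfile syntax (`host slots=N max-slots=M`) is taken from the messages in this archive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add up the slots= values of all hosts in a hostfile.
total_slots() {
    awk '
        /^[[:space:]]*#/ { next }              # skip comment lines
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=/) {          # "max-slots=" does not match this pattern
                    split($i, kv, "=")
                    sum += kv[2]
                }
        }
        END { print sum + 0 }
    ' "$1"
}

# Example hostfile "myhosts" (made-up names, 5 cpus per machine assumed):
#   server slots=5 max-slots=5
#   node1  slots=5 max-slots=5
#   node2  slots=5 max-slots=5
#   node3  slots=5 max-slots=5
#
# Usage:
#   mpirun -np "$(total_slots myhosts)" --hostfile myhosts \
#       /home/khalili/espresso-5.0.2/bin/pw.x -in si.in | tee si.out
```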
Re: [OMPI users] run a program
Hi

I think you should use the "--host" or "--hostfile" options:
  http://www.open-mpi.org/faq/?category=running#simple-spmd-run
  http://www.open-mpi.org/faq/?category=running#mpirun-host

Hope this helps
Jody

On Wed, Feb 26, 2014 at 8:31 AM, raha khalili wrote:
> Dear Users
>
> This is my first post in the open-mpi forum and I am a beginner in using mpi.
> I want to run a program across 4 systems, consisting of one server and three nodes, with 20 cpus. When I run:
>   mpirun -np 20 /home/khalili/espresso-5.0.2/bin/pw.x -in si.in | tee si.out
> and then check htop from a terminal, it seems the program doesn't use the cpus of the three other nodes and just uses the cpus of the server. Could you please tell me how I can use all my cpus?
>
> Regards
> --
> Khadije Khalili
> Ph.D Student of Solid-State Physics
> Department of Physics
> University of Mazandaran
> Babolsar, Iran
> kh.khal...@stu.umz.ac.ir
Re: [OMPI users] MPI send recv confusion
Hi Pradeep

I am not sure if this is the reason, but it is usually a bad idea to force an order of receives (as you do in your receive loop: first from sender 1, then from sender 2, then from sender 3). Unless you implement it so, there is no guarantee that the sends are performed in this order.

It is better to accept messages from all senders (MPI_ANY_SOURCE) instead of from particular ranks, and then check where each message came from by examining the status fields (http://www.mpi-forum.org/docs/mpi22-report/node47.htm).

Hope this helps
Jody

On Mon, Feb 18, 2013 at 5:06 PM, Pradeep Jha wrote:
> I have attached a sample of the MPI program I am trying to write. When I run this program using "mpirun -np 4 a.out", my output is:
>
> Sender:1
> Data received from1
> Sender:2
> Data received from1
> Sender:2
>
> And the run hangs there. I don't understand why the "sender" variable changes its value after MPI_recv. Any ideas?
>
> Thank you,
>
> Pradeep
>
>
>       program mpi_test
>
>       include 'mpif.h'
>
> !( Initialize variables )
>       integer, dimension(3) :: recv, send
>
>       integer :: sender, np, rank, ierror
>
>       call mpi_init( ierror )
>       call mpi_comm_rank( mpi_comm_world, rank, ierror )
>       call mpi_comm_size( mpi_comm_world, np, ierror )
>
> !( Main program )
>
> ! receive the data from the other processors
>       if (rank.eq.0) then
>          do sender = 1, np-1
>             print *, "Sender: ", sender
>             call mpi_recv(recv, 3, mpi_int, sender, 1,
>      &           mpi_comm_world, status, ierror)
>             print *, "Data received from ",sender
>          end do
>       end if
>
> ! send the data to the main processor
>       if (rank.ne.0) then
>          send(1) = 3
>          send(2) = 4
>          send(3) = 4
>          call mpi_send(send, 3, mpi_int, 0, 1, mpi_comm_world, ierr)
>       end if
>
>
> !( clean up )
>       call mpi_finalize(ierror)
>
>       return
>       end program mpi_test
Re: [OMPI users] mpi job is blocked
Hi Richard When a collective call hangs, this usually means that one (or more) processes did not reach this command. Are you sure that all processes reach the allreduce statement? If something like this happens to me, i insert print statements just before the MPI-call so i can see which processes made it to this point and which ones did not. Hope this helps a bit Jody On Tue, Sep 25, 2012 at 8:20 AM, Richard wrote: > I have 3 computers with the same Linux system. I have setup the mpi cluster > based on ssh connection. > I have tested a very simple mpi program, it works on the cluster. > > To make my story clear, I name the three computer as A, B and C. > > 1) If I run the job with 2 processes on A and B, it works. > 2) if I run the job with 3 processes on A, B and C, it is blocked. > 3) if I run the job with 2 processes on A and C, it works. > 4) If I run the job with all the 3 processes on A, it works. > > Using gdb I found the line at which it is blocked, it is here > > #7 0x2ad8a283043e in PMPI_Allreduce (sendbuf=0x7fff09c7c578, > recvbuf=0x7fff09c7c570, count=1, datatype=0x627180, op=0x627780, > comm=0x627380) > at pallreduce.c:105 > 105 err = comm->c_coll.coll_allreduce(sendbuf, recvbuf, count, > > It seems that there is a communication problem between some computers. But > the above series of test cannot tell me what > exactly it is. Can anyone help me? thanks. > > Richard > > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] deprecated MCA parameter
Thanks Ralph

I renamed the parameter in my script, and now there are no more ugly messages :)

Jody

On Tue, Aug 28, 2012 at 3:17 PM, Ralph Castain wrote:
> Ah, I see - yeah, the parameter technically is being renamed to "orte_rsh_agent" to avoid having users need to know the internal topology of the code base (i.e., that it is in the plm framework and the rsh component).
> It will always be there, though - only the name is changing to protect the innocent. :-)
>
> On Aug 28, 2012, at 6:07 AM, jody wrote:
>
>> Hi Ralph
>>
>> I get one of these messages
>> --
>> A deprecated MCA parameter value was specified in the environment or
>> on the command line. Deprecated MCA parameters should be avoided;
>> they may disappear in future releases.
>>
>> Deprecated parameter: plm_rsh_agent
>> --
>> for every process that starts...
>>
>> My openmpi version is 1.6 (gentoo package sys-cluster/openmpi-1.6-r1)
>>
>> jody
>>
>> On Tue, Aug 28, 2012 at 2:38 PM, Ralph Castain wrote:
>>> Guess I'm confused - what is the issue here? The param still exists:
>>>
>>> MCA plm: parameter "plm_rsh_agent" (current value: <ssh : rsh>, data source: default value, synonyms: pls_rsh_agent, orte_rsh_agent)
>>>          The command used to launch executables on remote nodes (typically either "ssh" or "rsh")
>>>
>>> I am unaware of any plans to deprecate it. Is there a problem with it?
>>>
>>> On Aug 28, 2012, at 2:24 AM, jody wrote:
>>>
>>>> Hi
>>>>
>>>> In order to open an xterm for each of my processes i use the MCA parameter 'plm_rsh_agent' like this:
>>>>   mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 -mca plm_rsh_agent "ssh -Y" --leave-session-attached xterm -hold -e ./MPIProg
>>>>
>>>> Without the option '-mca plm_rsh_agent "ssh -Y"' i can't open windows from the remote:
>>>>
>>>> jody@boss /mnt/data1/neander $ mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 --leave-session-attached xterm -hold -e ./MPIStruct
>>>> xterm: Xt error: Can't open display:
>>>> xterm: DISPLAY is not set
>>>> xterm: Xt error: Can't open display:
>>>> xterm: DISPLAY is not set
>>>> xterm: Xt error: Can't open display:
>>>> xterm: DISPLAY is not set
>>>> xterm: Xt error: Can't open display:
>>>> xterm: DISPLAY is not set
>>>> xterm: Xt error: Can't open display:
>>>> xterm: DISPLAY is not set
>>>> --
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --
>>>>
>>>> Is there some replacement for this parameter, or how else can i get mpi to use "ssh -Y" for its connections?
>>>>
>>>> Thank You
>>>> jody
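The rename mentioned in this message amounts to swapping one parameter name in the launch script. A minimal sketch follows; the helper name `rename_rsh_agent` and the script names are made up, and the name "orte_rsh_agent" is the one given by Ralph for the 1.6 series discussed here. Which name counts as current has differed between release series, so `ompi_info --param plm rsh` on the installed version is the thing to check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the deprecated parameter name in a script read from stdin.
rename_rsh_agent() {
    sed 's/plm_rsh_agent/orte_rsh_agent/g'
}

# Usage (assumed script names):
#   rename_rsh_agent < run_xterm.sh > run_xterm_new.sh
#
# The command from the original question would then read:
#   mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 \
#       -mca orte_rsh_agent "ssh -Y" --leave-session-attached xterm -hold -e ./MPIProg
```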
Re: [OMPI users] deprecated MCA parameter
Hi Ralph

I get one of these messages
--
A deprecated MCA parameter value was specified in the environment or
on the command line. Deprecated MCA parameters should be avoided;
they may disappear in future releases.

Deprecated parameter: plm_rsh_agent
--
for every process that starts...

My openmpi version is 1.6 (gentoo package sys-cluster/openmpi-1.6-r1)

jody

On Tue, Aug 28, 2012 at 2:38 PM, Ralph Castain wrote:
> Guess I'm confused - what is the issue here? The param still exists:
>
> MCA plm: parameter "plm_rsh_agent" (current value: <ssh : rsh>, data source: default value, synonyms: pls_rsh_agent, orte_rsh_agent)
>          The command used to launch executables on remote nodes (typically either "ssh" or "rsh")
>
> I am unaware of any plans to deprecate it. Is there a problem with it?
>
> On Aug 28, 2012, at 2:24 AM, jody wrote:
>
>> Hi
>>
>> In order to open an xterm for each of my processes i use the MCA parameter 'plm_rsh_agent' like this:
>>   mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 -mca plm_rsh_agent "ssh -Y" --leave-session-attached xterm -hold -e ./MPIProg
>>
>> Without the option '-mca plm_rsh_agent "ssh -Y"' i can't open windows from the remote:
>>
>> jody@boss /mnt/data1/neander $ mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 --leave-session-attached xterm -hold -e ./MPIStruct
>> xterm: Xt error: Can't open display:
>> xterm: DISPLAY is not set
>> xterm: Xt error: Can't open display:
>> xterm: DISPLAY is not set
>> xterm: Xt error: Can't open display:
>> xterm: DISPLAY is not set
>> xterm: Xt error: Can't open display:
>> xterm: DISPLAY is not set
>> xterm: Xt error: Can't open display:
>> xterm: DISPLAY is not set
>> --
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --
>>
>> Is there some replacement for this parameter, or how else can i get mpi to use "ssh -Y" for its connections?
>>
>> Thank You
>> jody
[OMPI users] deprecated MCA parameter
Hi

In order to open an xterm for each of my processes i use the MCA parameter 'plm_rsh_agent' like this:
  mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 -mca plm_rsh_agent "ssh -Y" --leave-session-attached xterm -hold -e ./MPIProg

Without the option '-mca plm_rsh_agent "ssh -Y"' i can't open windows from the remote:

jody@boss /mnt/data1/neander $ mpirun -np 5 -hostfile allhosts -mca plm_base_verbose 1 --leave-session-attached xterm -hold -e ./MPIStruct
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
xterm: Xt error: Can't open display:
xterm: DISPLAY is not set
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

Is there some replacement for this parameter, or how else can i get mpi to use "ssh -Y" for its connections?

Thank You
jody
Re: [OMPI users] MPI_Irecv: Confusion with <> inputy parameter
Hi Devendra

MPI has no way of knowing how big your receive buffer is. That is why you have to pass the "count" argument: it tells MPI how many items of your data type (in your case, bytes) it may copy into your receive buffer.

When data arrives that is longer than the number you specified in the "count" argument, the data will be cut off after count bytes (and an error will be returned). Any shorter amount of data will be copied to your receive buffer, and the call to MPI_Recv will terminate successfully.

It is your responsibility to pass the correct value of "count". If you expect data of 160 bytes, you have to allocate a buffer with a size greater than or equal to 160, and you have to set your "count" parameter to the size you allocated.

If you want to receive data in chunks, you have to send it in chunks.

I hope this helps
Jody

On Tue, Aug 21, 2012 at 10:01 AM, devendra rai wrote:
> Hello Jeff and Hristo,
>
> Now I am completely confused:
>
> So, let's say, the complete reception requires 8192 bytes. And, I have:
>
> MPI_Irecv(
>     (void*)this->receivebuffer,   /* the receive buffer */
>     this->receive_packetsize,     /* 80 */
>     MPI_BYTE,                     /* The data type expected */
>     this->transmittingnode,       /* The node from which to receive */
>     this->uniquetag,              /* Tag */
>     MPI_COMM_WORLD,               /* Communicator */
>     &Irecv_request                /* request handle */
> );
>
> That means that MPI_Test will tell me that the reception is complete when I have received the first 80 bytes. Correct?
>
> Next, let's say that I have a receive buffer with a capacity of 160 bytes. Will an overflow error occur here, even if I have decided to receive a large payload in chunks of 80 bytes?
>
> I am sorry, the manual and the API reference were too vague for me.
> > Thanks a lot > > Devendra > > From: "Iliev, Hristo" > To: Open MPI Users > Cc: devendra rai > Sent: Tuesday, 21 August 2012, 9:48 > Subject: Re: [OMPI users] MPI_Irecv: Confusion with <> inputy > parameter > > Jeff, > >>> Or is it the number of elements that are expected to be received, and >>> hence MPI_Test will tell me that the receive is not complete untill "count" >>> number of elements have not been received? >> >> Yes. > > Answering "Yes" this question might further the confusion there. The "count" > argument specifies the *capacity* of the receive buffer and the receive > operation (blocking or not) will complete successfully for any matching > message with size up to "count", even for an empty message with 0 elements, > and will produce an overflow error if the received message was longer and > data truncation has to occur. > > On 20.08.2012, at 16:32, Jeff Squyres wrote: > >> On Aug 20, 2012, at 5:51 AM, devendra rai wrote: >> >>> Is it the number of elements that have been received *thus far* in the >>> buffer? >> >> No. >> >>> Or is it the number of elements that are expected to be received, and >>> hence MPI_Test will tell me that the receive is not complete untill "count" >>> number of elements have not been received? >> >> Yes. >> >>> Here's the reason why I have a problem (and I think I may be completely >>> stupid here, I'd appreciate your patience): >> [snip] >>> Does anyone see what could be going wrong? >> >> Double check that the (sender_rank, tag, communicator) tuple that you >> issued in the MPI_Irecv matches the (rank, tag, communicator) tuple from the >> sender (tag and communicator are arguments on the sending side, and rank is >> the rank of the sender in that communicator). >> >> When receives block like this without completing like this, it usually >> means a mismatch between the tuples. 
>> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- > Hristo Iliev, Ph.D. -- High Performance Computing, > RWTH Aachen University, Center for Computing and Communication > Seffenter Weg 23, D 52074 Aachen (Germany) > Tel: +49 241 80 24367 -- Fax/UMS: +49 241 80 624367 > > > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Sharing (not copying) data with OpenMPI?
Hi Thank You all for your replies. I'll certainly look into the MPI 3.0 RMA link (out of pure interest) but i am afraid i can't go bleeding edge, because my application will also have to run on an other machine. As to OpenMP: i already make use of OpenMP in some places (for instance for the creation of the large data block), but unfortunately my main application is not well suited for OpenMP parallelization.. I guess i'll have to take more detailed look at my problem to see if i can restructure it in a good way... Thank You Jody On Mon, Apr 16, 2012 at 11:16 PM, Brian Austin wrote: > Maybe you meant to search for OpenMP instead of Open-MPI. > You can achieve something close to what you want by using OpenMP for on-node > parallelism and MPI for inter-node communication. > -Brian > > > > On Mon, Apr 16, 2012 at 11:02 AM, George Bosilca > wrote: >> >> No currently there is no way in MPI (and subsequently in Open MPI) to >> achieve this. However, in the next version of the MPI standard there will be >> a function allowing processes to shared a memory segment >> (https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/284). >> >> If you like living on the bleeding edge, you can try Brian's branch >> implementing the MPI 3.0 RMA operations (including the shared memory >> segment) from http://svn.open-mpi.org/svn/ompi/tmp-public/mpi3-onesided/. >> >> george. >> >> On Apr 16, 2012, at 09:52 , jody wrote: >> >> > Hi >> > >> > In my application i have to generate a large block of data (several >> > gigs) which subsequently has to be accessed by all processes (read >> > only), >> > Because of its size, it would take quite some time to serialize and >> > send the data to the different processes. Furthermore, i risk >> > running out of memory if this data is instantiated more than once on >> > one machine. >> > >> > Does OpenMPI offer some way of sharing data between processes (on the >> > same machine) without needing to send (and therefore copy) it? 
>> > >> > Or would i have to do this by means of creating shared memory, writing >> > to it, and then make it accessible for reading by the processes? >> > >> > Thank You >> > Jody >> > ___ >> > users mailing list >> > us...@open-mpi.org >> > http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] Sharing (not copying) data with OpenMPI?
Hi

In my application i have to generate a large block of data (several gigs) which subsequently has to be accessed by all processes (read only). Because of its size, it would take quite some time to serialize and send the data to the different processes. Furthermore, i risk running out of memory if this data is instantiated more than once on one machine.

Does OpenMPI offer some way of sharing data between processes (on the same machine) without needing to send (and therefore copy) it?

Or would i have to do this by creating shared memory, writing to it, and then making it accessible for reading by the processes?

Thank You
Jody
Re: [OMPI users] (no subject)
Hi

Did you run your program with mpirun? For example:
  mpirun -np 4 ./a.out

jody

On Fri, Mar 16, 2012 at 7:24 AM, harini.s .. wrote:
> Hi,
>
> I am very new to openMPI and I just installed openMPI 1.4.5 on a Linux platform. Now I am trying to run the examples in the folder that got downloaded. But when I run, I get this:
>
>>> a.out: error while loading shared libraries: libmpi.so.0: cannot open shared object file: No such file or directory
>
> I got a.out when I compiled hello_c.c using the mpicc command.
> Please help me to resolve this problem.
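The error quoted in this message comes from the dynamic loader, which cannot locate libmpi.so.0. One way to look into that, independent of how the program is launched, is sketched below. The helper name `missing_libs` and the install prefix `$HOME/openmpi` are assumptions for illustration; the actual prefix is whatever was passed to configure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list shared libraries that the loader cannot resolve for a binary.
# Prints nothing (and returns non-zero) when every library is found.
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}

# Usage, assuming Open MPI was installed under $HOME/openmpi (an assumed path):
#   missing_libs ./a.out
#   export PATH="$HOME/openmpi/bin:$PATH"
#   export LD_LIBRARY_PATH="$HOME/openmpi/lib:$LD_LIBRARY_PATH"
#   missing_libs ./a.out || echo "all libraries resolved"
#   mpirun -np 4 ./a.out
```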
[OMPI users] MPI_Intercomm_create hangs
Hi

I've got a really strange problem:

I've got an application which creates intercommunicators between a master and some workers.

When i run it on our cluster with 11 processes it works, when i run it with 12 processes it hangs inside MPI_Intercomm_create().

This is the hostfile:

  squid_0.uzh.ch slots=3 max-slots=3
  squid_1.uzh.ch slots=2 max-slots=2
  squid_2.uzh.ch slots=1 max-slots=1
  squid_3.uzh.ch slots=1 max-slots=1
  triops.uzh.ch slots=8 max-slots=8

Actually all squid_X have 4 cores, but i managed to reduce the number of processes needed for failure by making the above settings.

So with all available squid cores and 3 triops cores it works, but with 4 triops cores it hangs.

On the other hand, if i use all 16 squid cores (but no triops cores) it works, too.

If i start the application not from triops, but from another workstation, i have a similar pattern of Intercomm_create failures.

Note that with the above hostfile a simple HelloMPI works also with 14 or more processes.

The frustrating thing is that this exact same code has worked before!

Does anybody have an explanation?

Thank You

I managed to simplify the application:

  #include <stdio.h>
  #include "mpi.h"

  int main(int iArgC, char *apArgV[]) {
      int iResult = 0;
      int iNumProcs = 0;
      int iID = -1;

      MPI_Init(&iArgC, &apArgV);
      MPI_Comm_size(MPI_COMM_WORLD, &iNumProcs);
      MPI_Comm_rank(MPI_COMM_WORLD, &iID);

      /* rank 0 is the master (color 0), everybody else is a worker (color 1) */
      int iKey;
      if (iID == 0) {
          iKey = 0;
      } else {
          iKey = 1;
      }

      MPI_Comm commInter1;
      MPI_Comm commInter2;
      MPI_Comm commIntra;

      /* split the world into a master group and a worker group */
      MPI_Comm_split(MPI_COMM_WORLD, iKey, iID, &commIntra);

      int iRankM;
      MPI_Comm_rank(commIntra, &iRankM);
      printf("Local rank: %d\n", iRankM);

      /* connect the two groups; the remote leader is given as a rank in MPI_COMM_WORLD */
      switch (iKey) {
      case 0:
          printf("Creating intercomm 1 for Master (%d)\n", iID);
          MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 1, 01, &commInter2);
          break;
      case 1:
          printf("Creating intercomm 1 for FH (%d)\n", iID);
          MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 0, 01, &commInter1);
      }

      printf("finalizing\n");
      MPI_Finalize();

      printf("exiting with %d\n", iResult);
      return iResult;
  }
Re: [OMPI users] Passwordless ssh
Hi

You also must make sure that all slaves can connect via ssh to each other and to the master node without a password.

Jody

On Wed, Dec 21, 2011 at 3:57 AM, Shaandar Nyamtulga wrote:
> Can you clarify your answer please.
> I have one master node and other slave nodes. I created rsa key on my master
> node and copied it to all slaves.
> /home/mpiuser directory of all nodes are shared through NFS. The strange
> thing is why it requires password after I mount a slave and do ssh to the
> slave.
> When I dismount I can ssh without password.
>
> Date: Tue, 20 Dec 2011 10:45:12 +0100
> From: mathieu.westp...@obs.ujf-grenoble.fr
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Passwordless ssh
>
> Hello
>
> You have to copy nodeX public key at the end of nodeY authorized_keys.
>
> Mathieu
>
> On 20/12/2011 05:03, Shaandar Nyamtulga wrote:
>
> Hi
> I built Beowulf cluster using OpenMPI reading the following link.
> http://techtinkering.com/2009/12/02/setting-up-a-beowulf-cluster-using-open-mpi-on-linux/
> I can access my nodes without password before mounting my slaves.
> But when I mount my slaves and run a program, it asks again passwords.
>
> $ eval `ssh-agent`
>
> $ ssh-add ~/.ssh/id_dsa
>
> The above is not working. Terminal gives the reply "Could not open a
> connection to your authentication agent."
>
> Help is needed urgently.
>
> Thank you
Re: [OMPI users] Error using hostfile
Hi If your LD_LIBRARY_PATH is not set for a non-interactive startup, then successful runs on the remote machines may not be sufficient evidence. Check this FAQ http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path To see if your variables are set correctly for non-interactive sessions on your nodes, you can execute mpirun --hostfile hostfile -np 4 printenv and scan the output for PATH and LD_LIBRARY_PATH. Hope this helps Jody On Sat, Jul 9, 2011 at 12:25 AM, Mohan, Ashwin wrote: > Thanks Ralph. > > > > I have emailed the network admin on the firewall issue. > > > > About the PATH and LIBRARY PATH issue, is it sufficient evidence that the > path are set alright if I am able to compile and run successfully on > individual nodes mentioned in the machine file. > > > > Thanks, > Ashwin. > > > > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Ralph Castain > Sent: Friday, July 08, 2011 1:58 PM > To: Open MPI Users > Subject: Re: [OMPI users] Error using hostfile > > > > Is there a firewall in the way? The error indicates that daemons were > launched on the remote machines, but failed to communicate back. > > > > Also, check that your remote PATH and LD_LIBRARY_PATH are being set to the > right place to pickup this version of OMPI. Lots of systems deploy with > default versions that may not be compatible, so if you wind up running a > daemon on the remote node that comes from another installation, things won't > work. > > > > > > On Jul 8, 2011, at 10:52 AM, Mohan, Ashwin wrote: > > Hi, > > I am following up on a previous error posted. Based on the previous > recommendation, I did set up a password less SSH login. > > > > I created a hostfile comprising of 4 nodes (w/ each node having 4 slots). I > tried to run my job on 4 slots but get no output. Hence, I end up killing > the job. I am trying to run a simple MPI program on 4 nodes and trying to > figure out what could be the issue. 
What could I check to ensure that I can > run jobs on 4 nodes (each node has 4 slots) > > > > Here is the simple MPI program I am trying to execute on 4 nodes > > ** > > if (my_rank != 0) > > { > > sprintf(message, "Greetings from the process %d!", my_rank); > > dest = 0; > > MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, > MPI_COMM_WORLD); > > } > > else > > { > > for (source = 1;source < p; source++) > > { > > MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, > &status); > > printf("%s\n", message); > > } > > > > > > My hostfile looks like this: > > > > [amohan@myocyte48 ~]$ cat hostfile > > myocyte46 > > myocyte47 > > myocyte48 > > myocyte49 > > *** > > > > I use the following run command: : mpirun --hostfile hostfile -np 4 new46 > > And receive a blank screen. Hence, I have to kill the job. > > > > OUTPUT ON KILLING JOB: > > mpirun: killing job... > > -- > > mpirun noticed that the job aborted, but has no info as to the process > > that caused that situation. > > -- > > -- > > mpirun was unable to cleanly terminate the daemons on the nodes shown > > below. Additional manual cleanup may be required - please refer to > > the "orte-clean" tool for assistance. > > -- > > myocyte46 - daemon did not report back when launched > > myocyte47 - daemon did not report back when launched > > myocyte49 - daemon did not report back when launched > > > > Thanks, > > Ashwin. > > > > > > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On > Behalf Of Ralph Castain > Sent: Wednesday, July 06, 2011 6:46 PM > To: Open MPI Users > Subject: Re: [OMPI users] Error using hostfile > > > > Please see http://www.open-mpi.org/faq/?category=rsh#ssh-keys > > > > > > On Jul 6, 2011, at 5:09 PM, Mohan, Ashwin wrote: > > > Hi, > > > > I use the following command (mpirun --prefix /usr/local/openmpi1.4.3 -np 4 > hello) to successfully execute a simple hello world command on a single > node. Each node has 4 slots. 
Following the successful execution on one >
Re: [OMPI users] a question about mpirun
Hi It seems that you have mixed an "old" LAM-MPI installation with OpenMPI. To make sure your OpenMPI installation is ok you could try to use the complete path to mpirun: /data1/cluster/openmpi/bin/mpirun -np 1 /tmp/openmpi-1.4.3/examples/ring_c You should also make sure that the compile-command is the one of OpenMPI and not of LAM MPI. ( /data1/cluster/openmpi/bin/mpiCC or something like that) Check your PATH environment variable to make sure it doesn't contain any of the LAM MPI directories, and make sure you set the LD_LIBRARY_PATH variable correctly (see http://www.open-mpi.org/faq/?category=running#run-prereqs) Hope this helps Jody On Thu, Jul 7, 2011 at 8:44 AM, zhuangchao wrote: > hello all : > > I installed the openmpi-1.4.3 on redhat as the following step : > > 1. ./configure --prefix=/data1/cluster/openmpi > > 2. make > > 3. make install > > And I compiled the examples of openmpi-1.4.3 as the following > step : > > 1. make > > Then I run the example : > > ./mpirun -np 1 /tmp/openmpi-1.4.3/examples/ring_c > > I get the following error : > > - > It seems that there is no lamd running on the host node1. > This indicates that the LAM/MPI runtime environment is not operating. > The LAM/MPI runtime environment is necessary for MPI programs to run > (the MPI program tired to invoke the "MPI_Init" function). > > Please run the "lamboot" command the start the LAM/MPI runtime > environment. See the LAM/MPI documentation for how to invoke > "lamboot" across multiple machines. > - > > I run openmpi , but I get the error from lam-mpi . why ? > Can you help me ? > > Thank you ! > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
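
The PATH check suggested in the reply above (making sure no LAM/MPI directory is found before the Open MPI one) can be sketched in C as below. This is only an illustration: the function name is made up, and matching on a substring such as "lam" is a crude assumption that may also hit unrelated directories. The same function could be applied to LD_LIBRARY_PATH.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scan a colon-separated list (the value of PATH or
 * LD_LIBRARY_PATH) and copy the first entry containing `needle` into `out`.
 * Entries that do not fit into `out` are skipped.
 * Returns 1 if such an entry was found, 0 otherwise (out is then empty). */
int find_path_entry(const char *list, const char *needle,
                    char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    if (list == NULL || needle == NULL)
        return 0;

    const char *start = list;
    for (;;) {
        const char *end = strchr(start, ':');
        size_t len = end ? (size_t)(end - start) : strlen(start);
        if (len > 0 && len < outlen) {
            memcpy(out, start, len);
            out[len] = '\0';
            if (strstr(out, needle) != NULL)
                return 1;
        }
        if (end == NULL)
            break;
        start = end + 1; /* continue after the ':' */
    }
    out[0] = '\0';
    return 0;
}
```

A caller would pass getenv("PATH") as the list. To see what a non-interactive login gets, a small program doing this has to be started through mpirun or ssh on the remote node, since the interactive shell may set different variables.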
Re: [OMPI users] data types and alignment to word boundary
OOOps - i did not intend to cause any heart attacks =:) Perhaps my reaction was a bit exaggerated, but i spent quite some time to figure out why i didn't receive the same numbers i sent off And, after reading section 3.1 of the MPI complete reference i must say that i would have been warned if i had read that chapter more carefully... Fortunately, i don't have to send around a lot of these structs, so i will do the padding (using the offsetof macro Dave recommended). Thanks again Jody On Wed, Jun 29, 2011 at 9:52 PM, Gus Correa wrote: > Hi Jody > > jody wrote: >> >> Guys - Thank You for your replies! >> (wow : that was a rhyme! :) ) >> >> I checked my structure with the offsetof macro on my laptop at home >> and found the following offsets: >> offs iSpeciesID: 0 >> offs sCapacityFile: 2 >> offs adGParams: 68 >> total size 100 >> so there seems to be a 2 byte gap before the double array; >> and this machine seems to prefer multiples of 4. > > A 32-bit laptop perhaps? > I would guess the offsets are machine and compiler dependent, > and optimization flags may matter. > >> >> But is this alignment problem not also a danger for heterogeneous clusters >> using OpenMPI? > > Do you mean danger or excitement? :) > If the doubles and shorts and long longs have different sizes on > each of two heterogeneous nodes, what could MPI do about them anyway? > >> I guess the only portable solution is to forget about MPI Data types and >> somehow pack or serialize the data before sending and unpack/deserialize >> after receiving it. >> > > Jody: > Jeff may have a heart attack when he reads what you just wrote about > the usefulness of MPI data types vs. packing/unpacking. :) > > Guessing away, I would think you are focusing on memory/space savings, > rather than on performance. > Maybe memory/space savings is part of your code requirements. 
> > However, have you tried instead to explicitly pad your structure, > say, to a multiple of the size of your largest intrinsic type, > which double in your case, or perhaps to a multiple of the natural > memory alignment boundary that your computer/compiler likes (which may > be 8 bytes, 16 bytes, 128 bytes, whatever). > I never did this comparison, but I would guess the padded version > of the code would run faster (if compiled with '-align' type of flag > and friends). > > Anyway, C is a foreign language here, I must say. > > Just my unwarranted guesses. > > Gus Correa > >> >> >> On Wed, Jun 29, 2011 at 6:18 PM, Gus Correa wrote: >>> >>> jody wrote: >>>> >>>> Hi >>>> >>>> I have noticed on my machine that a struct which i have defined as >>>> >>>> typedef struct { >>>> short iSpeciesID; >>>> char sCapacityFile[SHORT_INPUT]; >>>> double adGParams[NUM_GPARAMS]; >>>> } tVStruct; >>>> >>>> (where SHORT_INPUT=64 and NUM_GPARAMS=4) >>>> >>>> has size 104 (instead of 98) whereas the corresponding MPI Datatype i >>>> created >>>> >>>> int aiLengthsT5[3] = {1, SHORT_INPUT, NUM_GPARAMS}; >>>> MPI_Aint aiDispsT5[3] = {0, iShortSize, iShortSize+SHORT_INPUT}; >>>> MPI_Datatype aTypesT5[3] = {MPI_UNSIGNED_SHORT, MPI_CHAR, MPI_DOUBLE}; >>>> MPI_Type_create_struct(3, aiLengthsT5, aiDispsT5, aTypesT5, >>>> &m_dtVegetationData3); >>>> MPI_Type_commit(&m_dtVegetationData3); >>>> >>>> only has length 98 (as expected). The size differences resulted in an >>>> error when doing >>>> >>>> tVegetationData3 VD; >>>> MPI_Send(&VD, 1, m_dtVegetationData3, 1, TAG_STEP_CMD, >>>> MPI_COMM_WORLD); >>>> >>>> and the corresponding >>>> >>>> tVegetationData3 VD; >>>> MPI_Recv(&VD, 1, m_dtVegetationData3, MPI_ANY_SOURCE, >>>> TAG_STEP_CMD, MPI_COMM_WORLD, &st); >>>> >>>> (in fact, the last double in my array was not transmitted correctly) >>>> >>>> It seems that on my machine the struct was padded to a multiple of 8. 
>>>> By manually adding some padding bytes to my MPI Datatype in order >>>> to fill it up to the next multiple of 8 i could work around this >>>> problem. >>>> (not very nice, and very probably not portable) >>>> >>>> >>>> My question: is there a way to tell MPI to automatically use the >>>&g
Re: [OMPI users] data types and alignment to word boundary
Guys - Thank You for your replies! (wow : that was a rhyme! :) ) I checked my structure with the offsetof macro on my laptop at home and found the following offsets: offs iSpeciesID: 0 offs sCapacityFile: 2 offs adGParams: 68 total size 100 so there seems to be a 2 byte gap before the double array; and this machine seems to prefer multiples of 4. But is this alignment problem not also a danger for heterogeneous clusters using OpenMPI? I guess the only portable solution is to forget about MPI Data types and somehow pack or serialize the data before sending and unpack/deserialize after receiving it. Jody On Wed, Jun 29, 2011 at 6:18 PM, Gus Correa wrote: > jody wrote: >> >> Hi >> >> I have noticed on my machine that a struct which i have defined as >> >> typedef struct { >> short iSpeciesID; >> char sCapacityFile[SHORT_INPUT]; >> double adGParams[NUM_GPARAMS]; >> } tVStruct; >> >> (where SHORT_INPUT=64 and NUM_GPARAMS=4) >> >> has size 104 (instead of 98) whereas the corresponding MPI Datatype i >> created >> >> int aiLengthsT5[3] = {1, SHORT_INPUT, NUM_GPARAMS}; >> MPI_Aint aiDispsT5[3] = {0, iShortSize, iShortSize+SHORT_INPUT}; >> MPI_Datatype aTypesT5[3] = {MPI_UNSIGNED_SHORT, MPI_CHAR, MPI_DOUBLE}; >> MPI_Type_create_struct(3, aiLengthsT5, aiDispsT5, aTypesT5, >> &m_dtVegetationData3); >> MPI_Type_commit(&m_dtVegetationData3); >> >> only has length 98 (as expected). The size differences resulted in an >> error when doing >> >> tVegetationData3 VD; >> MPI_Send(&VD, 1, m_dtVegetationData3, 1, TAG_STEP_CMD, MPI_COMM_WORLD); >> >> and the corresponding >> >> tVegetationData3 VD; >> MPI_Recv(&VD, 1, m_dtVegetationData3, MPI_ANY_SOURCE, >> TAG_STEP_CMD, MPI_COMM_WORLD, &st); >> >> (in fact, the last double in my array was not transmitted correctly) >> >> It seems that on my machine the struct was padded to a multiple of 8. >> By manually adding some padding bytes to my MPI Datatype in order >> to fill it up to the next multiple of 8 i could work around this problem. 
>> (not very nice, and very probably not portable) >> >> >> My question: is there a way to tell MPI to automatically use the >> required padding? >> >> >> Thank You >> Jody >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > Hi Jody > > My naive guesses: > > I think when you create the MPI structure you can pass the > byte displacement of each structure component. > You would need to modify your aiDispsT5[3], to match the > actual memory alignment, I guess. > Yes, indeed portability may be sacrificed. > > There is some clarification in "MPI, The Complete Reference, Vol 1, > 2nd Ed, Marc Snir et al.". > Section 3.2 and 3.3 (general on type map & type signature). > Section 3.4.8 MPI_Type_create_struct (examples, specially 3.13). > Section 3.10, on portability, doesn't seem to guarantee portability of > MPI_Type_Struct. > > I hope this helps, > Gus Correa > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
[OMPI users] data types and alignment to word boundary
Hi

I have noticed on my machine that a struct which i have defined as

  typedef struct {
      short  iSpeciesID;
      char   sCapacityFile[SHORT_INPUT];
      double adGParams[NUM_GPARAMS];
  } tVStruct;

(where SHORT_INPUT=64 and NUM_GPARAMS=4)

has size 104 (instead of 98) whereas the corresponding MPI Datatype i created

  int aiLengthsT5[3] = {1, SHORT_INPUT, NUM_GPARAMS};
  MPI_Aint aiDispsT5[3] = {0, iShortSize, iShortSize+SHORT_INPUT};
  MPI_Datatype aTypesT5[3] = {MPI_UNSIGNED_SHORT, MPI_CHAR, MPI_DOUBLE};
  MPI_Type_create_struct(3, aiLengthsT5, aiDispsT5, aTypesT5, &m_dtVegetationData3);
  MPI_Type_commit(&m_dtVegetationData3);

only has length 98 (as expected). The size differences resulted in an error when doing

  tVegetationData3 VD;
  MPI_Send(&VD, 1, m_dtVegetationData3, 1, TAG_STEP_CMD, MPI_COMM_WORLD);

and the corresponding

  tVegetationData3 VD;
  MPI_Recv(&VD, 1, m_dtVegetationData3, MPI_ANY_SOURCE, TAG_STEP_CMD, MPI_COMM_WORLD, &st);

(in fact, the last double in my array was not transmitted correctly)

It seems that on my machine the struct was padded to a multiple of 8.
By manually adding some padding bytes to my MPI Datatype in order to fill it up to the next multiple of 8 i could work around this problem.
(not very nice, and very probably not portable)

My question: is there a way to tell MPI to automatically use the required padding?

Thank You
Jody
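
For the struct in the question above, the displacements the compiler really uses can be obtained with the standard offsetof macro instead of adding up member sizes by hand. The sketch below is plain C and does not call MPI; the helper names are made up for illustration. The idea would be to copy the three values into the MPI_Aint displacement array passed to MPI_Type_create_struct, and, if several structs are sent as an array, to set the extent of the datatype to sizeof(tVegetationData3) with MPI_Type_create_resized so that the trailing padding is accounted for.

```c
#include <assert.h>
#include <stddef.h>

#define SHORT_INPUT 64
#define NUM_GPARAMS 4

typedef struct {
    short  iSpeciesID;
    char   sCapacityFile[SHORT_INPUT];
    double adGParams[NUM_GPARAMS];
} tVegetationData3;

/* Hypothetical helper: fill disps[0..2] with the byte offsets the compiler
 * actually uses for the three members, including any alignment gaps. */
void vegetation_displacements(size_t disps[3])
{
    assert(disps != NULL);
    disps[0] = offsetof(tVegetationData3, iSpeciesID);
    disps[1] = offsetof(tVegetationData3, sCapacityFile);
    disps[2] = offsetof(tVegetationData3, adGParams);
}

/* Hypothetical helper: number of padding bytes after the last member,
 * i.e. the difference between sizeof and the end of adGParams. */
size_t vegetation_trailing_padding(void)
{
    size_t end_of_last = offsetof(tVegetationData3, adGParams)
                         + NUM_GPARAMS * sizeof(double);
    return sizeof(tVegetationData3) - end_of_last;
}
```

With the offsets reported elsewhere in this thread (0, 2, 68 and a total size of 100 on one machine, a total size of 104 on another), the third displacement would come out as 68 or 72 rather than the hand-computed 66, which is the gap that caused the last double to arrive wrong.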
Re: [OMPI users] problems with the -xterm option
Launching xterm by mpirun onto a remote platform without a command simply opens a xterm-window which sits there until you type exit into it or close it by pressing on the frame's close button. (of course only if the display is forwarded to the local machine) On Mon, May 2, 2011 at 4:30 PM, Ralph Castain wrote: > > On May 2, 2011, at 8:21 AM, jody wrote: > >> Hi >> Well, the difference is that one time i call the application >> 'HelloMPI' with the '--xterm' option, >> whereas in my previous mail i am calling the application 'xterm' >> (without the '--xterm' option) > > Ah, well that might explain it. I don't know how xterm would react to just > being launched by mpirun onto a remote platform without any command to run. I > can't explain what the plm verbosity has to do with anything, though. > >> Jody >> >> On Mon, May 2, 2011 at 4:08 PM, Ralph Castain wrote: >>> >>> On May 2, 2011, at 7:56 AM, jody wrote: >>> >>>> Hi Ralph >>>> >>>> Thank You for doing the fix. >>>> >>>> Do you perhaps also have an idea what is going on when i try to start >>>> xterm (or probably an other X application) on a remote host? >>>> In this case it is not enough to specify the '--leave-session-attached' >>>> option. >>>> >>>> These calls won't open any xterms >>>> mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca >>>> plm_base_verbose 1 xterm >>>> mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" >>>> --leave-session-attached xterm >>>> mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca >>>> odls_base_verbose 5 xterm >>>> mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca >>>> odls_base_verbose 5 --leave-session-attached xterm >>>> >>>> But this will open the xterms: >>>> mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca >>>> plm_base_verbose 1 --leave-session-attached xterm >>>> >>>> Any verbosity level > 0 will open xterms, but with ' -mca >>>> plm_base_verbose 0' there are again no xterms. 
>>>> >>> >>> No earthly idea...this seems to contradict what you had below. You said you >>> were seeing the xterms with this cmd line: >>> >>>>>> I just found that everything works as expected if i use the the >>>>>> '--leave-session-attached' option (without the debug options): >>>>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca >>>>>> plm_rsh_agent "ssh -Y" --leave-session-attached --xterm 0,1,2,3! >>>>>> ./HelloMPI >>>>>> The xterms are also opened if i do not use the '!' hold option. >>>>> >>> >>> Did I miss something? >>> >>> >>>> Thank You >>>> Jody >>>> >>>> On Mon, May 2, 2011 at 2:29 PM, Ralph Castain wrote: >>>>> >>>>> On May 2, 2011, at 2:34 AM, jody wrote: >>>>> >>>>>> Hi Ralph >>>>>> >>>>>> I rebuilt open MPI 1.4.2 with the debug option on both chefli and >>>>>> squid_0. >>>>>> The results are interesting! >>>>>> >>>>>> I wrote a small HelloMPI app which basically calls usleep for a pause >>>>>> of 5 seconds. >>>>>> >>>>>> Now calling it as i did before, no MPI errors appear anymore, only the >>>>>> display problems: >>>>>> jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca >>>>>> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI >>>>>> /usr/bin/xterm Xt error: Can't open display: localhost:10.0 >>>>>> >>>>>> When i do the same call *with* the debug option, the xterm appears and >>>>>> shows the output of HelloMPI! >>>>>> I attach the output in ompidbg_1.txt (It also works if i call with >>>>>> '-np 4' and '--xterm 0,1,2,3' >>>>> >>>>> Good! >>>>> >>>>>> >>>>>> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt). >>>>>> >>>>>> If i use the hold-option, the xterm appears with the outpu
Re: [OMPI users] problems with the -xterm option
Hi Well, the difference is that one time i call the application 'HelloMPI' with the '--xterm' option, whereas in my previous mail i am calling the application 'xterm' (without the '--xterm' option) Jody On Mon, May 2, 2011 at 4:08 PM, Ralph Castain wrote: > > On May 2, 2011, at 7:56 AM, jody wrote: > >> Hi Ralph >> >> Thank You for doing the fix. >> >> Do you perhaps also have an idea what is going on when i try to start >> xterm (or probably an other X application) on a remote host? >> In this case it is not enough to specify the '--leave-session-attached' >> option. >> >> These calls won't open any xterms >> mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca >> plm_base_verbose 1 xterm >> mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" >> --leave-session-attached xterm >> mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca >> odls_base_verbose 5 xterm >> mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca >> odls_base_verbose 5 --leave-session-attached xterm >> >> But this will open the xterms: >> mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca >> plm_base_verbose 1 --leave-session-attached xterm >> >> Any verbosity level > 0 will open xterms, but with ' -mca >> plm_base_verbose 0' there are again no xterms. >> > > No earthly idea...this seems to contradict what you had below. You said you > were seeing the xterms with this cmd line: > >>>> I just found that everything works as expected if i use the the >>>> '--leave-session-attached' option (without the debug options): >>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca >>>> plm_rsh_agent "ssh -Y" --leave-session-attached --xterm 0,1,2,3! >>>> ./HelloMPI >>>> The xterms are also opened if i do not use the '!' hold option. >>> > > Did I miss something? 
> > >> Thank You >> Jody >> >> On Mon, May 2, 2011 at 2:29 PM, Ralph Castain wrote: >>> >>> On May 2, 2011, at 2:34 AM, jody wrote: >>> >>>> Hi Ralph >>>> >>>> I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0. >>>> The results are interesting! >>>> >>>> I wrote a small HelloMPI app which basically calls usleep for a pause >>>> of 5 seconds. >>>> >>>> Now calling it as i did before, no MPI errors appear anymore, only the >>>> display problems: >>>> jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca >>>> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI >>>> /usr/bin/xterm Xt error: Can't open display: localhost:10.0 >>>> >>>> When i do the same call *with* the debug option, the xterm appears and >>>> shows the output of HelloMPI! >>>> I attach the output in ompidbg_1.txt (It also works if i call with >>>> '-np 4' and '--xterm 0,1,2,3' >>> >>> Good! >>> >>>> >>>> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt). >>>> >>>> If i use the hold-option, the xterm appears with the output of >>>> 'hostrname' (cf. ompidbg_3.txt) >>>> The xterm opens after the line "launch complete for job..." has been >>>> written (line 59) >>> >>> Okay, that's also expected. Like I said, without the "hold", the output is >>> generated so quickly that the window just flashes at best. I've had similar >>> experiences - hence the "hold" option. >>> >>>> >>>> I just found that everything works as expected if i use the the >>>> '--leave-session-attached' option (without the debug options): >>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca >>>> plm_rsh_agent "ssh -Y" --leave-session-attached --xterm 0,1,2,3! >>>> ./HelloMPI >>>> The xterms are also opened if i do not use the '!' hold option. >>> >>> Okay, I can understand why. The --leave-session-attached option just tells >>> mpirun to not daemonize the backend daemons - thus leaving the ssh session >>> alive. 
The debug options do the same thing, but turn on all the debug >>> output. >>> >>> The problem is that if you don't leave the ssh sessi
Re: [OMPI users] problems with the -xterm option
Hi Ralph Thank You for doing the fix. Do you perhaps also have an idea what is going on when i try to start xterm (or probably an other X application) on a remote host? In this case it is not enough to specify the '--leave-session-attached' option. These calls won't open any xterms mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 1 xterm mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" --leave-session-attached xterm mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca odls_base_verbose 5 xterm mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca odls_base_verbose 5 --leave-session-attached xterm But this will open the xterms: mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 1 --leave-session-attached xterm Any verbosity level > 0 will open xterms, but with ' -mca plm_base_verbose 0' there are again no xterms. Thank You Jody On Mon, May 2, 2011 at 2:29 PM, Ralph Castain wrote: > > On May 2, 2011, at 2:34 AM, jody wrote: > >> Hi Ralph >> >> I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0. >> The results are interesting! >> >> I wrote a small HelloMPI app which basically calls usleep for a pause >> of 5 seconds. >> >> Now calling it as i did before, no MPI errors appear anymore, only the >> display problems: >> jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca >> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI >> /usr/bin/xterm Xt error: Can't open display: localhost:10.0 >> >> When i do the same call *with* the debug option, the xterm appears and >> shows the output of HelloMPI! >> I attach the output in ompidbg_1.txt (It also works if i call with >> '-np 4' and '--xterm 0,1,2,3' > > Good! > >> >> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt). >> >> If i use the hold-option, the xterm appears with the output of >> 'hostrname' (cf. ompidbg_3.txt) >> The xterm opens after the line "launch complete for job..." 
has been >> written (line 59) > > Okay, that's also expected. Like I said, without the "hold", the output is > generated so quickly that the window just flashes at best. I've had similar > experiences - hence the "hold" option. > >> >> I just found that everything works as expected if i use the the >> '--leave-session-attached' option (without the debug options): >> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca >> plm_rsh_agent "ssh -Y" --leave-session-attached --xterm 0,1,2,3! >> ./HelloMPI >> The xterms are also opened if i do not use the '!' hold option. > > Okay, I can understand why. The --leave-session-attached option just tells > mpirun to not daemonize the backend daemons - thus leaving the ssh session > alive. The debug options do the same thing, but turn on all the debug output. > > The problem is that if you don't leave the ssh session alive, then the xterm > has no way back to your screen. By daemonizing, we severe that connection. > > What I should do (and maybe used to do, but it got removed) is automatically > turn "on" the leave-session-attached option if you give --xterm. I can enter > that patch. > > Note that this does limit the size of the launch to the number of ssh > sessions the system allows you to have open at the same time. We default to a > limit of 128 nodes, which is likely adequate for an xterm-based debugging > session. However, you can increase it using an mca param (see ompi_info) to > as high as the system allows. > > Thanks for helping debug this! I'll add you to the patch list so you can > track it. > >> >> What does *not* work is >> jody@aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca >> plm_rsh_agent "ssh -Y" --leave-session-attached xterm >> xterm Xt error: Can't open display: >> xterm: DISPLAY is not set >> xterm Xt error: Can't open display: >> xterm: DISPLAY is not set >> >> But then again, this call works (i.e. an xterm is opened) if all the >> debug-options are used (ompidbg_4.txt). 
>> Here the '--leave-session-attached' is necessary - without it, no xterm. >> >>> From these results i would say that there is no basic mishandling of >> 'ssh', though i have no idea >> what internal differences the use of the '-leave-session-attached' >> option or the debug options make. >> >> I hope these
Re: [OMPI users] problems with the -xterm option
Hi Ralph

I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0.
The results are interesting!

I wrote a small HelloMPI app which basically calls usleep for a pause of 5 seconds.

Now calling it as i did before, no MPI errors appear anymore, only the display problems:

  jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
  /usr/bin/xterm Xt error: Can't open display: localhost:10.0

When i do the same call *with* the debug option, the xterm appears and shows the output of HelloMPI!
I attach the output in ompidbg_1.txt (It also works if i call with '-np 4' and '--xterm 0,1,2,3')

Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).

If i use the hold-option, the xterm appears with the output of 'hostname' (cf. ompidbg_3.txt)
The xterm opens after the line "launch complete for job..." has been written (line 59)

I just found that everything works as expected if i use the '--leave-session-attached' option (without the debug options):

  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" --leave-session-attached --xterm 0,1,2,3! ./HelloMPI

The xterms are also opened if i do not use the '!' hold option.

What does *not* work is

  jody@aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca plm_rsh_agent "ssh -Y" --leave-session-attached xterm
  xterm Xt error: Can't open display:
  xterm: DISPLAY is not set
  xterm Xt error: Can't open display:
  xterm: DISPLAY is not set

But then again, this call works (i.e. an xterm is opened) if all the debug-options are used (ompidbg_4.txt).
Here the '--leave-session-attached' is necessary - without it, no xterm.

From these results i would say that there is no basic mishandling of 'ssh', though i have no idea what internal differences the use of the '--leave-session-attached' option or the debug options make.
I hope these observations are helpful Jody On Fri, Apr 29, 2011 at 12:08 AM, jody wrote: > Hi Ralph > > Thank you for your suggestions. > I'll be happy to help you. > I'm not sure if i'll get around to this tomorrow, > but i certainly will do so on Monday. > > Thanks > Jody > > On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain wrote: >> Hi Jody >> >> I'm not sure when I'll get a chance to work on this - got a deadline to >> meet. I do have a couple of suggestions, if you wouldn't mind helping debug >> the problem? >> >> It looks to me like the problem is that mpirun is crashing or terminating >> early for some reason - hence the failures to send msgs to it, and the >> "lifeline lost" error that leads to the termination of the daemon. If you >> build a debug version of the code (i.e., --enable-debug on configure), you >> can get a lot of debug info that traces the behavior. >> >> If you could then run your program with >> >> -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached >> >> and send it to me, we'll see what ORTE thinks it is doing. >> >> You could also take a look at the code for implementing the xterm option. >> You'll find it in >> >> orte/mca/odls/base/odls_base_default_fns.c >> >> around line 1115. The xterm command syntax is defined in >> >> orte/mca/odls/base/odls_base_open.c >> >> around line 233 and following. Note that we use "xterm -T" as the cmd. >> Perhaps you can spot an error in the way we treat xterm? >> >> Also, remember that you have to specify that you want us to "hold" the xterm >> window open even after the process terminates. If you don't specify it, the >> window automatically closes upon completion of the process. So a >> fast-running cmd like "hostname" might disappear so quickly that it causes a >> race condition problem. >> >> You might want to try a spinner application - i.e.., output something and >> then sit in a loop or sleep for some period of time. 
Or, use the "hold" >> option to keep the window open - you designate "hold" by putting a '!' >> before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname" >> >> >> On Apr 28, 2011, at 8:38 AM, jody wrote: >> >>> Hi >>> >>> Unfortunately this does not solve my problem. >>> While i can do >>> ssh -Y squid_0 xterm >>> and this will open an xterm on m,y machiine (chefli), >>> i run into problems with the -xterm option of openmpi: >>> >>> jody@chefli ~/share/neander $ mp
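Ralph's suggestion of a "spinner" application (print something, then stay alive for a while so that the xterm does not vanish at once) could also be tried with a small shell script instead of a compiled program. The following is only a rough sketch and not the HelloMPI program mentioned above: the function name `spinner` and the default pause of 5 seconds are assumptions made for illustration. OMPI_COMM_WORLD_RANK is the variable seen in the printenv output elsewhere in this thread.

```shell
#!/bin/sh
# spinner [SECONDS]: say which rank and host we are, then keep the
# process alive for a while so that a window showing it stays open.
# (hypothetical helper, not part of Open MPI)
spinner() {
    pause="${1:-5}"
    # mpirun sets OMPI_COMM_WORLD_RANK for each launched process;
    # fall back to "?" when the function is run outside of mpirun.
    echo "rank ${OMPI_COMM_WORLD_RANK:-?} on $(uname -n)"
    sleep "$pause"
    echo "rank ${OMPI_COMM_WORLD_RANK:-?} done"
}
```

Saved as an executable script (say spinner.sh, a made-up name) with a last line `spinner "$@"`, it could presumably be started the same way as HelloMPI, e.g. `mpirun -np 4 -host squid_0 --xterm 0,1,2,3 ./spinner.sh 5`.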
Re: [OMPI users] problems with the -xterm option
Hi Ralph

Thank you for your suggestions.
I'll be happy to help you.
I'm not sure if i'll get around to this tomorrow,
but i certainly will do so on Monday.

Thanks
Jody

On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain wrote:
> Hi Jody
>
> I'm not sure when I'll get a chance to work on this - got a deadline to meet.
> I do have a couple of suggestions, if you wouldn't mind helping debug the
> problem?
>
> It looks to me like the problem is that mpirun is crashing or terminating
> early for some reason - hence the failures to send msgs to it, and the
> "lifeline lost" error that leads to the termination of the daemon. If you
> build a debug version of the code (i.e., --enable-debug on configure), you
> can get a lot of debug info that traces the behavior.
>
> If you could then run your program with
>
> -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>
> and send it to me, we'll see what ORTE thinks it is doing.
>
> You could also take a look at the code for implementing the xterm option.
> You'll find it in
>
> orte/mca/odls/base/odls_base_default_fns.c
>
> around line 1115. The xterm command syntax is defined in
>
> orte/mca/odls/base/odls_base_open.c
>
> around line 233 and following. Note that we use "xterm -T" as the cmd.
> Perhaps you can spot an error in the way we treat xterm?
>
> Also, remember that you have to specify that you want us to "hold" the xterm
> window open even after the process terminates. If you don't specify it, the
> window automatically closes upon completion of the process. So a fast-running
> cmd like "hostname" might disappear so quickly that it causes a race
> condition problem.
>
> You might want to try a spinner application - i.e., output something and
> then sit in a loop or sleep for some period of time. Or, use the "hold"
> option to keep the window open - you designate "hold" by putting a '!' before
> the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>
>
> On Apr 28, 2011, at 8:38 AM, jody wrote:
>
>> Hi
>>
>> Unfortunately this does not solve my problem.
>> While i can do
>> ssh -Y squid_0 xterm
>> and this will open an xterm on my machine (chefli),
>> i run into problems with the -xterm option of openmpi:
>>
>> jody@chefli ~/share/neander $ mpirun -np 4 -mca plm_rsh_agent "ssh
>> -Y" -host squid_0 --xterm 1 hostname
>> squid_0
>> [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>> lifeline [[35219,0],0] lost
>> [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>> lifeline [[35219,0],0] lost
>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>
>> By the way when i look at the DISPLAY variable in the xterm window
>> opened via squid_0,
>> i also have the display variable "localhost:11.0"
>>
>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>> the lines with the warnings about "xauth" and "untrusted X" do not
>> appear:
>>
>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1 hostname
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> squid_0
>> [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>> lifeline [[34926,0],0] lost
>> [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>> lifeline [[34926,0],0] lost
>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>
>>
>> I have doubts that the "-Y" is passed correctly:
>> jody@triops ~/share/neander $ mpirun -np -mca plm_rsh_agent "ssh
>> -Y" -host squid_0 xterm
>> xterm Xt error: Can't open display:
>> xterm: DISPLAY is not set
>> xterm Xt error: Can't open display:
>> xterm: DISPLAY is not set
>>
>>
>> ---> as a matte
Re: [OMPI users] problems with the -xterm option
Hi

Unfortunately this does not solve my problem.
While i can do

  ssh -Y squid_0 xterm

and this will open an xterm on my machine (chefli),
i run into problems with the -xterm option of openmpi:

  jody@chefli ~/share/neander $ mpirun -np 4 -mca plm_rsh_agent "ssh -Y" -host squid_0 --xterm 1 hostname
  squid_0
  [squid_0:28046] [[35219,0],1]->[[35219,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to lifeline [[35219,0],0] lost
  [squid_0:28046] [[35219,0],1]->[[35219,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to lifeline [[35219,0],0] lost
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0

By the way when i look at the DISPLAY variable in the xterm window opened via squid_0,
i also have the display variable "localhost:11.0"

Actually, the difference with using the "-mca plm_rsh_agent" is that
the lines with the warnings about "xauth" and "untrusted X" do not appear:

  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1 hostname
  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  squid_0
  [squid_0:28337] [[34926,0],1]->[[34926,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to lifeline [[34926,0],0] lost
  [squid_0:28337] [[34926,0],1]->[[34926,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to lifeline [[34926,0],0] lost
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0

I have doubts that the "-Y" is passed correctly:

  jody@triops ~/share/neander $ mpirun -np -mca plm_rsh_agent "ssh -Y" -host squid_0 xterm
  xterm Xt error: Can't open display:
  xterm: DISPLAY is not set
  xterm Xt error: Can't open display:
  xterm: DISPLAY is not set

---> as a matter of fact i noticed that the xterm option doesn't work locally:

  mpirun -np 4 -xterm 1 /usr/bin/printenv

prints everything onto the console.

Do you have any other suggestions i could try?

Thank You
Jody

On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain wrote:
> Should be able to just set
>
> -mca plm_rsh_agent "ssh -Y"
>
> on your cmd line, I believe
>
> On Apr 28, 2011, at 12:53 AM, jody wrote:
>
>> Hi Ralph
>>
>> Is there an easy way i could modify the OpenMPI code so that it would use
>> the -Y option for ssh when connecting to remote machines?
>>
>> Thank You
>> Jody
>>
>> On Thu, Apr 7, 2011 at 4:01 PM, jody wrote:
>>> Hi Ralph
>>> thank you for your suggestions. After some fiddling, i found that after my
>>> last update (gentoo) my sshd_config had been overwritten
>>> (X11Forwarding was set to 'no').
>>>
>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>> and with 'ssh -X'
>>> (but with '-X' i still get those xauth warnings)
>>>
>>> But the xterm option still doesn't work:
>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>> printenv | grep WORLD_RANK
>>> Warning: untrusted X11 forwarding setup failed: xauth key data not
>>> generated
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>> OMPI_COMM_WORLD_RANK=0
>>> [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>> [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>> lifeline [[54132,0],0] lost
>>>
>>> So it looks like the two processes from squid_0 can't open the display this
>>> way,
>>> but one of them writes the output to the console...
>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y'
>>> the
>>> DISPLAY variable is set to 'localhost:10.0'
>>>
>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>
>>> Thank You
>>> Jody
>>>
>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain wrote:
>>>> Here's a little more info - it's for Cygwin, but I don't see a
Re: [OMPI users] problems with the -xterm option
Hi Ralph

Is there an easy way i could modify the OpenMPI code so that it would use
the -Y option for ssh when connecting to remote machines?

Thank You
Jody

On Thu, Apr 7, 2011 at 4:01 PM, jody wrote:
> Hi Ralph
> thank you for your suggestions. After some fiddling, i found that after my
> last update (gentoo) my sshd_config had been overwritten
> (X11Forwarding was set to 'no').
>
> After correcting that, i can now open remote terminals with 'ssh -Y'
> and with 'ssh -X'
> (but with '-X' i still get those xauth warnings)
>
> But the xterm option still doesn't work:
> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
> printenv | grep WORLD_RANK
> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
> OMPI_COMM_WORLD_RANK=0
> [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
> [sd = 8]
> [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
> lifeline [[54132,0],0] lost
>
> So it looks like the two processes from squid_0 can't open the display this
> way,
> but one of them writes the output to the console...
> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
> DISPLAY variable is set to 'localhost:10.0'
>
> So in what way would OMPI have to be adapted, so -xterm would work?
>
> Thank You
> Jody
>
> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain wrote:
>> Here's a little more info - it's for Cygwin, but I don't see anything
>> Cygwin-specific in the answers:
>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>
>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>
>> Sorry Jody - I should have read your note more carefully to see that you
>> already tried -Y. :-(
>> Not sure what to suggest...
>>
>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>
>> Like I said, I'm not expert. However, a quick "google" of revealed this
>> result:
>>
>> When trying to set up x11 forwarding over an ssh session to a remote server
>> with the -X switch, I was getting an error like Warning: No xauth
>> data; using fake authentication data for X11 forwarding.
>>
>> When doing something like:
>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>> got an error message like:
>>
>>
>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>> [root@RHEL ~]#
>> and any X programs I ran would not display on my local system..
>>
>> Turns out the solution is to use the -Y switch instead.
>>
>> ssh -Yl root 10.1.1.9
>>
>> and that worked fine.
>>
>> See if that works for you - if it does, we may have to modify OMPI to
>> accommodate.
>>
>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>
>> Hi Ralph
>> No, after the above error message mpirun has exited.
>>
>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>
>> jody@chefli ~/share/neander $ ssh -Y squid_0
>> Last login: Wed Apr 6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>> jody@squid_0 ~ $ xterm
>> xterm Xt error: Can't open display:
>> xterm: DISPLAY is not set
>> jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>> jody@squid_0 ~ $ xterm
>> xterm Xt error: Can't open display: 130.60.126.74:0.0
>> jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>> jody@squid_0 ~ $ xterm
>> xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>> jody@squid_0 ~ $ exit
>> logout
>>
>> same thing with ssh -X, but here i get the same warning/error message
>> as with mpirun:
>>
>> jody@chefli ~/share/neander $ ssh -X squid_0
>> Warning: untrusted X11 forwarding setup failed: xauth key data not
>> generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Wed Apr 6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>
>> So perhaps the whole problem is linked to that xauth-thing.
>> Do you have a suggestion how this can be solved?
Re: [OMPI users] problems with the -xterm option
Hi Ralph

thank you for your suggestions. After some fiddling, i found that after my
last update (gentoo) my sshd_config had been overwritten
(X11Forwarding was set to 'no').

After correcting that, i can now open remote terminals with 'ssh -Y'
and with 'ssh -X'
(but with '-X' i still get those xauth warnings)

But the xterm option still doesn't work:

  jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2 printenv | grep WORLD_RANK
  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
  OMPI_COMM_WORLD_RANK=0
  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to lifeline [[54132,0],0] lost

So it looks like the two processes from squid_0 can't open the display this way,
but one of them writes the output to the console...
Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
DISPLAY variable is set to 'localhost:10.0'

So in what way would OMPI have to be adapted, so -xterm would work?

Thank You
Jody

On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain wrote:
> Here's a little more info - it's for Cygwin, but I don't see anything
> Cygwin-specific in the answers:
> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>
> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>
> Sorry Jody - I should have read your note more carefully to see that you
> already tried -Y. :-(
> Not sure what to suggest...
>
> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>
> Like I said, I'm not expert. However, a quick "google" of revealed this
> result:
>
> When trying to set up x11 forwarding over an ssh session to a remote server
> with the -X switch, I was getting an error like Warning: No xauth
> data; using fake authentication data for X11 forwarding.
>
> When doing something like:
> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
> got an error message like:
>
>
> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
> [root@RHEL ~]#
> and any X programs I ran would not display on my local system..
>
> Turns out the solution is to use the -Y switch instead.
>
> ssh -Yl root 10.1.1.9
>
> and that worked fine.
>
> See if that works for you - if it does, we may have to modify OMPI to
> accommodate.
>
> On Apr 6, 2011, at 9:19 AM, jody wrote:
>
> Hi Ralph
> No, after the above error message mpirun has exited.
>
> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>
> jody@chefli ~/share/neander $ ssh -Y squid_0
> Last login: Wed Apr 6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
> jody@squid_0 ~ $ xterm
> xterm Xt error: Can't open display:
> xterm: DISPLAY is not set
> jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
> jody@squid_0 ~ $ xterm
> xterm Xt error: Can't open display: 130.60.126.74:0.0
> jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
> jody@squid_0 ~ $ xterm
> xterm Xt error: Can't open display: chefli.uzh.ch:0.0
> jody@squid_0 ~ $ exit
> logout
>
> same thing with ssh -X, but here i get the same warning/error message
> as with mpirun:
>
> jody@chefli ~/share/neander $ ssh -X squid_0
> Warning: untrusted X11 forwarding setup failed: xauth key data not
> generated
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> Last login: Wed Apr 6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>
> So perhaps the whole problem is linked to that xauth-thing.
> Do you have a suggestion how this can be solved?
>
> Thank You
> Jody
>
> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain wrote:
>
> If I read your error messages correctly, it looks like mpirun is crashing -
> the daemon is complaining that it lost the socket connection back to mpirun,
> and hence will abort.
>
> Are you seeing mpirun still alive?
>
>
> On Apr 5, 2011, at 4:46 AM, jody wrote:
>
> Hi
>
> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>
> it works in "text-mode":
>
> $ mpirun -np 4 -x DISPLAY -host squid_0 printenv | grep WORLD_RANK
>
> OMPI_COMM_WORLD_RANK=0
>
> OMPI_COMM_WOR
Re: [OMPI users] problems with the -xterm option
Hi Ralph

No, after the above error message mpirun has exited.

But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:

  jody@chefli ~/share/neander $ ssh -Y squid_0
  Last login: Wed Apr 6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
  jody@squid_0 ~ $ xterm
  xterm Xt error: Can't open display:
  xterm: DISPLAY is not set
  jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
  jody@squid_0 ~ $ xterm
  xterm Xt error: Can't open display: 130.60.126.74:0.0
  jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
  jody@squid_0 ~ $ xterm
  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
  jody@squid_0 ~ $ exit
  logout

same thing with ssh -X, but here i get the same warning/error message
as with mpirun:

  jody@chefli ~/share/neander $ ssh -X squid_0
  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  Last login: Wed Apr 6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh

So perhaps the whole problem is linked to that xauth-thing.
Do you have a suggestion how this can be solved?

Thank You
Jody

On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain wrote:
> If I read your error messages correctly, it looks like mpirun is crashing -
> the daemon is complaining that it lost the socket connection back to mpirun,
> and hence will abort.
>
> Are you seeing mpirun still alive?
>
>
> On Apr 5, 2011, at 4:46 AM, jody wrote:
>
>> Hi
>>
>> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>> it works in "text-mode":
>> $ mpirun -np 4 -x DISPLAY -host squid_0 printenv | grep WORLD_RANK
>> OMPI_COMM_WORLD_RANK=0
>> OMPI_COMM_WORLD_RANK=1
>> OMPI_COMM_WORLD_RANK=2
>> OMPI_COMM_WORLD_RANK=3
>>
>> but when i use the -xterm option to mpirun, it doesn't work
>>
>> $ mpirun -np 4 -x DISPLAY -host squid_0 -xterm 1,2 printenv | grep
>> WORLD_RANK
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> OMPI_COMM_WORLD_RANK=0
>> [squid_0:05266] [[55607,0],1]->[[55607,0],0]
>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>> [sd = 8]
>> [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
>> lifeline [[55607,0],0] lost
>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>
>> (strange: somebody wrote his message to the console)
>>
>> No matter whether i set the DISPLAY variable to the full hostname of
>> the workstation,
>> to the IP address of the workstation or simply to ":0.0", it doesn't work
>>
>> But i do have xauth data (as far as i know):
>> On the remote (squid_0):
>> jody@squid_0 ~ $ xauth list
>> chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
>> chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>> chefli.uzh.ch:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>
>> on the workstation:
>> $ xauth list
>> chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
>> chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>> localhost.localdomain/unix:0 MIT-MAGIC-COOKIE-1
>> 146c7f438fab79deb8a8a7df242b6f4b
>> chefli.uzh.ch/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>
>> In sshd_config on the workstation i have 'X11Forwarding yes'
>> I have also done
>> xhost + squid_0
>> on the workstation.
>>
>>
>> How can i get the -xterm option running?
>>
>> Thank You
>> Jody
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
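The failures above come down to what DISPLAY looks like in the environment where xterm is actually started. A small helper for comparing an interactive login with a command launched through ssh or mpirun might look like the sketch below. The function name `report_display` is made up for this example, and the script only reports the variable; it does not test whether the X server can be reached.

```shell
#!/bin/sh
# report_display: print the DISPLAY value seen by this process,
# or a note that it is missing (return code 1 in that case).
# (hypothetical helper for illustration only)
report_display() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set on $(uname -n)"
        return 1
    fi
    echo "DISPLAY=$DISPLAY on $(uname -n)"
}
```

With a last line `report_display` added, the script could be fed to a remote shell, e.g. `ssh -Y squid_0 sh < report_display.sh`, or started under mpirun with the same options as the failing xterm call, and the two outputs compared.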
[OMPI users] problems with the -xterm option
Hi

On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
it works in "text-mode":

  $ mpirun -np 4 -x DISPLAY -host squid_0 printenv | grep WORLD_RANK
  OMPI_COMM_WORLD_RANK=0
  OMPI_COMM_WORLD_RANK=1
  OMPI_COMM_WORLD_RANK=2
  OMPI_COMM_WORLD_RANK=3

but when i use the -xterm option to mpirun, it doesn't work

  $ mpirun -np 4 -x DISPLAY -host squid_0 -xterm 1,2 printenv | grep WORLD_RANK
  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
  Warning: No xauth data; using fake authentication data for X11 forwarding.
  OMPI_COMM_WORLD_RANK=0
  [squid_0:05266] [[55607,0],1]->[[55607,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to lifeline [[55607,0],0] lost
  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0

(strange: somebody wrote his message to the console)

No matter whether i set the DISPLAY variable to the full hostname of the workstation,
to the IP address of the workstation or simply to ":0.0", it doesn't work

But i do have xauth data (as far as i know):
On the remote (squid_0):

  jody@squid_0 ~ $ xauth list
  chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
  chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
  chefli.uzh.ch:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b

on the workstation:

  $ xauth list
  chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
  chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
  localhost.localdomain/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
  chefli.uzh.ch/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b

In sshd_config on the workstation i have 'X11Forwarding yes'
I have also done

  xhost + squid_0

on the workstation.

How can i get the -xterm option running?

Thank You
Jody
Re: [OMPI users] WRF Problem running in Parallel
Hi

At first glance i would say this is not an Open MPI problem, but a WRF problem
(though i must admit i have no knowledge whatsoever of WRF).

Have you tried running a single instance of wrf.exe?
Have you tried to run a simple application (like a "hello world") on your nodes?

Jody

On Tue, Feb 22, 2011 at 7:37 AM, Ahsan Ali wrote:
> Hello,
> I am stuck in a problem that is regarding the running for Weather research
> and Forecasting Model (WRFV 3.2.1). I get the following error while running
> with mpirun. Any help would be highly appreciated.
>
> [pmdtest@pmd02 em_real]$ mpirun -np 4 wrf.exe
> starting wrf task 0 of 4
> starting wrf task 1 of 4
> starting wrf task 3 of 4
> starting wrf task 2 of 4
> --
> mpirun noticed that process rank 3 with PID 6044 on node pmd02.pakmet.com
> exited on signal 11 (Segmentation fault).
>
>
>
> --
> Syed Ahsan Ali Bokhari
> Electronic Engineer (EE)
> Research & Development Division
> Pakistan Meteorological Department H-8/4, Islamabad.
> Phone # off +92518358714
> Cell # +923155145014
Re: [OMPI users] calling a customized MPI_Allreduce with MPI_PACKED datatype
Hi Massimo

Just to make sure: usually the MPI_ERR_TRUNCATE error is caused by
buffer sizes that are too small.
Can you verify that the buffers you are using are large enough to hold
the data they should receive?

Jody

On Sat, Feb 5, 2011 at 6:37 PM, Massimo Cafaro wrote:
> Dear all,
>
> in one of my C codes developed using Open MPI v1.4.3 I need to call
> MPI_Allreduce() passing as sendbuf and recvbuf arguments two MPI_PACKED
> arrays. The reduction requires my own MPI_User_function that needs to
> MPI_Unpack() its first and second argument, process them and finally
> MPI_Pack() the result in the second argument.
>
> I need to use MPI_Pack/MPI_Unpack because I am not able to create a derived
> datatype, since many data I need to send are dynamically allocated.
> However, the code fails at runtime with the following message:
>
> An error occurred in MPI_Unpack
> on communicator MPI_COMM_WORLD
> MPI_ERR_TRUNCATE: message truncated
> MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>
> I have verified that, after unpacking the data in my own reduction function,
> all of the data are wrong.
> Is this possible in MPI? I did not find anything on the "MPI reference Volume
> 1" and "Using MPI" that prevents this. This should just require using as
> datatype MPI_PACKED in MPI_Allreduce() . However, searching on the web I did
> not find any examples.
>
> Thank you in advance for any clue/suggestions/source code examples.
> This is driving me crazy now ;-(
>
> Massimo Cafaro
>
> --
> Massimo Cafaro, Ph.D.
> Assistant Professor
> Dept. of Engineering for Innovation
> University of Salento, Lecce, Italy
> Via per Monteroni
> 73100 Lecce, Italy
> Voice/Fax +39 0832 297371
> Web http://sara.unisalento.it/~cafaro
>
> Additional affiliations:
> Euro-Mediterranean Centre for Climate Change
> SPACI Consortium
>
> E-mail massimo.caf...@unisalento.it
>        massimo.caf...@cmcc.it
>        caf...@ieee.org
>        caf...@acm.org
Re: [OMPI users] heterogenous cluster
Thanks all

I did the simple copying of the 32Bit applications and now it works.

Thanks
Jody

On Wed, Feb 2, 2011 at 5:47 PM, David Mathog wrote:
> jody wrote:
>
>> How can i force OpenMPI to be built as a 32Bit application on a 64Bit
>> machine?
>
> The easiest way is not to - just copy over a build from a 32 bit
> machine, it will run on your 64 bit machine if the proper 32 bit
> libraries have been installed there. Otherwise, you need to put -m32
> on the gcc command lines. Generally one does that by something like:
>
> export CFLAGS=-m32
>
> before running configure to generate Makefiles.
>
> Regards,
>
>
> David Mathog
> mat...@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
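For the other route David mentions (putting -m32 on the compiler command lines), the configure call might be put together as in the sketch below. It only prints the command line instead of running it, so nothing is built. The install prefix is a made-up example, and passing -m32 for the C++ and Fortran compilers as well (CXXFLAGS, FFLAGS, FCFLAGS) is an assumption going beyond the single CFLAGS setting quoted above.

```shell
#!/bin/sh
# m32_configure_cmd [PREFIX]: print a configure command line that asks
# for a 32 bit build on a 64 bit machine. Dry run only, nothing is executed.
# (hypothetical helper; the default prefix is just an example)
m32_configure_cmd() {
    prefix="${1:-$HOME/openmpi-32}"
    echo "./configure --prefix=$prefix CFLAGS=-m32 CXXFLAGS=-m32 FFLAGS=-m32 FCFLAGS=-m32"
}
```

The printed line would then be run in the Open MPI source directory, followed by 'make all install' as usual. The 32 bit system libraries David refers to are still needed on the 64 bit machine.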
Re: [OMPI users] heterogenous cluster
Thanks for your reply.

If i try your suggestion, every process fails with the following message:

  *** The MPI_Init() function was called before MPI_INIT was invoked.
  *** This is disallowed by the MPI standard.
  *** Your MPI job will now abort.
  [aim-triops:15460] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!

I think this is caused by the fact that on the 64Bit machine Open MPI
is also built as a 64 bit application.
How can i force OpenMPI to be built as a 32Bit application on a 64Bit machine?

Thank You
Jody

On Tue, Feb 1, 2011 at 9:00 PM, David Mathog wrote:
>
>> I have so far used a homogeneous 32-bit cluster.
>> Now i have added a new machine which is 64 bit
>>
>> This means i have to reconfigure open MPI with
>> `--enable-heterogeneous`, right?
>
> Not necessarily. If you don't need the 64bit capabilities you could run
> 32 bit binaries along with a 32 bit version of OpenMPI. At least that
> approach has worked so far for me.
>
> Regards,
>
> David Mathog
> mat...@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
[OMPI users] heterogenous cluster
Hi

I have so far used a homogeneous 32-bit cluster.
Now i have added a new machine which is 64 bit

This means i have to reconfigure open MPI with `--enable-heterogeneous`, right?
Do i have to do this on every machine?
I don't remember all the options i had chosen when i first did the
configure - is there a way to find this out?

Thank You
Jody
Re: [OMPI users] [SPAM:### 83%] problem when compiling ompenmpi V1.5.1
Hi

if i remember correctly, "omp.h" is a header file for OpenMP,
which is not the same as Open MPI.
So it looks like you have to install OpenMP.
Then you can compile it with the compiler option -fopenmp (in gcc)

Jody

On Thu, Dec 16, 2010 at 11:56 AM, Bernard Secher - SFME/LGLS wrote:
> I get the following error message when I compile openmpi V1.5.1:
>
> CXX otfprofile-otfprofile.o
> ../../../../../../../../../openmpi-1.5.1-src/ompi/contrib/vt/vt/extlib/otf/tools/otfprofile/otfprofile.cpp:11:18:
> erreur: omp.h : Aucun fichier ou dossier de ce type
> ../../../../../../../../../openmpi-1.5.1-src/ompi/contrib/vt/vt/extlib/otf/tools/otfprofile/otfprofile.cpp:
> In function ‘int main(int, const char**)’:
> ../../../../../../../../../openmpi-1.5.1-src/ompi/contrib/vt/vt/extlib/otf/tools/otfprofile/otfprofile.cpp:325:
> erreur: ‘omp_set_num_threads’ was not declared in this scope
> ../../../../../../../../../openmpi-1.5.1-src/ompi/contrib/vt/vt/extlib/otf/tools/otfprofile/otfprofile.cpp:460:
> erreur: ‘omp_get_thread_num’ was not declared in this scope
> ../../../../../../../../../openmpi-1.5.1-src/ompi/contrib/vt/vt/extlib/otf/tools/otfprofile/otfprofile.cpp:471:
> erreur: ‘omp_get_num_threads’ was not declared in this scope
>
> The compiler doesn't find the omp.h file.
> What happens ?
>
> Best
> Bernard
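Whether a given compiler finds omp.h and accepts -fopenmp can be checked before starting the whole build again. The sketch below compiles a tiny test program for that purpose. The function name `has_openmp` is invented for this example, and the check assumes a gcc-compatible compiler driver that understands `-x c -` (read C source from standard input). With gcc, OpenMP support usually comes with the compiler itself, so a failing check may simply point to an incomplete compiler installation.

```shell
#!/bin/sh
# has_openmp [CC]: return 0 if compiler CC (default: gcc) can build a
# minimal OpenMP program with -fopenmp, non-zero otherwise.
# (hypothetical helper, assumes a gcc-compatible command line)
has_openmp() {
    cc="${1:-gcc}"
    # give up early if the compiler is not installed at all
    command -v "$cc" >/dev/null 2>&1 || return 1
    tmp=$(mktemp) || return 1
    if printf '#include <omp.h>\nint main(void) { return omp_get_max_threads() > 0 ? 0 : 1; }\n' |
        "$cc" -fopenmp -x c - -o "$tmp" >/dev/null 2>&1
    then
        rc=0
    else
        rc=1
    fi
    rm -f "$tmp"
    return "$rc"
}
```

Usage would be something like `has_openmp gcc && echo "OpenMP available"`.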
Re: [OMPI users] Guaranteed run rank 0 on a given machine?
In a similar situation i wrote a simple shell script "rankcreate.sh"
which creates a rank file assigning the various ranks to the correct
processors/slots when given a number of processes.
In addition, this script returns the name of this created rank file.
I then use it like this:

  mpirun -np 5 --rankfile `rankcreate.sh 5` myApplication

Maybe this is of use for you

jody

On Fri, Dec 10, 2010 at 11:50 PM, Eugene Loh wrote:
> David Mathog wrote:
>
>> Also, in my limited testing --host and -hostfile seem to be mutually
>> exclusive.
>>
> No. You can use both together. Indeed, the mpirun man page even has
> examples of this (though personally, I don't see having a use for this). I
> think the idea was you might use a hostfile to define the nodes in your
> cluster and an mpirun command line that uses --host to select specific nodes
> from the file.
>
>> That is reasonable, but it isn't clear that it is intended.
>> Example, with a hostfile containing one entry for "monkey02.cluster
>> slots=1":
>>
>> mpirun --host monkey01 --mca plm_rsh_agent rsh hostname
>> monkey01.cluster
>>
>
> Okay.
>
>> mpirun --host monkey02 --mca plm_rsh_agent rsh hostname
>> monkey02.cluster
>>
>
> Okay.
>
>> mpirun -hostfile /usr/common/etc/openmpi.machines.test1 \
>> --mca plm_rsh_agent rsh hostname
>> monkey02.cluster
>>
>
> Okay.
>
>> mpirun --host monkey01 \
>> -hostfile /usr/commom/etc/openmpi.machines.test1 \
>> --mca plm_rsh_agent rsh hostname
>> --
>> There are no allocated resources for the application hostname
>> that match the requested mapping:
>>
>> Verify that you have mapped the allocated resources properly using the
>> --host or --hostfile specification.
>> --
>>
>
> Right. Your hostfile has monkey02. On the command line, you specify
> monkey01, but that's not in your hostfile. That's a problem. Just like on
> the mpirun man page.
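The rankcreate.sh script itself is not shown in this message, so the following is only a guess at how such a script could be written. It distributes the ranks round-robin over the hosts given on the command line and prints the name of the rank file it created. The function name `make_rankfile`, the round-robin placement and the slot numbering are assumptions for illustration; each line uses the rankfile form "rank N=host slot=S".

```shell
#!/bin/sh
# make_rankfile NP HOST [HOST...]: write a rank file that assigns ranks
# 0..NP-1 round-robin to the given hosts, then print the file name.
# (hypothetical stand-in, not the original rankcreate.sh)
make_rankfile() {
    if [ "$#" -lt 2 ]; then
        echo "usage: make_rankfile NP HOST [HOST...]" >&2
        return 1
    fi
    np="$1"
    shift
    file=$(mktemp "${TMPDIR:-/tmp}/rankfile.XXXXXX") || return 1
    rank=0
    while [ "$rank" -lt "$np" ]; do
        # position of the host for this rank (1-based, wraps around)
        idx=$(( rank % $# + 1 ))
        eval "host=\${$idx}"
        # each full pass over the host list moves on to the next slot
        slot=$(( rank / $# ))
        echo "rank $rank=$host slot=$slot" >> "$file"
        rank=$(( rank + 1 ))
    done
    echo "$file"
}
```

It would be used in the same way as above, for instance `mpirun -np 5 --rankfile "$(make_rankfile 5 squid_0 squid_1)" myApplication`, where the two host names are just examples.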
[OMPI users] using totalview
Hi
I am currently testing a demo version of totalview.
I am putting this question here because the totalview manual is very sparse
on information about OpenMPI.

The first question is how to start totalview with mpirun.
I saw that mpirun has some inbuilt totalview capability.

  For debugging:

  -debug, --debug
      Invoke the user-level debugger indicated by the
      orte_base_user_debugger MCA parameter.

  -debugger, --debugger
      Sequence of debuggers to search for when --debug is used (i.e.
      a synonym for orte_base_user_debugger MCA parameter).

  -tv, --tv
      Launch processes under the TotalView debugger. Deprecated
      backwards compatibility flag. Synonym for --debug.

I tried

  mpirun -np 4 -tv HelloMPI

but that seemed to be debugging mpirun, and i wasn't able to open the source
window for HelloMPI.cpp.

I don't understand how the '--debug' option must be used; in particular,
i don't understand "user-level debugger indicated by the
orte_base_user_debugger MCA parameter."

Another question (which might be solved if i can correctly start up
totalview) concerns the hostfile and rankfile parameters of mpirun:
how can i start an open mpi application with totalview so that my
application starts the processes on the correct processors as defined in
hostfile and rankfile?

Thank You
Jody
Re: [OMPI users] message truncated error
Hi Jack

> the buffersize is the same in two iterations.

This doesn't help if the message sent in the second iteration is larger
than buffersize. But as David says, without the details of the message
sending and of potential changes to the receive buffer one can't make any
precise diagnosis.

jody

On Mon, Nov 1, 2010 at 6:41 PM, Jack Bryan wrote:
> thanks
> I use
> double* recvArray = new double[buffersize];
> The receive buffer size
> MPI::COMM_WORLD.Recv(&(recvDataArray[0]), xVSize, MPI_DOUBLE, 0, mytaskTag);
> delete [] recvArray ;
> In first iteration, the receiver works well.
> But, in second iteration ,
> I got the
> MPI_ERR_TRUNCATE: message truncated
> the buffersize is the same in two iterations.
>
> Any help is appreciated.
> thanks
> Nov. 1 2010
Re: [OMPI users] link problem on 64bit platform
Hi

@trent
No, i didn't use the other calls, because i think they are all the same
(on my installation they are all soft links to opal_wrapper).

@tim
gentoo on 64 bit does have lib and lib64 directories for the respective
architectures (at / and at /usr), but in my 64-bit installation of openMPI
there is no lib64 directory, only a lib. I thought the naming of the internal
directory structure of openMPI would be determined by the installation
(i.e. the `make install`) and not by the operating system.

@jeff
i don't remember the particular CFLAGS or CXXFLAGS i had used,
but i have now rebuilt openMPI with

  ./configure CFLAGS=-m64 CXXFLAGS=-m64 --prefix=/opt/openmpi-1.4.2-64 \
      --with-threads --disable-mpi-f77 --disable-mpi-f90

and now the problem has been solved.

After something similar happened when i tried to do 32bit compilations,
i think i found out what the original problem was:
I had first done a 64 bit installation of OpenMPI under /opt/openmpi-1.4.2.
I later renamed this directory to /opt/openmpi-1.4.2-64 and installed a
32bit version of OpenMPI in /opt/openmpi-1.4.2.
Apparently, when i then tried to do a 64bit compilation, the linker looked
into the lib directory with the *original* name /opt/openmpi-1.4.2 instead of
/opt/openmpi-1.4.2-64, so of course it only found the 32bit libs of the
newer installation.

To test this assumption i now renamed the 64-bit installation, set my
/opt/openmpi link to the new directory and tried to compile:

  jody@aim-squid_0 ~/progs $ mpiCC -g -o HelloMPI HelloMPI.cpp
  Cannot open configuration file /opt/openmpi-1.4.2-64/share/openmpi/mpiCC-wrapper-data.txt
  Error parsing data file mpiCC: Not found

So again, it looked into the original installation directory of the 64-bit
installation for some files.

So i guess the basic question is: is it permitted to rename openMPI
installations, and if yes, how is this properly done (since a simple mv
doesn't work)?

Sorry about the imprecise question. Indeed, if i had looked closely at the
original output, i should have noticed that the linker was looking in the
wrong directory.

Thank You
Jody

On Mon, Nov 1, 2010 at 1:52 PM, Tim Prince wrote:
> On 11/1/2010 5:24 AM, Jeff Squyres wrote:
>>
>> On Nov 1, 2010, at 5:20 AM, jody wrote:
>>
>>> jody@aim-squid_0 ~/progs $ mpiCC -g -o HelloMPI HelloMPI.cpp
>>>
>>> /usr/lib/gcc/x86_64-pc-linux-gnu/4.4.4/../../../../x86_64-pc-linux-gnu/bin/ld:
>>> skipping incompatible /opt/openmpi-1.4.2/lib/libmpi_cxx.so when
>>> searching for -lmpi_cxx
>>
>> This is the key message -- it found libmpi_cxx.so, but the linker deemed
>> it incompatible, so it skipped it.
>
> Typically, it means that the cited library is a 32-bit one, to which the
> 64-bit ld will react in this way. You could have verified this by
> file /opt/openmpi-1.4.2/lib/*
> By normal linux conventions a directory named /lib/ as opposed to /lib64/
> would contain only 32-bit libraries. If gentoo doesn't conform with those
> conventions, maybe you should do your learning on a distro which does.
>
> --
> Tim Prince
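Tim Prince's suggestion of running `file` on the libraries is the simplest check. Where `file` is not at hand, the same information can be read from the ELF header directly: byte 4 (EI_CLASS) is 1 for a 32-bit object and 2 for a 64-bit one. The helper below is only a sketch; the name elf_class is made up, and it inspects nothing beyond the first five bytes.

```shell
#!/bin/sh
# elf_class FILE
# Prints 32 or 64 depending on the EI_CLASS byte of the ELF header,
# or "not-elf" if FILE does not start with the ELF magic number.
# (Hypothetical helper; `file FILE` reports the same and much more.)
elf_class() {
    magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not-elf"
        return 1
    fi
    case $(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n') in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown; return 1 ;;
    esac
}
```

For the situation in this thread one would compare, for example, elf_class /opt/openmpi-1.4.2/lib/libmpi_cxx.so with elf_class /opt/openmpi-1.4.2-64/lib/libmpi_cxx.so.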
[OMPI users] link problem on 64bit platform
Hi
On a newly installed 64bit linux (2.6.32-gentoo-r7) with gcc version 4.4.4
i can't compile even simple Open-MPI applications (OpenMPI 1.4.2).
The message is:

  jody@aim-squid_0 ~/progs $ mpiCC -g -o HelloMPI HelloMPI.cpp
  /usr/lib/gcc/x86_64-pc-linux-gnu/4.4.4/../../../../x86_64-pc-linux-gnu/bin/ld: skipping incompatible /opt/openmpi-1.4.2/lib/libmpi_cxx.so when searching for -lmpi_cxx
  /usr/lib/gcc/x86_64-pc-linux-gnu/4.4.4/../../../../x86_64-pc-linux-gnu/bin/ld: cannot find -lmpi_cxx
  collect2: ld returned 1 exit status

I am using the 64bit mpiCC:

  jody@aim-squid_0 ~/progs $ which mpiCC
  /opt/openmpi/bin/mpiCC
  jody@aim-squid_0 ~/progs $ ls -l /opt/openmpi
  lrwxrwxrwx 1 root root 22 Nov 1 09:56 /opt/openmpi -> /opt/openmpi-1.4.2-64/

The mpi_cxx should be found in the lib subdirectory:

  jody@aim-squid_0 ~/progs $ ls -l /opt/openmpi/lib/libmpi_cxx*
  -rwxr-xr-x 1 root root   1073 Jun 24 15:50 /opt/openmpi/lib/libmpi_cxx.la
  lrwxrwxrwx 1 root root     19 Jun 24 15:50 /opt/openmpi/lib/libmpi_cxx.so -> libmpi_cxx.so.0.0.1
  lrwxrwxrwx 1 root root     19 Jun 24 15:50 /opt/openmpi/lib/libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
  -rwxr-xr-x 1 root root 137442 Jun 24 15:50 /opt/openmpi/lib/libmpi_cxx.so.0.0.1

PATH and LD_LIBRARY_PATH contain the correct paths:

  jody@aim-squid_0 ~/progs $ echo $PATH
  /opt/openmpi/bin:/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/x86_64-pc-linux-gnu/gcc-bin/4.4.4
  jody@aim-squid_0 ~/progs $ echo $LD_LIBRARY_PATH
  /opt/openmpi/lib:

Am i missing something?

Thank You
jody
Re: [OMPI users] message truncated error
Hi Jack

Usually MPI_ERR_TRUNCATE means that the buffer you use in MPI_Recv
(or MPI::COMM_WORLD.Recv) is too small to hold the incoming message.
Check your code to make sure you assign enough memory to your buffers.

regards
Jody

On Mon, Nov 1, 2010 at 7:26 AM, Jack Bryan wrote:
> Hi,
> In my MPI program, the master sends many messages to another worker with
> the same tag.
> The worker uses
> MPI::COMM_WORLD.Recv(&message_para_to_one_worker, 1,
> message_para_to_workers_type, 0, downStreamTaskTag);
> to receive the messages.
> I got this error:
>
> [n36:94880] *** An error occurred in MPI_Recv
> [n36:94880] *** on communicator MPI_COMM_WORLD
> [n36:94880] *** MPI_ERR_TRUNCATE: message truncated
> [n36:94880] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [n36:94880] *** Process received signal ***
> [n36:94880] Signal: Segmentation fault (11)
> [n36:94880] Signal code: Address not mapped (1)
>
> Is this (the same tag) the reason for the errors?
> Any help is appreciated.
> thanks
> Jack
> Oct. 31 2010
[OMPI users] Help with a strange error
Hi
I have a (rather complex) OpenMPI application which works nicely.
In the main file i have the function main(), in which MPI_Comm_size()
and MPI_Comm_rank() are being called.

However, when i add a function check() to the main file, process 0 will
crash in PMPI_Comm_size(), even when the function check() is not called!
All other processes hang inside PMPI_Init().
The crash also occurs when the function check() is written after the
function main().

The gdb stack trace for process 0:

  Program received signal SIGSEGV, Segmentation fault.
  [Switching to Thread -1208715568 (LWP 10072)]
  0x0016cb16 in PMPI_Comm_size () from /opt/openmpi/lib/libmpi.so.0
  Current language: auto; currently c
  (gdb) where
  #0 0x0016cb16 in PMPI_Comm_size () from /opt/openmpi/lib/libmpi.so.0
  #1 0x080c379d in main (iArgC=14, apArgV=0xbfc60bc4) at TDMain.cpp:22
  Missing separate debuginfos, use: debuginfo-install gcc.i386 zlib.i386
  (gdb)

I am using OpenMPI 1.4.2

Has anybody got an idea how i could find the problem?

Thank You
Jody
Re: [OMPI users] Using hostfile with default hostfile
Where is the option 'default-hostfile' described?
It does not appear in mpirun's man page (for v. 1.4.2) and i couldn't find
anything like that by googling.

Jody

On Wed, Oct 27, 2010 at 4:02 PM, Ralph Castain wrote:
> Specify your hostfile as the default one:
>
> mpirun --default-hostfile ./Cluster.hosts
>
> Otherwise, we take the default hostfile and then apply the hostfile as a
> filter to select hosts from within it. Sounds strange, I suppose, but the
> idea is that the default hostfile can contain configuration info (#sockets,
> #cores/socket, etc.) that you might not want to have to put in every hostfile.
>
> On Oct 27, 2010, at 7:51 AM, Stefan Kuhne wrote:
>
>> Hello,
>>
>> my Cluster has a configured default hostfile.
>>
>> When i use another hostfile for one job i get:
>>
>> cluster-admin@Head:~/Cluster/hello$ mpirun --hostfile ../Cluster.hosts
>> ./hello
>> --
>> There are no allocated resources for the application
>> ./hello
>> that match the requested mapping:
>> ../Cluster.hosts
>>
>> Verify that you have mapped the allocated resources properly using the
>> --host or --hostfile specification.
>>
>> ...
>>
>> Any ideas for it?
>>
>> Regards,
>> Stefan Kuhne
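To make the "filter" behaviour Ralph describes more concrete, here is a rough model of it in shell. This is not how Open MPI implements it; the function name filter_hosts and the file names are invented, and host names are compared as plain strings (no DNS resolution, no matching of short names against fully qualified ones). It prints the entries of the default hostfile that the per-job hostfile selects, and fails when nothing is left, which corresponds to the "no allocated resources" message above.

```shell
#!/bin/sh
# filter_hosts DEFAULT_HOSTFILE JOB_HOSTFILE
# Prints the lines of DEFAULT_HOSTFILE whose host name (first field) also
# appears in JOB_HOSTFILE. Returns non-zero if no host survives the filter.
# Simplified illustration only; names are compared literally.
filter_hosts() {
    awk '
        /^[ \t]*(#|$)/         { next }                 # skip comments and blank lines
        FILENAME == ARGV[1]    { wanted[$1] = 1; next } # first file: per-job hostfile
        ($1 in wanted)         { print; n++ }           # second file: default hostfile
        END                    { exit (n > 0 ? 0 : 1) }
    ' "$2" "$1"
}
```

With a default hostfile listing node1 and node2, a job hostfile naming node2 yields the node2 entry including its slot information, while a job hostfile naming a host that is absent from the default file yields nothing and a failure status.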
Re: [OMPI users] Running simple MPI program
Hi Brandon
Does it work if you try this:

  mpirun -np 2 -hostfile hosts.txt ilk

(see http://www.open-mpi.org/faq/?category=running#simple-spmd-run)

jody

On Sat, Oct 23, 2010 at 4:07 PM, Brandon Fulcher wrote:
> Thank you for the response!
>
> The code runs on my own machine as well. Both machines, in fact. And I did
> not build MPI but installed the package from the ubuntu repositories.
>
> The problem occurs when I try to run a job using two machines or simply try
> to run it on a slave from the master.
>
> the actual command I have run along with the output is below:
>
> mpirun -hostfile hosts.txt ilk
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
>
> where hosts.txt contains:
> 192.168.0.2 cpu=2
> 192.168.0.6 cpu=1
>
> If it matters the same output is given if I define a remote host in the
> command such as (if I am on 192.168.0.2)
> mpirun -host 192.168.0.6 ilk
>
> Now if I run it locally, the job succeeds. This works from either cpu.
> mpirun ilk
>
> Thanks in advance.
>
> On Fri, Oct 22, 2010 at 11:59 PM, David Zhang wrote:
>>
>> since you said you're new to MPI, what command did you use to run the 2
>> processes?
>>
>> On Fri, Oct 22, 2010 at 9:58 PM, David Zhang wrote:
>>>
>>> your code works on my machine. could be the way you built mpi.
>>>
>>> On Fri, Oct 22, 2010 at 7:26 PM, Brandon Fulcher wrote:
>>>>
>>>> Hi, I am completely new to MPI and am having trouble running a job
>>>> between two cpus.
>>>>
>>>> The same thing happens no matter what MPI job I try to run, but here is
>>>> a simple 'hello world' style program I am trying to run.
>>>>
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>   int *buf, i, rank, nints, len;
>>>>   char hostname[256];
>>>>
>>>>   MPI_Init(&argc,&argv);
>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>   gethostname(hostname,255);
>>>>   printf("Hello world! I am process number: %d on host %s\n", rank,
>>>>          hostname);
>>>>   MPI_Finalize();
>>>>   return 0;
>>>> }
>>>>
>>>> On either CPU, I can successfully compile and run, but when trying to
>>>> run the program using two CPUs it fails with this output:
>>>>
>>>> --
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --
>>>>
>>>> With no additional information or errors, what can I do to go about
>>>> finding out what is wrong?
>>>>
>>>> I have read the FAQ and followed the instructions. I can ssh into the
>>>> slave without entering a password and have the libraries installed on both
>>>> machines.
>>>>
>>>> The only thing pertinent I could find is this faq
>>>> http://www.open-mpi.org/faq/?category=running#missing-prereqs but I do not
>>>> know if it applies since I have installed open mpi from the Ubuntu
>>>> repositories and assume the libraries are correctly set.
>>>
>>> --
>>> David Zhang
>>> University of California, San Diego
Re: [OMPI users] Question about MPI_Barrier
Hi
I don't know the reason for the strange behaviour, but anyway,
to measure time in an MPI application you should use MPI_Wtime(), not clock()
(clock() reports the CPU time consumed by the process, whereas MPI_Wtime()
returns elapsed wall-clock time).

regards
jody

On Wed, Oct 20, 2010 at 11:51 PM, Storm Zhang wrote:
> Dear all,
>
> I got confused with my recent C++ MPI program's behavior. I have an MPI
> program in which I use clock() to measure the time spent between two
> MPI_Barrier calls, just like this:
>
> MPI::COMM_WORLD.Barrier();
> if(rank == master) t1 = clock();
> "code A";
> MPI::COMM_WORLD.Barrier();
> if(rank == master) t2 = clock();
> "code B";
>
> I need to measure t2-t1 to see the time spent on the code A between these
> two MPI_Barriers. I notice that if I comment out code B, the time seems much
> less than the original time (almost half). How does it happen? What is a
> possible reason for it? I have no idea.
>
> Thanks for your help.
>
> Linbao
Re: [OMPI users] my leak or OpenMPI's leak?
But shouldn't something like this show up in the other processes as well?
I only see that in the master process, but the slave processes also send data
to each other and to the master.

On Mon, Oct 18, 2010 at 2:48 PM, Ralph Castain wrote:
>
> On Oct 18, 2010, at 1:41 AM, jody wrote:
>
>> I had this leak with OpenMPI 1.4.2
>>
>> But in my case, there is no accumulation - when i repeat the same call,
>> no additional leak is reported for the second call
>
> That's because it grabs a larger-than-required chunk of memory just in case
> you call again. This helps performance by reducing the number of malloc's in
> your application.
Re: [OMPI users] my leak or OpenMPI's leak?
I had this leak with OpenMPI 1.4.2

But in my case, there is no accumulation - when i repeat the same call,
no additional leak is reported for the second call.

Jody

On Mon, Oct 18, 2010 at 1:57 AM, Ralph Castain wrote:
> There is no OMPI 2.5 - do you mean 1.5?
>
> On Oct 17, 2010, at 4:11 PM, Brian Budge wrote:
>
>> Hi Jody -
>>
>> I noticed this exact same thing the other day when I used OpenMPI v
>> 2.5 built with valgrind support. I actually ran out of memory due to
>> this. When I went back to v 2.43, my program worked fine.
>>
>> Are you also using 2.5?
>>
>> Brian
Re: [OMPI users] connecting to MPI from outside
Hi Mahesh

At least in simple cases you can use normal socket functions for this.
I used this in order to change the run-time behaviour of a
master-worker MPI application. I implemented a simple TCP server which runs
in a separate thread on the master processor; by connecting to this server
i could then send commands which changed the state of the master.

Jody

On Tue, Oct 12, 2010 at 6:14 AM, Mahesh Salunkhe wrote:
>
> Hello,
> Could you please tell me how to connect a client (not in any mpi group) to a
> process in a mpi group
> (i.e. just like we do in socket programming by using the connect( ) call)?
> Does mpi provide any call for accepting connections from outside
> processes?
>
> --
> Regards
> Mahesh
[OMPI users] my leak or OpenMPI's leak?
Hi
I regularly use valgrind to check for leaks, but i ignore the leaks
clearly created by OpenMPI, because i think most of them are there for
efficiency reasons (no time is lost cleaning up unimportant leaks).
But i want to make sure no leaks come from my own apps.
In most cases, leaks i am responsible for have the name of one
of my files at the bottom of the stack printed by valgrind,
and no internal OpenMPI calls above, whereas leaks clearly caused by
OpenMPI have something like ompi_mpi_init, mca_pml_base_open, PMPI_Init
etc. at or very near the bottom.

Now i have an application where i am completely unsure where the
responsibility for a particular leak lies. valgrind shows (among
others) this report:

==2756== 9,704 (8,348 direct, 1,356 indirect) bytes in 1 blocks are definitely lost in loss record 2,033 of 2,036
==2756==    at 0x4005943: malloc (vg_replace_malloc.c:195)
==2756==    by 0x4049387: ompi_free_list_grow (in /opt/openmpi-1.4.2.p/lib/libmpi.so.0.0.2)
==2756==    by 0x41CA613: ???
==2756==    by 0x41BDD91: ???
==2756==    by 0x41B0C3D: ???
==2756==    by 0x408AC9C: PMPI_Send (in /opt/openmpi-1.4.2.p/lib/libmpi.so.0.0.2)
==2756==    by 0x8123377: ConnectorBase::send(CollectionBase*, std::pair, std::pair >&) (ConnectorBase.cpp:39)
==2756==    by 0x8123CEE: TileConnector::sendTile() (TileConnector.cpp:36)
==2756==    by 0x80C6839: TDMaster::init(int, char**) (TDMaster.cpp:226)
==2756==    by 0x80C167B: main (TDMain.cpp:24)
==2756==

At first glance it looks like an OpenMPI-internal leak,
because it happens inside PMPI_Send,
but then i am using the function ConnectorBase::send()
several times from callers other than TileConnector,
and these don't show up in valgrind's output.

Does anybody have an idea what is happening here?

Thank You
jody
Re: [OMPI users] a question about [MPI]IO on systems without network filesystem
Hi Paul

> Is it possible to configure/run OpenMPI in such a way that only _one_
> process (e.g. master) performs real disk I/O, and the other processes send
> the data to the master, which works as an agent?

It is possible to run OpenMPI this way, but it is a matter of
implementation in your application, not of configuration.

> Of course this would impact the performance, because all data must be sent
> over the network, and the master may become a bottleneck. But is such a
> scenario - IO of all processes bundled to one process - practicable at all?

I think this question can only be answered by trying, because it depends
strongly on the volume of your messages and the quality of your hardware
(network and disk speed).

Jody
Re: [OMPI users] Thread as MPI process
Hi
I don't know if i correctly understand what you need, but have you
already tried MPI_Comm_spawn?

Jody

On Mon, Sep 20, 2010 at 11:24 PM, Mikael Lavoie wrote:
> Hi,
>
> I want to know if there exists an implementation that permits running a
> single host process on the master of the cluster, which will then spawn
> 1 process per -np X defined thread at the hosts specified in the host list.
> The host will then act as a synchronized sender/collector of the work done.
>
> It would really be the holy grail of MPI implementations for me, for the
> use i want to make of it.
>
> So i await your answer, hoping that this exists,
>
> Mikael Lavoie
Re: [OMPI users] MPI_Reduce performance
Hi
@Ashley:
What are the exact semantics of an asynchronous barrier,
and is it part of the MPI specs?

Thanks
Jody

On Thu, Sep 9, 2010 at 9:34 PM, Ashley Pittman wrote:
>
> On 9 Sep 2010, at 17:00, Gus Correa wrote:
>
>> Hello All
>>
>> Gabrielle's question, Ashley's recipe, and Dick Treutmann's cautionary
>> words, may be part of a larger context of load balance, or not?
>>
>> Would Ashley's recipe of sporadic barriers be a silver bullet to
>> improve load imbalance problems, regardless of which collectives or
>> even point-to-point calls are in use?
>
> No, it only holds where there is no data dependency between some of the
> ranks. In particular, if there are any non-rooted collectives in an iteration
> of your code then it cannot make any difference at all; likewise, if you have
> a reduce followed by a barrier using the same root, for example, then you
> already have global synchronisation each iteration and it won't help. My
> feeling is that it applies to a significant minority of problems; certainly
> the phrase "adding barriers can make codes faster" should be textbook stuff
> if it isn't already.
>
>> Would sporadic barriers in the flux coupler "shake up" these delays?
>
> I don't fully understand your description but it sounds like it might set the
> program back to a clean slate which would give you per-iteration delays only
> rather than cumulative or worse delays.
>
>> Ashley: How did you get to the magic number of 25 iterations for the
>> sporadic barriers?
>
> Experience and finger in the air. The major factors in picking this number
> are the likelihood of a positive feedback cycle of delays happening, the
> delay these delays add and the cost of a barrier itself. Having too low a
> value will slightly reduce performance, having too high a value can
> drastically reduce performance.
>
> As a further item (because I like them) the asynchronous barrier is even
> better again if used properly: in the good case it doesn't cause any process
> to block ever, so the cost is only that of the CPU cycles the code takes
> itself; in the bad case where it has to delay a rank, this tends to have
> a positive impact on performance.
>
>> Would it be application/communicator pattern dependent?
>
> Absolutely.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
Re: [OMPI users] OpenMPI Segmentation fault (11)
Hi Jack

Yes to both questions.
Best to download it directly from their page:
  http://www.valgrind.org/downloads/current.html
then you are sure to get the newest version.

Another way to manage your output is to use the '-output-filename' option of
mpirun (or mpiexec), which will redirect the outputs (stdout, stderr and
stddiag) of your processes into separate text files - check the man page
for 'mpirun'.

If you don't need to see the output of all your processes,
but still want to use xterms, you can use the '-xterm' option
of mpirun, where you can select which ranks should open an xterm.
(Again, check the man page of mpirun.)

Jody

On Mon, Jul 26, 2010 at 8:55 AM, Jack Bryan wrote:
> Thanks
> It can be installed on linux and work with gcc ?
> If I have many processes, such as 30, I have to open 30 terminal windows ?
> thanks
> Jack
Re: [OMPI users] OpenMPI Segmentation fault (11)
Hi Jack

Have you tried to run your application under valgrind?
Even though applications generally run slower under valgrind,
it may detect memory errors before the actual crash happens.

The best would be to start a terminal window for each of your processes
so you can see valgrind's output for each process separately.

Jody

On Mon, Jul 26, 2010 at 4:08 AM, Jack Bryan wrote:
> Dear All,
> I run a 6 parallel processes on OpenMPI.
> When the run-time of the program is short, it works well.
> But, if the run-time is long, I got errors:
> [n124:45521] *** Process received signal ***
> [n124:45521] Signal: Segmentation fault (11)
> [n124:45521] Signal code: Address not mapped (1)
> [n124:45521] Failing at address: 0x44
> [n124:45521] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0]
> [n124:45521] [ 1] /lib64/libc.so.6(strlen+0x10) [0x3c50278d60]
> [n124:45521] [ 2] /lib64/libc.so.6(_IO_vfprintf+0x4479) [0x3c50246b19]
> [n124:45521] [ 3] /lib64/libc.so.6(_IO_printf+0x9a) [0x3c5024d3aa]
> [n124:45521] [ 4] /home/path/exec [0x40ec9a]
> [n124:45521] [ 5] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c5021d974]
> [n124:45521] [ 6] /home/path/exec [0x401139]
> [n124:45521] *** End of error message ***
> It seems that there may be some problems about memory management.
> But, I cannot find the reason.
> My program needs to write results to some files.
> If I open the files too many without closing them, I may get the above
> errors.
> But, I have removed the writing files from my program.
> The problem appears again when the program runs longer time.
> Any help is appreciated.
> Jack
> July 25 2010
Re: [OMPI users] how to log output of spawned processes
Thanks for the patch - it works fine!

Jody

On Mon, Jul 12, 2010 at 11:38 PM, Ralph Castain wrote:
> Just so you don't have to wait for 1.4.3 to be released, here is the patch.
> Ralph
>
> On Jul 12, 2010, at 2:44 AM, jody wrote:
>
>> yes, i'm using 1.4.2
>>
>> Thanks
>> Jody
>>
>> On Mon, Jul 12, 2010 at 10:38 AM, Ralph Castain wrote:
>>>
>>> On Jul 12, 2010, at 2:17 AM, jody wrote:
>>>
>>>> Hi
>>>>
>>>> I have a master process which spawns a number of workers of which i'd
>>>> like to save the output in separate files.
>>>>
>>>> Usually i use the '-output-filename' option in such a situation.
>>>> However, if i do
>>>> mpirun -np 1 -output-filename work_out master arg1 arg2
>>>> all the files work_out.1, work_out.2, ... are ok,
>>>> but work_out.0 contains both outputs of the master process (process 0
>>>> in COMM_WORLD) and
>>>> of the first worker (process 0 in the communicator of the spawned
>>>> processes).
>>>
>>> Crud - that's a bug.
>>>
>>>>
>>>> I also tried the '-tag-output' option, but this involves several
>>>> additional steps,
>>>> because i have to separate the combined outputs
>>>> mpirun -np 1 -tag-output master arg1 arg2 > total.out
>>>> grep "\[1,0\]" total.out | sed 's/\[1,0\]://' > master.out
>>>> grep "\[2,0\]" total.out | sed 's/\[2,0\]://' > worker_0.out
>>>> grep "\[2,1\]" total.out | sed 's/\[2,1\]://' > worker_1.out
>>>> ...
>>>> Of course, this could be wrapped in a script, but it is a bit cumbersome
>>>> (and i am not sure if the job-ids are always "1" and "2") ...
>>>>
>>>> Is there some simpler way to separate the output of the two streams?
>>>
>>> Not really.
>>>
>>>>
>>>> If not, would it be possible to extend the -output-filename option in
>>>> such a way that it
>>>> would also combine job-id and rank with the output file:
>>>> work_out.1.0
>>>> for the master's output, and
>>>> work_out.2.0
>>>> work_out.2.1
>>>> work_out.2.2
>>>> ...
>>>> for the worker's output?
>>>
>>> Yeah, I can do that - will put something together. Are you doing this in
>>> the 1.4 series?
>>>
>>>>
>>>> Thank You
>>>> Jody
Re: [OMPI users] Dynamic process tutorials?
Hi Brian

Generally it is possible to create new communicators from existing ones
(see for instance the various MPI_GROUP_* functions and MPI_COMM_CREATE).

> Also, how can you specify with MPI_Comm_spawn/multiple() how do you
> specify IP addresses on which to start the processes?

I haven't tried it yet with spawning, but i'd think this would also be done
by a rankfile.

> I would prefer not to use any of the MPI command-line utilities
> (mpirun/mpiexec) if that's possible.

If you don't like command-line utilities, you can write some graphical tool
which calls mpirun or mpiexec. But somewhere you have to tell OpenMPI what
to run, on how many processors, etc.

I'd suggest you take a look at "MPI - The Complete Reference", Vol. I and II.

Jody

On Mon, Jul 12, 2010 at 5:07 PM, Brian Budge wrote:
> Hi Jody -
>
> Thanks for the reply. is there a way of "fusing" intercommunicators?
> Let's say I have a higher level node scheduler, and it makes a new
> node available to a COMM that is already running. So the master
> spawns another process for that node. How can the new process
> communicate with the other already started processes?
>
> Also, how can you specify with MPI_Comm_spawn/multiple() how do you
> specify IP addresses on which to start the processes?
>
> If my higher level node scheduler needs to take away a process from my
> COMM, is it good/bad for that node to call MPI_Finalize as it exits?
>
> I would prefer not to use any of the MPI command-line utilities
> (mpirun/mpiexec) if that's possible.
>
> Thanks,
> Brian
>
> On Sat, Jul 10, 2010 at 11:53 PM, jody wrote:
>> Hi Brian
>> When you spawn processes with MPI_Comm_spawn(), one of the arguments
>> will be set to an intercommunicator of the spawner and the spawnees.
>> You can use this intercommunicator as the communicator argument
>> in the MPI_functions.
>>
>> Jody
>> On Fri, Jul 9, 2010 at 5:56 PM, Brian Budge wrote:
>>> Hi all -
>>>
>>> I've been looking at the dynamic process features of mpi-2. I have managed
>>> to actually launch processes using spawn, but haven't seen examples for
>>> actually communicating once these processes are launched. I am additionally
>>> interested in how processes created through multiple spawn calls can
>>> communicate.
>>>
>>> Does anyone know of resources that describe these topics? My google-fu must
>>> not be up to par :)
>>>
>>> Thanks,
>>> Brian
Re: [OMPI users] OpenMPI how large its buffer size ?
Hi

> mpi_irecv(workerNodeID, messageTag, bufferVector[row][column])

OpenMPI contains no function of this form. There is MPI_Irecv, but it takes
a different number of arguments. Or is this a boost method?
If yes, i guess you have to make sure that bufferVector[row][column]
is large enough...
Perhaps there is a boost forum you can check out if the problem persists.

Jody

On Sun, Jul 11, 2010 at 10:13 AM, Jack Bryan wrote:
> thanks for your reply.
> The message size is 72 bytes.
> The master sends out the message package to each 51 nodes.
> Then, after doing their local work, the worker node send back the same-size
> message to the master.
> Master use vector.push_back(new messageType) to receive each message from
> workers.
> Master use the
> mpi_irecv(workerNodeID, messageTag, bufferVector[row][column])
> to receive the worker message.
> the row is the rankID of each worker, the column is index for message from
> worker.
> Each worker may send multiple messages to master.
> when the worker node size is large, i got MPI_ERR_TRUNCATE error.
> Any help is appreciated.
> JACK
> July 10 2010
>
> Date: Sat, 10 Jul 2010 23:12:49 -0700
> From: eugene@oracle.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] OpenMPI how large its buffer size ?
>
> Jack Bryan wrote:
>
> The master node can receive message ( the same size) from 50 worker nodes.
> But, it cannot receive message from 51 nodes. It caused "truncate error".
>
> How big was the buffer that the program specified in the receive call? How
> big was the message that was sent?
>
> MPI_ERR_TRUNCATE means that you posted a receive with an application buffer
> that turned out to be too small to hold the message that was received. It's
> a user application error that has nothing to do with MPI's internal
> buffers. MPI's internal buffers don't need to be big enough to hold that
> message. MPI could require the sender and receiver to coordinate so that
> only part of the message is moved at a time.
>
> I used the same buffer to get the message in 50 node case.
> About ""rendezvous" protocol", what is the meaning of "the sender sends a
> short portion "?
> What is the "short portion", is it a small part of the message of the sender
> ?
>
> It's at least the message header (communicator, tag, etc.) so that the
> receiver can figure out if this is the expected message or not. In
> practice, there is probably also some data in there as well. The amount of
> that portion depends on the MPI implementation and, in practice, the
> interconnect the message traveled over, MPI-implementation-dependent
> environment variables set by the user, etc. E.g., with OMPI over shared
> memory by default it's about 4Kbytes (if I remember correctly).
>
> This "rendezvous" protocol" can work automatically in background without
> programmer
> indicates in his program ?
>
> Right. MPI actually allows you to force such synchronization with
> MPI_Ssend, but typically MPI implementations use it automatically for
> "plain" long sends as well even if the user did not use MPI_Ssend.
>
> The "acknowledgement " can be generated by the receiver only when the
> corresponding mpi_irecv is posted by the receiver ?
>
> Right.
Re: [OMPI users] how to log output of spawned processes
yes, i'm using 1.4.2

Thanks
Jody

On Mon, Jul 12, 2010 at 10:38 AM, Ralph Castain wrote:
>
> On Jul 12, 2010, at 2:17 AM, jody wrote:
>
>> Hi
>>
>> I have a master process which spawns a number of workers of which i'd
>> like to save the output in separate files.
>>
>> Usually i use the '-output-filename' option in such a situation.
>> However, if i do
>> mpirun -np 1 -output-filename work_out master arg1 arg2
>> all the files work_out.1, work_out.2, ... are ok,
>> but work_out.0 contains both outputs of the master process (process 0
>> in COMM_WORLD) and
>> of the first worker (process 0 in the communicator of the spawned processes).
>
> Crud - that's a bug.
>
>>
>> I also tried the '-tag-output' option, but this involves several
>> additional steps,
>> because i have to separate the combined outputs
>> mpirun -np 1 -tag-output master arg1 arg2 > total.out
>> grep "\[1,0\]" total.out | sed 's/\[1,0\]://' > master.out
>> grep "\[2,0\]" total.out | sed 's/\[2,0\]://' > worker_0.out
>> grep "\[2,1\]" total.out | sed 's/\[2,1\]://' > worker_1.out
>> ...
>> Of course, this could be wrapped in a script, but it is a bit cumbersome
>> (and i am not sure if the job-ids are always "1" and "2") ...
>>
>> Is there some simpler way to separate the output of the two streams?
>
> Not really.
>
>>
>> If not, would it be possible to extend the -output-filename option in
>> such a way that it
>> would also combine job-id and rank with the output file:
>> work_out.1.0
>> for the master's output, and
>> work_out.2.0
>> work_out.2.1
>> work_out.2.2
>> ...
>> for the worker's output?
>
> Yeah, I can do that - will put something together. Are you doing this in the
> 1.4 series?
>
>>
>> Thank You
>> Jody
[OMPI users] how to log output of spawned processes
Hi

I have a master process which spawns a number of workers, and i'd
like to save the output of each in a separate file.

Usually i use the '-output-filename' option in such a situation.
However, if i do
  mpirun -np 1 -output-filename work_out master arg1 arg2
all the files work_out.1, work_out.2, ... are ok,
but work_out.0 contains the output of both the master process (process 0
in COMM_WORLD) and the first worker (process 0 in the communicator of
the spawned processes).

I also tried the '-tag-output' option, but this involves several
additional steps, because i have to separate the combined outputs:
  mpirun -np 1 -tag-output master arg1 arg2 > total.out
  grep "\[1,0\]" total.out | sed 's/\[1,0\]://' > master.out
  grep "\[2,0\]" total.out | sed 's/\[2,0\]://' > worker_0.out
  grep "\[2,1\]" total.out | sed 's/\[2,1\]://' > worker_1.out
  ...
Of course, this could be wrapped in a script, but it is a bit cumbersome
(and i am not sure if the job-ids are always "1" and "2") ...

Is there some simpler way to separate the output of the two streams?

If not, would it be possible to extend the -output-filename option in
such a way that it also puts job-id and rank into the output file name:
  work_out.1.0
for the master's output, and
  work_out.2.0
  work_out.2.1
  work_out.2.2
  ...
for the workers' output?

Thank You
Jody
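The grep/sed steps described in the post could be wrapped in a small script. What follows is only a sketch of that idea, not an Open MPI feature: the function names (split_tagged_output, write_groups) are hypothetical, the "[jobid,rank]:" tag form is taken from the grep patterns in the post, and the optional "<channel>" part of the tag is an assumption about other mpirun versions. Job ids are read from the tags, so they do not have to be "1" and "2".

```python
import re
from collections import defaultdict

# Matches a tag such as "[1,0]:" or "[1,0]<stdout>:" at the start of a line.
# The optional "<channel>" part is an assumption, not taken from the post.
TAG = re.compile(r"^\[(\d+),(\d+)\](?:<(\w+)>)?:(.*)$")


def split_tagged_output(lines):
    """Group tagged lines by (jobid, rank); untagged lines go under None."""
    groups = defaultdict(list)
    for line in lines:
        stripped = line.rstrip("\n")
        match = TAG.match(stripped)
        if match:
            jobid, rank, _channel, text = match.groups()
            groups[(int(jobid), int(rank))].append(text)
        else:
            groups[None].append(stripped)
    return dict(groups)


def write_groups(groups, prefix="work_out"):
    """Write each group to <prefix>.<jobid>.<rank>; return the file names."""
    names = []
    for key, texts in groups.items():
        if key is None:
            continue  # untagged lines are not written anywhere
        name = f"{prefix}.{key[0]}.{key[1]}"
        with open(name, "w") as handle:
            handle.write("\n".join(texts) + "\n")
        names.append(name)
    return names


# Made-up sample lines in the tagged format, for illustration only.
sample = [
    "[1,0]:master starting",
    "[2,0]:worker 0 ready",
    "[2,1]<stdout>:worker 1 ready",
    "[2,0]:worker 0 done",
]
groups = split_tagged_output(sample)
print(groups[(2, 0)])
```

With real output, one would pass the lines of total.out (for example via open("total.out")) to split_tagged_output and then call write_groups on the result, which produces file names of the work_out.<jobid>.<rank> form asked for in the post.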
Re: [OMPI users] Dynamic process tutorials?
Hi Brian

When you spawn processes with MPI_Comm_spawn(), one of the arguments
will be set to an intercommunicator of the spawner and the spawnees.
You can use this intercommunicator as the communicator argument
in the MPI_functions.

Jody

On Fri, Jul 9, 2010 at 5:56 PM, Brian Budge wrote:
> Hi all -
>
> I've been looking at the dynamic process features of mpi-2. I have managed
> to actually launch processes using spawn, but haven't seen examples for
> actually communicating once these processes are launched. I am additionally
> interested in how processes created through multiple spawn calls can
> communicate.
>
> Does anyone know of resources that describe these topics? My google-fu must
> not be up to par :)
>
> Thanks,
> Brian
Re: [OMPI users] OpenMPI how large its buffer size ?
Perhaps i misunderstand your question...

Generally, it is the user's job to provide the buffers both to send and
receive. If you call MPI_Recv, you must pass a buffer that is large enough
to hold the data sent by the corresponding MPI_Send. I.e., if you know your
sender will send messages of 100kB, then you must provide a buffer of at
least 100kB to the receiver.

If the message size is unknown at compile time, you may have to send
two messages: first an integer which tells the receiver how large a buffer
it has to allocate, and then the actual message (which then nicely fits
into the freshly allocated buffer).

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "mpi.h"

#define SENDER 1
#define RECEIVER 0
#define TAG_LEN 77
#define TAG_DATA 78
#define MAX_MESSAGE 16

/* needs at least 2 processes: rank 0 receives, rank 1 sends */
int main(int argc, char *argv[]) {
    int num_procs;
    int rank;
    int *send_buf;
    int *recv_buf;
    int send_message_size;
    int recv_message_size;
    MPI_Status st;
    int i;

    /* initialize random numbers */
    srand(time(NULL));

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == RECEIVER) {
        /* the receiver */
        /* wait for message length */
        MPI_Recv(&recv_message_size, 1, MPI_INT, SENDER, TAG_LEN,
                 MPI_COMM_WORLD, &st);
        /* create a buffer of the required size */
        recv_buf = (int*) malloc(recv_message_size*sizeof(int));
        /* get data */
        MPI_Recv(recv_buf, recv_message_size, MPI_INT, SENDER, TAG_DATA,
                 MPI_COMM_WORLD, &st);

        printf("Receiver got %d integers:", recv_message_size);
        for (i = 0; i < recv_message_size; i++) {
            printf(" %d", recv_buf[i]);
        }
        printf("\n");

        /* clean up */
        free(recv_buf);

    } else if (rank == SENDER) {
        /* the sender */
        /* random message size */
        send_message_size = (int)((1.0*MAX_MESSAGE*rand())/(1.0*RAND_MAX));
        /* create a buffer of the required size */
        send_buf = (int*) malloc(send_message_size*sizeof(int));
        /* create random message */
        for (i = 0; i < send_message_size; i++) {
            send_buf[i] = rand();
        }

        printf("Sender has %d integers:", send_message_size);
        for (i = 0; i < send_message_size; i++) {
            printf(" %d", send_buf[i]);
        }
        printf("\n");

        /* send message size to receiver */
        MPI_Send(&send_message_size, 1, MPI_INT, RECEIVER, TAG_LEN,
                 MPI_COMM_WORLD);
        /* now send message */
        MPI_Send(send_buf, send_message_size, MPI_INT, RECEIVER, TAG_DATA,
                 MPI_COMM_WORLD);

        /* clean up */
        free(send_buf);
    }

    MPI_Finalize();
    return 0;
}

I hope this helps
Jody

On Sat, Jul 10, 2010 at 7:12 AM, Jack Bryan wrote:
> Dear All:
> How to find the buffer size of OpenMPI ?
> I need to transfer large data between nodes on a cluster with OpenMPI 1.3.4.
> Many nodes need to send data to the same node .
> Workers use mpi_isend, the receiver node use mpi_irecv.
> because they are non-blocking, the messages are stored in buffers of
> senders.
> And then, the receiver collect messages from its buffer.
> If the receiver's buffer is too small, there will be truncate error.
> Any help is appreciated.
> Jack
> July 9 2010
Re: [OMPI users] Open MPI error MPI_ERR_TRUNCATE: message truncated
Hi Jack

100 kbytes is not really a big message size.
My applications routinely exchange larger amounts.

The MPI_ERR_TRUNCATE error means that a buffer you provided to MPI_Recv
is too small to hold the data to be received.
Check the size of the data you send and compare it with the size of the
buffer you passed to MPI_Recv.

As Zhang suggested: try to reduce your code to isolate the offending code.
Can you create a simple application with two processes exchanging data
which has the MPI_ERR_TRUNCATE problem?

Jody

On Thu, Jul 8, 2010 at 5:39 AM, Jack Bryan wrote:
> thanks
> Wat if the master has to send and receive large data package ?
> It has to be splited into multiple parts ?
> This may increase communication overhead.
> I can use MPI_datatype to wrap it up as a specific datatype, which can carry
> the
> data.
> What if the data is very large? 1k bytes or 10 kbytes , 100 kbytes ?
> the master need to collect the same datatype from all workers.
> So, in this way, the master has to set up a data pool to get all data.
> The master's buffer provided by the MPI may not be large enough to do this.
> Are there some other ways to do it ?
> Any help is appreciated.
> thanks
> Jack
> july 7 2010
>
> From: solarbik...@gmail.com
> Date: Wed, 7 Jul 2010 17:32:27 -0700
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Open MPI error MPI_ERR_TRUNCATE: message truncated
>
> This error typically occurs when the received message is bigger than the
> specified buffer size. You need to narrow your code down to offending
> receive command to see if this is indeed the case.
>
> On Wed, Jul 7, 2010 at 8:42 AM, Jack Bryan wrote:
>
> Dear All:
> I need to transfer some messages from workers master node on MPI cluster
> with Open MPI.
> The number of messages is fixed.
> When I increase the number of worker nodes, i got error:
> --
> terminate called after throwing an instance of
> 'boost::exception_detail::clone_impl >>'
> what(): MPI_Unpack: MPI_ERR_TRUNCATE: message truncated
> [n231:45873] *** Process received signal ***
> [n231:45873] Signal: Aborted (6)
> [n231:45873] Signal code: (-6)
> [n231:45873] [ 0] /lib64/libpthread.so.0 [0x3c50e0e4c0]
> [n231:45873] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x3c50230215]
> [n231:45873] [ 2] /lib64/libc.so.6(abort+0x110) [0x3c50231cc0]
>
> --
> For 40 workers , it works well.
> But for 50 workers, it got this error.
> The largest message size is not more then 72 bytes.
> Any help is appreciated.
> thanks
> Jack
> July 7 2010
>
> --
> David Zhang
> University of California, San Diego
Re: [OMPI users] Open MPI, cannot get the results from workers
Hi

I solved this problem in such a way that my master listens for messages
from everybody (MPI_ANY_SOURCE) and reacts to all tags (MPI_ANY_TAG).

By looking at the status variable set by MPI_Recv, the master can find out
who sent the message (status.MPI_SOURCE) and what tag it has
(status.MPI_TAG) and react accordingly.

Jody

On Tue, Jul 6, 2010 at 7:41 AM, David Zhang wrote:
> if the master receives multiple results from the same worker, how does the
> master know which result (and the associated tag) arrive first? what MPI
> commands are you using exactly?
>
> On Mon, Jul 5, 2010 at 4:25 PM, Jack Bryan wrote:
>>
>> When the master sends out the task, it assign a distinct task number ID
>> to
>> the task.
>> When the worker receive the task, it still use the task's assigned ID as
>> task tag to send it to master.
>> Any help is appreciated.
>> July 5 2010
>>
>> From: solarbik...@gmail.com
>> Date: Mon, 5 Jul 2010 13:17:27 -0700
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] Open MPI, cannot get the results from workers
>>
>> how does the master receive results from the workers? if a worker is
>> sending multiple task results, how does the master knows what the message
>> tags are ahead of time?
>>
>> On Sun, Jul 4, 2010 at 10:26 AM, Jack Bryan
>> wrote:
>>
>> Dear All :
>> I designed a master-worker framework, in which the master can schedule
>> multiple tasks (numTaskPerWorkerNode) to each worker and then collects
>> results from workers.
>> if the numTaskPerWorkerNode = 1, it works well.
>> But, if numTaskPerWorkerNode > 1, the master cannot get the results from
>> workers.
>> But, the workers can get the tasks from master.
>> why ?
>>
>> I have used different taskTag to distinguish the tasks, but still does not
>> work.
>> Any help is appreciated.
>> Thanks,
>> Jack
>> July 4 2010
>>
>> --
>> David Zhang
>> University of California, San Diego
>
> --
> David Zhang
> University of California, San Diego
Re: [OMPI users] Open MPI task scheduler
Hi

I think your problem can be solved easily on the MPI level.
Just have your manager execute a loop in which it waits for any message.
Define different message types by their MPI tags. Once a message
has been received, decide what to do by looking at the tag.

Here i assume that a worker with no job sends a message with the tag
TAG_TASK_REQUEST and then waits to receive a message from the master
with either a new task or the command to exit (tag TAG_STOP).
Once a worker has finished a task, it sends a message with the tag
TAG_RESULT, and then sends a message containing the result.
I also assume that new tasks can be sent from a different node by using
the tag TAG_NEW_TASK.

The helper functions, TaskType and resultType below are placeholders.
The main loop in the master would be:

while (more_tasks) {
    MPI_Recv(&a, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &st);
    switch (st.MPI_TAG) {
    case TAG_TASK_REQUEST:
        sendNextTask(st.MPI_SOURCE);
        break;
    case TAG_RESULT:
        collectResult(st.MPI_SOURCE);
        break;
    case TAG_NEW_TASK:
        putNewTaskOnQueue(st.MPI_SOURCE);
        break;
    }
}

In a worker:

while (go_on) {
    MPI_Send(&a, 1, MPI_INT, idMaster, TAG_TASK_REQUEST, MPI_COMM_WORLD);
    MPI_Recv(&TaskDef, 1, TaskType, idMaster, MPI_ANY_TAG,
             MPI_COMM_WORLD, &st);
    if (st.MPI_TAG == TAG_STOP) {
        go_on = false;
    } else {
        result = workOnTask(TaskDef, TaskLen);
        MPI_Send(&a, 1, MPI_INT, idMaster, TAG_RESULT, MPI_COMM_WORLD);
        MPI_Send(&result, 1, resultType, idMaster, TAG_RESULT_CONTENT,
                 MPI_COMM_WORLD);
    }
}

I hope this helps
Jody

On Mon, Jun 21, 2010 at 12:17 AM, Jack Bryan wrote:
> Hi,
> thank you very much for your help.
> What is the meaning of " must find a system so that every task can be
> serialized in the same form." What is the meaning of "serize " ?
> I have no experience of programming with python and XML.
> I have studied your blog.
> Where can I find a simple example to use the techniques you have said ?
> For example, I have 5 task (print "hello world !").
> I want to use 6 processors to do it in parallel.
> One processor is the manager node who distributes tasks and other 5
> processors
> do the printing jobs and when they are done, they tell this to the manager
> node.
>
> Boost.Asio is a cross-platform C++ library for network and low-level I/O
> programming. I have no experiences of using it. Will it take a long time to
> learn
> how to use it ?
> If the messages are transferred by SOAP+TCP, how the manager node calls it
> and push task into it ?
> Do I need to install SOAP+TCP on my cluster so that I can use it ?
>
> Any help is appreciated.
> Jack
> June 20 2010
>> Date: Sun, 20 Jun 2010 21:00:06 +0200
>> From: matthieu.bruc...@gmail.com
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] Open MPI task scheduler
>>
>> 2010/6/20 Jack Bryan :
>> > Hi, Matthieu:
>> > Thanks for your help.
>> > Most of your ideas show that what I want to do.
>> > My scheduler should be able to be called from any C++ program, which can
>> > put
>> > a list of tasks to the scheduler and then the scheduler distributes the
>> > tasks to other client nodes.
>> > It may work like in this way:
>> > while(still tasks available) {
>> > myScheduler.push(tasks);
>> > myScheduler.get(tasks results from client nodes);
>> > }
>>
>> Exactly. In your case, you want only one server, so you must find a
>> system so that every task can be serialized in the same form. The
>> easiest way to do so is to serialize your parameter set as an XML
>> fragment and add the type of task as another field.
>>
>> > My cluster has 400 nodes with Open MPI. The tasks should be transferred
>> > by
>> > MPI protocol.
>>
>> No, they should not ;) MPI can be used, but it is not the easiest way
>> to do so. You still have to serialize your ticket, and you have to use
>> some functions that are from MPI2 (so perhaps not as portable as MPI1
>> functions). Besides, it cannot be used from programs that do not know
>> of using MPI protocols.
>>
>> > I am not familiar with RPC Protocol.
>>
>> RPC is not a protocol per se. SOAP is. RPC stands for Remote Procedure
>> Call. It is basically your scheduler that has several functions
>> clients can call:
>> - add tickets
>> - retrieve ticket
>> - ticket is done
>>
>> > If I use Boost.ASIO and some Python/GCCXML script to generate the code,
>> > it
>> > can be
>> > called from C++ program on Open MPI cluster ?
>>
>> Yes, SOAP is just an XML way of representing the fact that you call a
>> function on the server. You can use it with C++, Java, ... I use it
>> with Python to monitor
Re: [OMPI users] Allgather in inter-communicator bug,
Hi

I am really no python expert, but it looks to me as if you were
gathering arrays filled with zeroes:
  a = array('i', [0]) * n

Shouldn't this line be
  a = array('i', [r])*n
where r is the rank of the process?

Jody

On Thu, May 20, 2010 at 12:00 AM, Battalgazi YILDIRIM wrote:
> Hi,
>
> I am trying to use intercommunicator ::Allgather between two child process.
> I have fortran and Python code,
> I am using mpi4py for python. It seems that ::Allgather is not working
> properly in my desktop.
>
> I have contacted first mpi4py developers (Lisandro Dalcin), he simplified
> my problem and provided two example files (python.py and fortran.f90,
> please see below).
>
> We tried with different MPI vendors, the following example worked correctly (
> it means the final print out should be array('i', [1, 2, 3, 4, 5, 6, 7, 8])
> )
>
> However, it is not giving correct answer in my two desktop (Redhat and
> ubuntu) both
> using OPENMPI
>
> Could you look at this problem please?
>
> If you want to follow our discussion before you, you can go to following
> link:
> http://groups.google.com/group/mpi4py/browse_thread/thread/c17c660ae56ff97e
>
> yildirim@memosa:~/python_intercomm$ more python.py
> from mpi4py import MPI
> from array import array
> import os
>
> progr = os.path.abspath('a.out')
> child = MPI.COMM_WORLD.Spawn(progr,[], 8)
> n = child.remote_size
> a = array('i', [0]) * n
> child.Allgather([None,MPI.INT],[a,MPI.INT])
> child.Disconnect()
> print a
>
> yildirim@memosa:~/python_intercomm$ more fortran.f90
> program main
> use mpi
> implicit none
> integer :: parent, rank, val, dummy, ierr
> call MPI_Init(ierr)
> call MPI_Comm_get_parent(parent, ierr)
> call MPI_Comm_rank(parent, rank, ierr)
> val = rank + 1
> call MPI_Allgather(val, 1, MPI_INTEGER, &
> dummy, 0, MPI_INTEGER, &
> parent, ierr)
> call MPI_Comm_disconnect(parent, ierr)
> call MPI_Finalize(ierr)
> end program main
>
> yildirim@memosa:~/python_intercomm$ mpif90 fortran.f90
>
> yildirim@memosa:~/python_intercomm$ python python.py
> array('i', [0, 0, 0, 0, 0, 0, 0, 0])
>
> --
> B. Gazi YILDIRIM
Re: [OMPI users] How to show outputs from MPI program that runs on a cluster?
Hi

mpirun has an option for this (check the mpirun man page):

  -tag-output, --tag-output
      Tag each line of output to stdout, stderr, and stddiag with
      [jobid, rank] indicating the process jobid and rank that
      generated the output, and the channel which generated it.

Using this you can filter the entire output by grepping for the required rank.

Another possibility is to use the option

  -xterm, --xterm
      Display the specified ranks in separate xterm windows. The ranks
      are specified as a comma-separated list of ranges, with a -1
      indicating all. A separate window will be created for each
      specified rank. Note: In some environments, xterm may require
      that the executable be in the user's path, or be specified in
      absolute or relative terms. Thus, it may be necessary to specify
      a local executable as "./foo" instead of just "foo". If xterm
      fails to find the executable, mpirun will hang, but still respond
      correctly to a ctrl-c. If this happens, please check that the
      executable is being specified correctly and try again.

That way you can open a single terminal window for the process you are interested in.

Jody

On Thu, May 20, 2010 at 1:28 AM, Sang Chul Choi wrote:
> Hi,
>
> I am wondering if there is a way to run a particular process among multiple
> processes on the console of a linux cluster.
>
> I want to see the screen output (standard output) of a particular process
> (using a particular ID of a process) on the console screen while the MPI
> program is running. I think that if I run an MPI program on a linux cluster
> using Sun Grid Engine, the particular process that prints out to standard
> output could run on the console or on a computing node. And it would be hard
> to see the screen output of the particular process. Is there a way to set one
> process aside and to run it on the console in Sun Grid Engine?
>
> When I run the MPI program on my desktop with quad cores, I can set aside one
> process using an ID to print information that I need. I do not know how I
> could do that at a much larger scale, like using Sun Grid Engine. I could let
> one process print out to a file and then I could see it. I do not know how I
> could let one process print out on the console screen by setting it to run
> on the console using Sun Grid Engine or any other similar thing such as PBS.
> I doubt that a cluster would allow jobs to run on the console, because then
> other users would have trouble submitting jobs. If this is the case, there
> seems to be no way to print out on the console. Then, do I have to
> have a separate (non-MPI) program that can communicate with the MPI program
> using TCP/IP by running the separate program on the master node of a cluster?
> This separate non-MPI program may then communicate sporadically with the MPI
> program. I do not know if this is a general approach or a peculiar way.
>
> I will appreciate any input.
>
> Thank you,
>
> Sang Chul
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
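The "filter by grepping for the rank" idea from the reply above can also be sketched in C. This is only an illustration: it assumes that each tagged line starts with a prefix of the form "[jobid,rank]" (for example "[1,3]<stdout>:hello"), which should be checked against the actual output of your mpirun. The function names tagged_rank and filter_rank are made up for this sketch.

```c
#include <stdio.h>

/* Returns the rank from a line tagged like "[1,3]<stdout>:text",
 * or -1 if the line does not start with such a tag.
 * ASSUMPTION: the tag layout "[jobid,rank]" is taken from the man page
 * excerpt; verify it against your own mpirun output. */
int tagged_rank(const char *line)
{
    int jobid, rank;
    if (sscanf(line, "[%d,%d]", &jobid, &rank) == 2)
        return rank;
    return -1;
}

/* Copies only the lines produced by wanted_rank from in to out and
 * returns the number of lines copied. Lines longer than the buffer are
 * read in pieces, and only the first piece carries the tag. */
int filter_rank(FILE *in, FILE *out, int wanted_rank)
{
    char line[4096];
    int copied = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        if (tagged_rank(line) == wanted_rank) {
            fputs(line, out);
            copied++;
        }
    }
    return copied;
}
```

A small main that calls filter_rank(stdin, stdout, 2) could then sit at the end of a pipe fed by mpirun --tag-output, which does roughly what a grep on the tag does.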
Re: [OMPI users] Dynamic libraries in OpenMPI
Just to be sure: Is there a copy of the shared library on the other host (hpcnode1)?

jody

On Mon, May 10, 2010 at 5:20 PM, Prentice Bisbal wrote:
> Are you running these jobs through a queuing system like PBS, Torque, or SGE?
>
> Prentice
>
> Miguel Ángel Vázquez wrote:
>> Hello Prentice,
>>
>> Thank you for your advice but that doesn't solve the problem.
>>
>> The non-login bash properly updates the $LD_LIBRARY_PATH value.
>>
>> Any other idea?
>>
>> Thanks,
>>
>> Miguel
>>
>> 2010/5/7 Prentice Bisbal <prent...@ias.edu>
>>
>> Miguel Ángel Vázquez wrote:
>> > Dear all,
>> >
>> > I am trying to run a C++ program which uses dynamic libraries
>> > under mpi.
>> >
>> > The compilation command looks like:
>> >
>> > mpiCC `pkg-config --cflags itpp` -o montecarlo montecarlo.cpp
>> > `pkg-config --libs itpp`
>> >
>> > And it works if I execute it on one machine:
>> >
>> > mpirun -np 2 -H localhost montecarlo
>> >
>> > I tested this both on the "master node" and on the "compute nodes" and
>> > it works. However, when I try to run it with two different machines:
>> >
>> > mpirun -np 2 -H localhost,hpcnode1 montecarlo
>> >
>> > the program claims that it can't find the shared libraries:
>> >
>> > montecarlo: error while loading shared libraries: libitpp.so.6: cannot
>> > open shared object file: No such file or directory
>> >
>> > The LD_LIBRARY_PATH is set properly on every machine. Any idea where
>> > the problem is? I attached the config.log and the result of
>> > ompi_info --all
>> >
>> > Thank you in advance,
>> >
>> > Miguel
>>
>> Miguel,
>>
>> Shells behave differently depending on whether it is an interactive
>> login shell or a non-interactive shell. For example, the bash shell uses
>> .bash_profile in one case, but .bashrc in the other. Check the documentation
>> for your shell and see what files it uses in each case, and make sure
>> the non-login config file has the necessary settings for your MPI jobs.
>> It sounds like your login shell environment is okay, but your non-login
>> environment isn't set up correctly. This is a common problem.
>>
>> I use bash, and to keep it simple, my .bash_profile is just a symbolic
>> link to .bashrc. That way, both shell types have the same environment.
>> This isn't always a good idea, but in my case it's fine.
>>
>> --
>> Prentice
Re: [OMPI users] open-mpi behaviour on Fedora, Ubuntu, Debian and CentOS
Hi Asad

I must admit I don't know how one can find out whether extended precision is being used or not. I think one has to read up on the CPU's documentation. I only know that most 32-bit Intel processors use extended precision
  http://en.wikipedia.org/wiki/X86
as does the AMD Athlon
  http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/fpu_wp.pdf
but I think the AMD Opteron does not.

But I am no expert in this area - I only found out about this when I mentioned to someone the differences in the results obtained from a 32-bit platform and a 64-bit platform. Sorry.

Jody

On Mon, Apr 26, 2010 at 4:33 AM, Asad Ali wrote:
> Hi Jodi,
>
>> I once got different results when running on a 64-bit platform instead of
>> a 32-bit platform - if I remember correctly, the reason was that on the
>> 32-bit platform 80-bit extended precision floats were used but on the 64-bit
>> platform only 64-bit floats.
>
> Could you please give me an idea of how to check this extended precision?
> Also, I don't use float; I use only double or long double.
>
> Cheers,
>
> Asad
> --
> "Statistical thinking will one day be as necessary for efficient citizenship
> as the ability to read and write." - H.G. Wells
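On the question of how to check for extended precision: one possible starting point, offered as a sketch rather than a definitive answer, is the C99 macro FLT_EVAL_METHOD from <float.h>, which reports how the compiler evaluates floating-point expressions for the current build. The function names below are made up for this illustration, and the macro describes the compiler's evaluation model, not the CPU as such.

```c
#include <float.h>

/* FLT_EVAL_METHOD (C99, <float.h>):
 *    0  operations are evaluated in the precision of their own type
 *    1  float and double operations are evaluated as double
 *    2  all operations are evaluated as long double
 *       (what GCC typically reports for 32-bit x86 builds using the x87 FPU)
 *   -1  indeterminable */
int eval_method(void)
{
#ifdef FLT_EVAL_METHOD
    return (int)FLT_EVAL_METHOD;
#else
    return -1; /* macro not provided by this compiler */
#endif
}

/* Significand bits of double: 53 for IEEE 754 double precision. */
int double_bits(void)
{
    return DBL_MANT_DIG;
}

/* Significand bits of long double: 64 for the x87 80-bit format,
 * 53 where long double is the same as double. */
int long_double_bits(void)
{
    return LDBL_MANT_DIG;
}
```

Printing these three values on both platforms with the same compiler flags would show whether intermediate results of double arithmetic are kept in a wider format on one of them.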
Re: [OMPI users] open-mpi behaviour on Fedora, Ubuntu, Debian and CentOS
I once got different results when running on a 64-bit platform instead of a 32-bit platform - if I remember correctly, the reason was that on the 32-bit platform 80-bit extended precision floats were used but on the 64-bit platform only 64-bit floats.

On Sun, Apr 25, 2010 at 3:39 AM, Fabian Hänsel wrote:
> Hi Asad,
>
>> I found that running the same source code on these OS, with the same
>> versions of gcc and open-mpi installed on them, gives different results
>> than Fedora and Ubuntu after a few hundred iterations. The first
>> few hundred iterations are exactly the same as those of Fedora and Ubuntu
>> but then it starts giving different results.
>
> Are you also using the same hardware? Different hardware platforms may
> exhibit different rounding behaviour. After some dependent iterations such
> effects might indeed sum up and yield different results. The issue is called
> numeric (in)stability and is not specifically related to openmpi.
>
> Best regards
> Fabian
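To illustrate how rounding can make results differ, here is a minimal sketch (not taken from the thread) that adds the same three doubles in two different orders. The function names are made up; volatile is used so that every partial sum is rounded to double even on hardware that keeps wider registers.

```c
/* Adds (a + b) + c, rounding each partial sum to double. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    volatile double r = t + c;
    return r;
}

/* Adds a + (b + c), rounding each partial sum to double. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    volatile double r = a + t;
    return r;
}
```

With a = 1e16, b = -1e16 and c = 1.0, sum_left gives 1.0 while sum_right gives 0.0, because -1e16 + 1.0 cannot be represented as a double and is rounded back to -1e16. If the intermediate value were kept in 80-bit extended precision instead, the second order would also give 1.0, which is the kind of platform difference discussed above.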
Re: [OMPI users] Error on sending argv
Hi

You should remove the "&" from the first parameters of your MPI_Send and MPI_Recv:

    MPI_Send(text, strlen(text) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    MPI_Recv(buffer, 128, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

In C/C++ the name of an array is a pointer to the start of the array, and your text and buffer are already pointers to the character data, so they are passed as they are; &text is the address of the pointer variable itself. (However, I can't exactly explain why it worked with the hard-coded string.)

Jody

On Mon, Apr 19, 2010 at 6:31 PM, Andrew Wiles wrote:
> Hi all Open MPI users,
>
> I wrote a simple MPI program to send a text message to another process. The
> code is below.
>
> (test.c)
>
> #include "mpi.h"
> #include
> #include
> #include
>
> int main(int argc, char* argv[]) {
>     int dest, noProcesses, processId;
>     MPI_Status status;
>
>     char* buffer;
>
>     char* text = "ABCDEF";
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
>     MPI_Comm_rank(MPI_COMM_WORLD, &processId);
>
>     buffer = (char*) malloc(256 * sizeof(char));
>
>     if (processId == 0) {
>         fprintf(stdout, "Master: sending %s to %d\n", text, 1);
>         MPI_Send((void *)&text, strlen(text) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>     } else {
>         MPI_Recv(&buffer, 128, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
>         fprintf(stdout, "Slave: received %s from %d\n", buffer, status.MPI_SOURCE);
>     }
>     MPI_Finalize();
>     return 0;
> }
>
> After compiling and executing it I get the following output:
>
> [root@cluster Desktop]# mpicc -o test test.c
> [root@cluster Desktop]# mpirun -np 2 test
> Master: sending ABCDEF to 1
> Slave: received ABCDEF from 0
>
> In the source code above, I replace
>
>     char* text = "ABCDEF";
>
> by
>
>     char* text = argv[1];
>
> then compile and execute it again with the following commands:
>
> [root@cluster Desktop]# mpicc -o test test.c
> [root@cluster Desktop]# mpirun -np 2 test ABCDEF
>
> Then I get the following output:
>
> Master: sending ABCDEF to 1
> [cluster:03917] *** Process received signal ***
> [cluster:03917] Signal: Segmentation fault (11)
> [cluster:03917] Signal code: Address not mapped (1)
> [cluster:03917] Failing at address: 0xbfa445a2
> [cluster:03917] [ 0] [0x959440]
> [cluster:03917] [ 1] /lib/libc.so.6(_IO_fprintf+0x22) [0x76be02]
> [cluster:03917] [ 2] test(main+0x143) [0x80488b7]
> [cluster:03917] [ 3] /lib/libc.so.6(__libc_start_main+0xdc) [0x73be8c]
> [cluster:03917] [ 4] test [0x80486c1]
> [cluster:03917] *** End of error message ***
> --
> mpirun noticed that process rank 1 with PID 3917 on node cluster.hpc.org
> exited on signal 11 (Segmentation fault).
> --
>
> I'm very confused because the only difference between the two source codes
> is the difference between
>
>     char* text = "ABCDEF";
>
> and
>
>     char* text = argv[1];
>
> Can anyone explain why the results are so different? How can I send argv[i]
> to another process?
>
> Thank you very much!
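The difference between passing text and passing &text can be shown without MPI. The sketch below uses a plain memcpy as a stand-in for the data movement of a send/receive pair; copy_bytes and the two checking functions are hypothetical helpers for this illustration and not part of MPI.

```c
#include <string.h>

/* Stand-in for the data movement of a send: copy count bytes starting
 * at buf. Hypothetical helper, not an MPI call. */
void copy_bytes(void *dst, const void *buf, size_t count)
{
    memcpy(dst, buf, count);
}

/* Returns 1 if passing text delivers the characters of the string. */
int sends_characters(const char *text)
{
    char wire[64] = {0};
    if (strlen(text) + 1 > sizeof wire)
        return 0; /* too long for this small demo buffer */
    copy_bytes(wire, text, strlen(text) + 1);
    return strcmp(wire, text) == 0;
}

/* Returns 1 if passing &text delivers the pointer value itself, that is,
 * an address that is only meaningful inside the sending process.
 * Only sizeof text bytes are copied here to stay within bounds. */
int sends_pointer_value(const char *text)
{
    const char *received = NULL;
    copy_bytes(&received, &text, sizeof text);
    return received == text;
}
```

One possible explanation for the observed behaviour, which is my reading and not something confirmed in the thread: with &text and &buffer the receiver's buffer pointer is overwritten by the sender's pointer value. A string literal sits at the same address in every process started from the same executable, so the overwritten pointer still finds "ABCDEF" in the receiver. argv[1] lives on the stack (the failing address 0xbfa445a2 looks like a stack address), and that address need not be valid in the other process.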
Re: [OMPI users] Help om Openmpi
@Trent > the 1024 RSA has already been cracked. Yeah but unless you've got 3 guys spending 100 hours varying the voltage of your processors it is still safe... :) On Tue, Apr 6, 2010 at 11:35 AM, Reuti wrote: > Hi, > > Am 06.04.2010 um 09:48 schrieb Terry Frankcombe: > >>> 1. Run the following command on the client >>> * -> ssh-keygen -t dsa >>> 2. File id_dsa and id_dsa.pub will be created inside $HOME/.ssh >>> 3. Copy id_dsa.pub to the server's .ssh directory >>> * -> scp $HOME/.ssh/id_dsa.pub user@server:/home/user/.ssh >>> 4. Change to /root/.ssh and create file authorized_keys containing >>> id_dsa content >>> * -> cd /home/user/.ssh >>> * -> cat id_dsa >> authorized_keys >>> 5. You can try ssh to the server from the client and no password >>> will be needed >>> * -> ssh user@server >> >> That prescription is a little messed up. You need to create id_dsa and >> id_dsa.pub on the client, as above. >> >> But it is the client's id_dsa.pub that needs to go >> into /home/user/.ssh/authorized_keys on the server, which seems to be >> not what the above recipe does. >> >> If that doesn't help, try adding -v or even -v -v to the ssh command to >> see what the connection is trying to do w.r.t. your keys. > > inside a cluster I suggest hostbased authentication. No keys for the user, a > common used ssh_known_hosts file and a central place to look for errors. > > Passphraseless ssh-keys I just dislike as they tempt the user to copy them to > all remote location (especially the private part) to get more comfort while > using ssh between two remote clusters, but using an ssh-agent would in this > case be a more secure option. > > -- Reuti > > >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] openMPI on Xgrid
On Mar 30, 2010, at 11:12 AM, Cristobal Navarro wrote: i just have some questions, Torque requires moab, but from what i've read on the site you have to buy moab right? I am pretty sure you can download torque w/o moab. I do not use moab, which I think is a higher-level scheduling layer on top of pbs. However, there are folks here who would know far more than I do about these sorts of things. Cheers, Jody -- Jody Klymak http://web.uvic.ca/~jklymak/
Re: [OMPI users] openMPI on Xgrid
I have an environment a few trusted users could use to test. However, I have neither the expertise nor the time to do the debugging myself.

Cheers, Jody

On 2010-03-29, at 1:27 PM, Jeff Squyres wrote:

> On Mar 29, 2010, at 4:11 PM, Cristobal Navarro wrote:
>
>> i realized that xcode dev tools include openMPI 1.2.x
>> should i keep trying??
>> or do you recommend to completely abandon xgrid and go for another tool
>> like Torque with openMPI?
>
> FWIW, Open MPI v1.2.x is fairly ancient -- the v1.4 series includes a few
> years worth of improvements and bug fixes since the 1.2 series.
>
> It would be great (hint hint) if someone could fix the xgrid support for
> us... We simply no longer have anyone in the active development group who
> has the expertise or test environment to make our xgrid work. :-(
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] openMPI on Xgrid
On Mar 29, 2010, at 12:39 PM, Ralph Castain wrote:

> On Mar 29, 2010, at 1:34 PM, Cristobal Navarro wrote:
>
>> thanks for the information,
>> but is it possible to make it work with xgrid or the 1.4.1 version just
>> dont support it?

FWIW, I've had excellent success with Torque and openmpi on OS-X 10.5 Server.

http://www.clusterresources.com/products/torque-resource-manager.php

It doesn't have a nice dashboard, but the queue tools are more than adequate for my needs.

Open MPI had a funny port issue on my setup that folks helped with. From my notes:

Edited /Network/Xgrid/openmpi/etc/openmpi-mca-params.conf to make sure that the right ports are used:

    # set ports so that they are more valid than the default ones (see email from Ralph Castain)
    btl_tcp_port_min_v4 = 36900
    btl_tcp_port_range = 32

Cheers, Jody

--
Jody Klymak
http://web.uvic.ca/~jklymak/
Re: [OMPI users] Segmentation fault (11)
I'm not sure if this is the cause of your problems: You define the constant BUFFER_SIZE, but in the code you use a constant called BUFSIZ... Jody On Fri, Mar 26, 2010 at 10:29 PM, Jean Potsam wrote: > Dear All, > I am having a problem with openmpi . I have installed openmpi > 1.4 and blcr 0.8.1 > > I have written a small mpi application as follows below: > > ### > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > > #define BUFFER_SIZE PIPE_BUF > > char * getprocessid() > { > FILE * read_fp; > char buffer[BUFSIZ + 1]; > int chars_read; > char * buffer_data="12345"; > memset(buffer, '\0', sizeof(buffer)); > read_fp = popen("uname -a", "r"); > /* > ... > */ > return buffer_data; > } > > int main(int argc, char ** argv) > { > MPI_Status status; > int rank; >int size; > char * thedata; > MPI_Init(&argc, &argv); > MPI_Comm_size(MPI_COMM_WORLD,&size); > MPI_Comm_rank(MPI_COMM_WORLD,&rank); > thedata=getprocessid(); > printf(" the data is %s", thedata); > MPI_Finalize(); > } > > > I get the following result: > > ### > jean@sunn32:~$ mpicc pipetest2.c -o pipetest2 > jean@sunn32:~$ mpirun -np 1 -am ft-enable-cr -mca btl > ^openib pipetest2 > [sun32:19211] *** Process received signal *** > [sun32:19211] Signal: Segmentation fault (11) > [sun32:19211] Signal code: Address not mapped (1) > [sun32:19211] Failing at address: 0x4 > [sun32:19211] [ 0] [0xb7f3c40c] > [sun32:19211] [ 1] /lib/libc.so.6(cfree+0x3b) [0xb796868b] > [sun32:19211] [ 2] /usr/local/blcr/lib/libcr.so.0(cri_info_free+0x2a) > [0xb7a5925a] > [sun32:19211] [ 3] /usr/local/blcr/lib/libcr.so.0 [0xb7a5ac72] > [sun32:19211] [ 4] /lib/libc.so.6(__libc_fork+0x186) [0xb7991266] > [sun32:19211] [ 5] /lib/libc.so.6(_IO_proc_open+0x7e) [0xb7958b6e] > [sun32:19211] [ 6] /lib/libc.so.6(popen+0x6c) [0xb7958dfc] > [sun32:19211] [ 7] pipetest2(getprocessid+0x42) [0x8048836] > [sun32:19211] [ 8] pipetest2(main+0x4d) [0x8048897] > [sun32:19211] [ 9] 
/lib/libc.so.6(__libc_start_main+0xe5) [0xb7912455] > [sun32:19211] [10] pipetest2 [0x8048761] > [sun32:19211] *** End of error message *** > # > > > However, If I compile the application using gcc, it works fine. The problem > arises with: > read_fp = popen("uname -a", "r"); > > Does anyone has an idea how to resolve this problem? > > Many thanks > > Jean > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
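A note on the buffer question, with a sketch: BUFSIZ is a standard macro from <stdio.h>, so using it next to the program's own BUFFER_SIZE still compiles. The backtrace above points into libcr's handling of fork underneath popen, which the sketch below does not address. It only shows, under the assumption of a POSIX system, one way to read the first line of a command's output into a buffer supplied by the caller. The name first_line_of is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L /* popen/pclose under strict -std= modes */
#include <stdio.h>
#include <string.h>

/* Runs cmd and stores the first line of its output, without the
 * trailing newline, in out. Returns 0 on success and -1 on failure. */
int first_line_of(const char *cmd, char *out, size_t out_size)
{
    FILE *fp;
    int ok;

    if (out == NULL || out_size == 0)
        return -1;
    out[0] = '\0';

    fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;

    ok = fgets(out, (int)out_size, fp) != NULL;
    pclose(fp);
    if (!ok)
        return -1;

    out[strcspn(out, "\n")] = '\0'; /* cut at the first newline, if any */
    return 0;
}
```

Called as first_line_of("uname -a", buffer, sizeof buffer) with a char buffer[BUFSIZ], this avoids returning a pointer to a local array, since the caller owns the storage.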
Re: [OMPI users] problems on parallel writing
Hi

Just wanted to let you know: I translated your program to C, ran it, and it crashed at MPI_FILE_SET_VIEW in a similar way as yours did. Then I added an if-clause to prevent the call of MPI_FILE_WRITE with the undefined value:

    if (myid == 0) {
        MPI_File_write(fh, temp, count, MPI_DOUBLE, &status);
    }

After this it ran without a crash. However, the output is not what you expected: the number 2122010.0 was not there - probably overwritten by the MPI_FILE_WRITE_ALL. But this was fixed by replacing the line

    disp=0

by

    disp=8

and removing the

    if (single_no .gt. 0) map = map + 1

statement.

So here's what it all looks like:

===
program test_MPI_write_adv2

  !-- Template for any mpi program

  implicit none

  !--Include the mpi header file
  include 'mpif.h'              ! --> Required statement

  !--Declare all variables and arrays.
  integer :: fh, ierr, myid, numprocs, itag, etype, filetype, info
  integer :: status(MPI_STATUS_SIZE)
  integer :: irc, ip
  integer(kind=mpi_offset_kind) :: offset, disp
  integer :: i, j, k

  integer :: num

  character(len=64) :: filename

  real(8), pointer :: q(:), temp(:)
  integer, pointer :: map(:)
  integer :: single_no, count

  !--Initialize MPI
  call MPI_INIT( ierr )         ! --> Required statement

  !--Who am I? --- get my rank=myid
  call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )

  !--How many processes in the global group?
  call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

  if ( myid == 0 ) then
     single_no = 4
  elseif ( myid == 1 ) then
     single_no = 2
  elseif ( myid == 2 ) then
     single_no = 2
  elseif ( myid == 3 ) then
     single_no = 3
  else
     single_no = 0
  end if

  if (single_no .gt. 0) allocate(map(single_no))

  if ( myid == 0 ) then
     map = (/ 0, 2, 5, 6 /)
  elseif ( myid == 1 ) then
     map = (/ 1, 4 /)
  elseif ( myid == 2 ) then
     map = (/ 3, 9 /)
  elseif ( myid == 3 ) then
     map = (/ 7, 8, 10 /)
  end if

  if (single_no .gt. 0) allocate(q(single_no))

  if (single_no .gt. 0) then
     do i = 1,single_no
        q(i) = dble(myid+1)*100.0d0 + dble(map(i)+1)
     end do
  end if

  if ( myid == 0 ) then
     count = 1
  else
     count = 0
  end if

  if (count .gt. 0) then
     allocate(temp(count))
     temp(1) = 2122010.0d0
  end if

  write(filename,'(a)') 'test_write.bin'

  call MPI_FILE_OPEN(MPI_COMM_WORLD, filename, MPI_MODE_RDWR+MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr)

  if (myid == 0) then
     call MPI_FILE_WRITE(fh, temp, count, MPI_REAL8, status, ierr)
  endif

  call MPI_TYPE_CREATE_INDEXED_BLOCK(single_no, 1, map, MPI_DOUBLE_PRECISION, filetype, ierr)
  call MPI_TYPE_COMMIT(filetype, ierr)
  disp = 8   ! ---> size of MPI_REAL8 (number written when myid = 0)
  call MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION, filetype, 'native', MPI_INFO_NULL, ierr)
  call MPI_FILE_WRITE_ALL(fh, q, single_no, MPI_DOUBLE_PRECISION, status, ierr)
  call MPI_FILE_CLOSE(fh, ierr)

  if (single_no .gt. 0) deallocate(map)
  if (single_no .gt. 0) deallocate(q)
  if (count .gt. 0) deallocate(temp)

  !--Finalize MPI
  call MPI_FINALIZE(irc)        ! ---> Required statement

  stop

end program test_MPI_write_adv2
===

Regards
jody

On Thu, Feb 25, 2010 at 2:47 AM, Terry Frankcombe wrote:
> On Wed, 2010-02-24 at 13:40 -0500, w k wrote:
>> Hi Jordy,
>>
>> I don't think this part caused the problem. For fortran, it doesn't
>> matter if the pointer is NULL as long as the count requested from the
>> processor is 0. Actually I tested the code and it passed this part
>> without problem. I believe it aborted at MPI_FILE_SET_VIEW part.
>>
>
> For the record: A pointer is not NULL unless you've nullified it.
> IIRC, the Fortran standard says that any non-assigning reference to an
> unassigned, unnullified pointer is undefined (or maybe illegal... check
> the standard).
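The effect of disp = 8 together with the zero-based map can be checked with plain arithmetic. The sketch below is my own illustration and contains no MPI calls. It assumes that the view starts disp bytes into the file and that each map entry counts whole elements of etype_size bytes (8 for a double precision value). The name file_offset is hypothetical.

```c
#include <stddef.h>

/* Byte offset in the file of the element with index map_index, when the
 * view starts disp bytes into the file and elements are etype_size
 * bytes each. ASSUMPTION: this mirrors how the file view and the
 * indexed-block filetype address the file in the program above. */
long file_offset(long disp, int map_index, size_t etype_size)
{
    return disp + (long)map_index * (long)etype_size;
}
```

With disp = 8, map entry 0 lands at byte 8, directly after the single value written by rank 0, and map entry 10 lands at byte 88. With disp = 0, map entry 0 would land at byte 0, on top of that first value.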
Re: [OMPI users] problems on parallel writing
Hi I can't answer your question about the array q offhand, but i will try to translate your program to C and see if it fails the same way. Jody On Wed, Feb 24, 2010 at 7:40 PM, w k wrote: > Hi Jordy, > > I don't think this part caused the problem. For fortran, it doesn't matter > if the pointer is NULL as long as the count requested from the processor is > 0. Actually I tested the code and it passed this part without problem. I > believe it aborted at MPI_FILE_SET_VIEW part. > > Just curious, how does C handle the case that we need to collect data in > array q but only part of the processors has q with a length greater than 0? > > Thanks for your reply, > Kan > > > > > On Wed, Feb 24, 2010 at 2:29 AM, jody wrote: >> >> Hi >> I know nearly nothing about fortran >> but it looks to me as the pointer 'temp' in >> >> > call MPI_FILE_WRITE(FH, temp, COUNT, MPI_REAL8, STATUS, IERR) >> >> is not defined (or perhaps NULL?) for all processors except processor 0 : >> >> > if ( myid == 0 ) then >> > count = 1 >> > else >> > count = 0 >> > end if >> > >> > if (count .gt. 0) then >> > allocate(temp(count)) >> > temp(1) = 2122010.0d0 >> > end if >> >> In C/C++ something like this would almost certainly lead to a crash, >> but i don't know if this would be the case in Fortran... >> jody >> >> >> On Wed, Feb 24, 2010 at 4:38 AM, w k wrote: >> > Hello everyone, >> > >> > >> > I'm trying to implement some functions in my code using parallel >> > writing. >> > Each processor has an array, say q, whose length is single_no(could be >> > zero >> > on some processors). I want to write q down to a common file, but the >> > elements of q would be scattered to their locations in this file. The >> > locations of the elements are described by a map. I wrote my testing >> > code >> > according to an example in a MPI-2 tutorial which can be found here: >> > www.npaci.edu/ahm2002/ahm_ppt/Parallel_IO_MPI_2.ppt. 
This way of writing >> > is >> > called "Accessing Irregularly Distributed Arrays" in this tutorial and >> > the >> > example is given in page 42. >> > >> > I tested my code with mvapich and got the result as expected. But when I >> > tested it with openmpi, it didn't work. I tried the version 1.2.8 and >> > 1.4 >> > and both didn't work. I tried two clusters. Both of them are intel chips >> > (woodcrest and nehalem), DDR infiniband with Linux system. I got some >> > error >> > message like >> > >> > +++ >> > [n0883:08251] *** Process received signal *** >> > [n0883:08249] *** Process received signal *** >> > [n0883:08249] Signal: Segmentation fault (11) >> > [n0883:08249] Signal code: Address not mapped (1) >> > [n0883:08249] Failing at address: (nil) >> > [n0883:08251] Signal: Segmentation fault (11) >> > [n0883:08251] Signal code: Address not mapped (1) >> > [n0883:08251] Failing at address: (nil) >> > [n0883:08248] *** Process received signal *** >> > [n0883:08250] *** Process received signal *** >> > [n0883:08248] Signal: Segmentation fault (11) >> > [n0883:08248] Signal code: Address not mapped (1) >> > [n0883:08248] Failing at address: (nil) >> > [n0883:08250] Signal: Segmentation fault (11) >> > [n0883:08250] Signal code: Address not mapped (1) >> > [n0883:08250] Failing at address: (nil) >> > [n0883:08251] [ 0] /lib64/libpthread.so.0 [0x2b4f0a2f0d60] >> > +++ >> > >> > >> > >> > My testing code is here: >> > >> > >> > === >> > program test_MPI_write_adv2 >> > >> > >> > !-- Template for any mpi program >> > >> > implicit none >> > >> > !--Include the mpi header file >> > include 'mpif.h' ! --> Required statement >> > >> > !--Declare all variables and arrays. >> > integer :: fh, ierr, myid, numprocs, itag, etype, filetype, info >> > integer :: status(MPI_STATUS_SIZE) >> > integer :: irc, ip >> > integer(kind=mpi_offset_kind) :: offset, disp >> > integer :: i, j, k >
Re: [OMPI users] MPi Abort verbosity
Hi Gabriele

you could always pipe your output through grep:

    my_app | grep "MPI_ABORT was invoked"

jody

On Wed, Feb 24, 2010 at 11:28 AM, Gabriele Fatigati wrote:
> Hi Nadia,
>
> thanks for the quick reply.
>
> But I suppose that parameter is 0 by default. Suppose I have the following
> output:
>
> - --
>
> - --> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> with errorcode 4. <--
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> - --
> Inside my_mpi_err_handler
> Inside my_mpi_err_handler
> I am 0 and we are in 2
> I am 1 and we are in 2
> - --
> mpirun has exited due to process rank 0 with PID 3773 on
> node nb-user exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> - --
> - --
>
> I would like to see only this:
>
> - --> MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
> with errorcode 4. <--
>
> And nothing else. Is it possible?
>
> I can upgrade my OpenMPI if necessary.
>
> Thanks.
>
>
> 2010/2/24 Nadia Derbey
>>
>> On Wed, 2010-02-24 at 09:55 +0100, Gabriele Fatigati wrote:
>> >
>> > Dear Openmpi users and developers,
>> >
>> > I have a question about the MPI_Abort error message. I have a program
>> > written in C++. Is there a way to decrease the verbosity of this error?
>> > When this function is called, openmpi prints a lot of information like
>> > the stack trace, the rank of the processor that called MPI_Abort, etc.
>> > But I'm interested only in the calling rank. Is it possible?
>>
>> Hi,
>>
>> Setting the mca parameter "mpi_abort_print_stack" to 0 makes the stack
>> not printed out.
>> >
>> > Thanks in advance.
>> >
>> > I'm using openmpi 1.2.2
>>
>> ... well, don't know if it's available in that release...
>>
>>
>> Regards,
>> Nadia
>> > --
>> > Ing. Gabriele Fatigati
>> >
>> > Parallel programmer
>> >
>> > CINECA Systems & Tecnologies Department
>> >
>> > Supercomputing Group
>> >
>> > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>> >
>> > www.cineca.it Tel: +39 051 6171722
>> >
>> > g.fatigati [AT] cineca.it
>> --
>> Nadia Derbey
>>
>
>
>
> --
> Ing. Gabriele Fatigati
>
> Parallel programmer
>
> CINECA Systems & Tecnologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.it Tel: +39 051 6171722
>
> g.fatigati [AT] cineca.it
Re: [OMPI users] problems on parallel writing
Hi I know nearly nothing about fortran but it looks to me as the pointer 'temp' in > call MPI_FILE_WRITE(FH, temp, COUNT, MPI_REAL8, STATUS, IERR) is not defined (or perhaps NULL?) for all processors except processor 0 : > if ( myid == 0 ) then > count = 1 > else > count = 0 > end if > > if (count .gt. 0) then > allocate(temp(count)) > temp(1) = 2122010.0d0 > end if In C/C++ something like this would almost certainly lead to a crash, but i don't know if this would be the case in Fortran... jody On Wed, Feb 24, 2010 at 4:38 AM, w k wrote: > Hello everyone, > > > I'm trying to implement some functions in my code using parallel writing. > Each processor has an array, say q, whose length is single_no(could be zero > on some processors). I want to write q down to a common file, but the > elements of q would be scattered to their locations in this file. The > locations of the elements are described by a map. I wrote my testing code > according to an example in a MPI-2 tutorial which can be found here: > www.npaci.edu/ahm2002/ahm_ppt/Parallel_IO_MPI_2.ppt. This way of writing is > called "Accessing Irregularly Distributed Arrays" in this tutorial and the > example is given in page 42. > > I tested my code with mvapich and got the result as expected. But when I > tested it with openmpi, it didn't work. I tried the version 1.2.8 and 1.4 > and both didn't work. I tried two clusters. Both of them are intel chips > (woodcrest and nehalem), DDR infiniband with Linux system. 
I got some error > message like > > +++ > [n0883:08251] *** Process received signal *** > [n0883:08249] *** Process received signal *** > [n0883:08249] Signal: Segmentation fault (11) > [n0883:08249] Signal code: Address not mapped (1) > [n0883:08249] Failing at address: (nil) > [n0883:08251] Signal: Segmentation fault (11) > [n0883:08251] Signal code: Address not mapped (1) > [n0883:08251] Failing at address: (nil) > [n0883:08248] *** Process received signal *** > [n0883:08250] *** Process received signal *** > [n0883:08248] Signal: Segmentation fault (11) > [n0883:08248] Signal code: Address not mapped (1) > [n0883:08248] Failing at address: (nil) > [n0883:08250] Signal: Segmentation fault (11) > [n0883:08250] Signal code: Address not mapped (1) > [n0883:08250] Failing at address: (nil) > [n0883:08251] [ 0] /lib64/libpthread.so.0 [0x2b4f0a2f0d60] > +++ > > > > My testing code is here: > > === > program test_MPI_write_adv2 > > > !-- Template for any mpi program > > implicit none > > !--Include the mpi header file > include 'mpif.h' ! --> Required statement > > !--Declare all variables and arrays. > integer :: fh, ierr, myid, numprocs, itag, etype, filetype, info > integer :: status(MPI_STATUS_SIZE) > integer :: irc, ip > integer(kind=mpi_offset_kind) :: offset, disp > integer :: i, j, k > > integer :: num > > character(len=64) :: filename > > real(8), pointer :: q(:), temp(:) > integer, pointer :: map(:) > integer :: single_no, count > > > !--Initialize MPI > call MPI_INIT( ierr ) ! --> Required statement > > !--Who am I? --- get my rank=myid > call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr ) > > !--How many processes in the global group? > call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr ) > > if ( myid == 0 ) then > single_no = 4 > elseif ( myid == 1 ) then > single_no = 2 > elseif ( myid == 2 ) then > single_no = 2 > elseif ( myid == 3 ) then > single_no = 3 > else > single_no = 0 > end if > > if (single_no .gt. 
0) allocate(map(single_no)) > > if ( myid == 0 ) then > map = (/ 0, 2, 5, 6 /) > elseif ( myid == 1 ) then > map = (/ 1, 4 /) > elseif ( myid == 2 ) then > map = (/ 3, 9 /) > elseif ( myid == 3 ) then > map = (/ 7, 8, 10 /) > end if > > if (single_no .gt. 0) allocate(q(single_no)) > > if (single_no .gt. 0) then > do i = 1,single_no > q(i) = dble(myid+1)*100.0d0 + dble(map(i)+1) > end do > end if > > if (single_no .gt. 0) map = map + 1 > > if ( myid == 0 ) then > count = 1 > else > count = 0 > end if > > if (count .gt. 0) then > allocate(temp(count)) > temp(1) = 2122010.0d0 > end if > > write(filename,'(a)') 'test_write.bin' > > call MPI_FI
Re: [OMPI users] Non-homogeneous Cluster Implementation
Hi I'm not sure i completely understood. Is it the case that an application compiled on the dell will not work on the PS3 and vice versa? If this is the case, you could try this: shell$ mpirun -np 1 --host a app_ps3 : -np 1 --host b app_dell where app_ps3 is your application compiled on the PS3 and a is your PS3 host, and app_dell is your application compiled on the dell, and b is your dell host. Check the MPI FAQs http://www.open-mpi.org/faq/?category=running#mpmd-run http://www.open-mpi.org/faq/?category=running#mpirun-host Hope this helps Jody On Thu, Jan 28, 2010 at 3:08 AM, Lee Manko wrote: > OK, so please stop me if you have heard this before, but I couldn’t find > anything in the archives that addressed my situation. > > > > I have a Beowulf cluster where ALL the node are PS3s running Yellow Dog > Linux 6.2 and a host (server) that is a Dell i686 Quad-core running Fedora > Core 12. After a failed attempt at letting yum install openmpi, I > downloaded v1.4.1, compiled and installed on all machines (PS3s and > Dell). I have an NSF shared directory on the host where the application > resides after building. All nodes have access to the shared volume and they > can see any files in the shared volume. > > > > I wrote a very simple master/slave application where the slave does a simple > computation and gets the processor name. The slave returns both pieces of > information to the master who then simply displays it in the terminal > window. After the slaves work on 1024 such tasks, the master exists. > > > > When I run on the host, without distributing to the nodes, I use the > command: > > > > “mpirun –np 4 ./MPI_Example” > > > > Compiling and running the application on the native hardware works perfectly > (ie: compiled and run on the PS3 or compiled and run on the Dell). > > > > However, when I went to scatter the tasks to the nodes, using the following > command, > > > > “mpirun –np 4 –hostfile mpi-hostfile ./MPI_Example” > > > > the application fails. 
I’m surmising that the issue is with running code > that was compiled for the Dell on the PS3 since the MPI_Init will launch the > application from the shared volume. > > > > So, I took the source code and compiled it on both the Dell and the PS3 and > placed the executables in /shared_volume/Dell and /shared_volume/PS3 and > added the paths to the environment variable PATH. I tried to run the > application from the host again using the following command, > > > > "mpirun -np 4 -hostfile mpi-hostfile -wdir /shared_volume/PS3 ./MPI_Example" > > > > Hoping that the wdir would set the working directory at the time of the call > to MPI_Init() so that MPI_Init will launch the PS3 version of the > executable. > > > > I get the error: > > Could not execute the executable “./MPI_Example” : Exec format error > > This could mean that your PATH or executable name is wrong, or that you do > not > > have the necessary permissions. Please ensure that the executable is able > to be > > found and executed. > > > > Now, I know I’m gonna get some heat for this, but all of these machines use > only the root account with full root privileges, so it’s not a permission > issue. > > > > > > I am sure there is a simple solution to my problem. Replacing the host with a > PS3 is not an option. Does anyone have any suggestions? > > > > Thanks. > > > > PS: When I get to programming the Cell BE, then I’ll use the IBM Cell SDK > with its cross-compiler toolchain. > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
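The colon-separated MPMD launch suggested in the reply can be sketched as a small helper that assembles the mpirun command line from a list of per-architecture entries. This is only an illustration in Python: the helper name mpmd_command is made up here, and the hosts a and b and the binaries app_ps3 and app_dell are the placeholders from the reply, not real machines or files.

```python
import shlex


def mpmd_command(contexts):
    """Build an MPMD mpirun command line as an argument list.

    contexts is a list of (np, host, executable) tuples. Each tuple becomes
    one "-np N --host H EXE" group, and the groups are separated by ":".
    """
    argv = ["mpirun"]
    for index, (np, host, executable) in enumerate(contexts):
        if index > 0:
            argv.append(":")  # separator between application contexts
        argv += ["-np", str(np), "--host", host, executable]
    return argv


if __name__ == "__main__":
    # Placeholder hosts and binaries taken from the reply above.
    argv = mpmd_command([(1, "a", "app_ps3"), (1, "b", "app_dell")])
    print(shlex.join(argv))
```

Each executable is started only on the host it was built for, which is the point of the suggestion: the "Exec format error" in the quoted report is what one would expect when a binary built for one architecture is started on the other.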
Re: [OMPI users] man-files not installed
Thanks, that did it! BTW, in the man page for mpirun you should perhaps mention the "!" option in xterm - the one that keeps the xterms open after the application exits. Thanks Jody On Mon, Dec 21, 2009 at 3:25 PM, Ralph Castain wrote: > Is your MANPATH set to point to /opt/openmpi/man? Check the order as well to > make sure that is first - it could be that an older install (like the system default) > is before it. > > On Dec 21, 2009, at 5:46 AM, jody wrote: > >> Hi >> I just installed open-mpi version 1.4, >> and now i noticed that the man-files are not properly installed. >> >> When i do >> man mpirun >> i get a different output than what is in >> openmpi/share/man/man1/mpirun.1 >> >> I installed with this configuration: >> ./configure --prefix=/opt/openmpi-1.4 --disable-mpi-f77 >> --disable-mpi-f90 --with-threads >> and afterwards made a soft link >> >> ln -s /opt/openmpi-1.4 /opt/openmpi >> >> This is on fedora fc8, but i have the same problem on my gentoo >> machines (2.6.29-gentoo-r5) >> Does anybody know how to replace the old man files with the new ones? >> >> Thank You >> Jody
[OMPI users] man-files not installed
Hi I just installed open-mpi version 1.4, and now i noticed that the man-files are not properly installed. When i do man mpirun i get a different output than what is in openmpi/share/man/man1/mpirun.1 I installed with this configuration: ./configure --prefix=/opt/openmpi-1.4 --disable-mpi-f77 --disable-mpi-f90 --with-threads and afterwards made a soft link ln -s /opt/openmpi-1.4 /opt/openmpi This is on fedora fc8, but i have the same problem on my gentoo machines (2.6.29-gentoo-r5) Does anybody know how to replace the old man files with the new ones? Thank You Jody
Re: [OMPI users] Debugging spawned processes
Hi Ralph I finally got around to installing version 1.4. The xterm works fine. And in order to get gdb going on the spawned processes, i need to add an argument "--args" in the argument list of the spawner so that the parameters of the spawned processes get passed through gdb. Thanks again Jody On Fri, Dec 18, 2009 at 10:46 PM, Ashley Pittman wrote: > On Wed, 2009-12-16 at 12:06 +0100, jody wrote: > >> Has anybody got some hints on how to debug spawned processes? > > If you can live with the processes starting normally and attaching gdb > to them after they have started then you could use padb. > > Assuming you only have one job active (replace -a with the job-id if you > don't) and want to target the first spawned job then the following > command will launch an xterm for each rank in the job and automatically > attach to the process for you. > > padb -Oorte-job-step=2 --command -Ocommand="xterm -T %r -e 'gdb -p %p'" -a > > You'll need to use the SVN version of padb for this, the "orte-job-step" > option tells it to attach to the first spawned job, use orte-ps to see > the list of job steps. > > Ashley, > > -- > > Ashley Pittman, Bath, UK. > > Padb - A parallel job inspection tool for cluster computing > http://padb.pittman.org.uk
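The "--args" fix described above, combined with the idea from earlier in the thread of letting the application decide whether to debug its children, might look like the following sketch. It only builds the command name and argument list that would be handed to the spawn call; the function name spawn_target and the worker name ./worker are hypothetical, and the example parameters are borrowed from the mpirun line quoted elsewhere in this thread.

```python
def spawn_target(app, params, debug=False):
    """Return (command, args) to hand to a spawn call.

    With debug=False the application is spawned directly. With debug=True
    gdb is spawned instead, and "--args" tells gdb that everything after
    the program name belongs to the program rather than to gdb itself.
    """
    if not debug:
        return app, list(params)
    return "gdb", ["--args", app, *params]


if __name__ == "__main__":
    # "./worker" is a made-up name for the spawned executable.
    print(spawn_target("./worker", ["9", "sample.tlf"]))
    print(spawn_target("./worker", ["9", "sample.tlf"], debug=True))
```

Switching between a normal run and a debugging run then comes down to one flag (for instance a command-line option of the spawning program) instead of editing the spawn arguments by hand each time.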
Re: [OMPI users] Debugging spawned processes
yeah, now that you mention it, i remember (old brain here, as well) But IIRC you created an OMPI version which was called 1.4a1r or something, where i indeed could use this xterm. When i updated to 1.3.2, i sort of forgot about it again... Another question though: You said "If it includes the -xterm option, then that option gets applied to the dynamically spawned procs too" Does this passing on also apply to the -x options? Thanks Jody On Wed, Dec 16, 2009 at 3:42 PM, Ralph Castain wrote: > It is in a later version - pretty sure it made 1.3.3. IIRC, I added it at > your request :-) > > On Dec 16, 2009, at 7:20 AM, jody wrote: > >> Thanks for your reply >> >> That sounds good. I have Open-MPI version 1.3.2, and mpirun seems not >> to recognize the --xterm option. >> [jody@plankton tileopt]$ mpirun --xterm -np 1 ./boss 9 sample.tlf >> -- >> mpirun was unable to launch the specified application as it could not >> find an executable: >> >> Executable: 1 >> Node: aim-plankton.uzh.ch >> >> while attempting to start process rank 0. >> -- >> (if i reverse the --xterm and -np 1, it complains about not finding >> executable '9') >> Do i need to install a higher version, or is this something i'd have >> to set as option in configure? >> >> Thank You >> Jody >> >> On Wed, Dec 16, 2009 at 1:00 PM, Ralph Castain wrote: >>> Depends on the version you are working with. If it includes the -xterm >>> option, then that option gets applied to the dynamically spawned procs too, >>> so this should be automatically taken care of...but in that case, you >>> wouldn't need your script to open an xterm anyway. You would just do: >>> >>> mpirun --xterm -np 5 gdb ./my_app >>> >>> or the equivalent. You would then comm_spawn an argv[0] of "gdb", with >>> argv[1] being your target app. >>> >>> I don't know how to avoid including that "gdb" in the comm_spawn argv's - I >>> once added an mpirun cmd line option to automatically add it, but got >>> loudly told to remove it. 
Of course, it should be easy to pass an option >>> to your app itself that tells it whether or not to do so! >>> >>> HTH >>> Ralph >>> >>> >>> On Dec 16, 2009, at 4:06 AM, jody wrote: >>> >>>> Hi >>>> Until now i always wrote applications for which the number of processes >>>> was given on the command line with -np. >>>> To debug these applications i wrote a script, run_gdb.sh which basically >>>> open a xterm and starts gdb in it for my application. >>>> This allowed me to have a window for each of the processes being debugged. >>>> >>>> Now, however, i write my first application in which additional processes >>>> are >>>> being spawned. My question is now: how can i open xterm windows in which >>>> gdb runs for the spawned processes? >>>> >>>> The only way i can think of is to pass my script run_gdb.sh into the argv >>>> parameters of MPI_Spawn. >>>> Would this be correct? >>>> If yes, what about other parameters passed to the spawning process, such as >>>> environment variables passed via -x? Are they being passed to the spawned >>>> processes as well? In my case this would be necessary so that processes >>>> on other machine will get the $DISPLAY environment variable in order to >>>> display their xterms with gdb on my workstation. >>>> >>>> Another negative point would be the need to change the argv parameters >>>> every time one switches between debugging and normal running. >>>> >>>> Has anybody got some hints on how to debug spawned processes? >>>> >>>> Thank You >>>> Jody >>>> ___ >>>> users mailing list >>>> us...@open-mpi.org >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] Debugging spawned processes
Thanks for your reply That sounds good. I have Open-MPI version 1.3.2, and mpirun seems not to recognize the --xterm option. [jody@plankton tileopt]$ mpirun --xterm -np 1 ./boss 9 sample.tlf -- mpirun was unable to launch the specified application as it could not find an executable: Executable: 1 Node: aim-plankton.uzh.ch while attempting to start process rank 0. -- (if i reverse the --xterm and -np 1, it complains about not finding executable '9') Do i need to install a higher version, or is this something i'd have to set as option in configure? Thank You Jody On Wed, Dec 16, 2009 at 1:00 PM, Ralph Castain wrote: > Depends on the version you are working with. If it includes the -xterm > option, then that option gets applied to the dynamically spawned procs too, > so this should be automatically taken care of...but in that case, you > wouldn't need your script to open an xterm anyway. You would just do: > > mpirun --xterm -np 5 gdb ./my_app > > or the equivalent. You would then comm_spawn an argv[0] of "gdb", with > argv[1] being your target app. > > I don't know how to avoid including that "gdb" in the comm_spawn argv's - I > once added an mpirun cmd line option to automatically add it, but got loudly > told to remove it. Of course, it should be easy to pass an option to your > app itself that tells it whether or not to do so! > > HTH > Ralph > > > On Dec 16, 2009, at 4:06 AM, jody wrote: > >> Hi >> Until now i always wrote applications for which the number of processes >> was given on the command line with -np. >> To debug these applications i wrote a script, run_gdb.sh which basically >> open a xterm and starts gdb in it for my application. >> This allowed me to have a window for each of the processes being debugged. >> >> Now, however, i write my first application in which additional processes are >> being spawned. My question is now: how can i open xterm windows in which >> gdb runs for the spawned processes? 
>> >> The only way i can think of is to pass my script run_gdb.sh into the argv >> parameters of MPI_Spawn. >> Would this be correct? >> If yes, what about other parameters passed to the spawning process, such as >> environment variables passed via -x? Are they being passed to the spawned >> processes as well? In my case this would be necessary so that processes >> on other machine will get the $DISPLAY environment variable in order to >> display their xterms with gdb on my workstation. >> >> Another negative point would be the need to change the argv parameters >> every time one switches between debugging and normal running. >> >> Has anybody got some hints on how to debug spawned processes? >> >> Thank You >> Jody >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
[OMPI users] Debugging spawned processes
Hi Until now i always wrote applications for which the number of processes was given on the command line with -np. To debug these applications i wrote a script, run_gdb.sh, which basically opens an xterm and starts gdb in it for my application. This allowed me to have a window for each of the processes being debugged. Now, however, i am writing my first application in which additional processes are being spawned. My question is now: how can i open xterm windows in which gdb runs for the spawned processes? The only way i can think of is to pass my script run_gdb.sh into the argv parameters of MPI_Comm_spawn. Would this be correct? If yes, what about other parameters passed to the spawning process, such as environment variables passed via -x? Are they being passed to the spawned processes as well? In my case this would be necessary so that processes on other machines will get the $DISPLAY environment variable in order to display their xterms with gdb on my workstation. Another negative point would be the need to change the argv parameters every time one switches between debugging and normal running. Has anybody got some hints on how to debug spawned processes? Thank You Jody
Re: [OMPI users] Open MPI Query
Hi >> 2) Do MPI_Send() and MPI_Recv() calls send messages from a process on >> one machine to a >> process on another machine? If yes, then how can I achieve this? > > Take a look at what the example codes are doing. Read man mpirun. Wait > for someone here to point you to an MPI primer or tute. > Have a look at the Open MPI FAQ: http://www.open-mpi.org/faq/?category=running It shows you how to run an Open MPI program on single or multiple machines. Jody
Re: [OMPI users] OMPI-1.2.0 is not getting installed
Sorry, i can't help you here. I have no experience with either Intel compilers or IB. Jody On Wed, Oct 21, 2009 at 4:14 AM, Sangamesh B wrote: > > > On Tue, Oct 20, 2009 at 6:48 PM, jody wrote: >> >> Hi >> Just curious: >> Is there a particular reason why you want version 1.2? > > Yes. Our cluster is installed with Intel MKL-10.0. This version of MKL > contains a static blacs library which is compatible with OMPI-1.2 as told by > Intel support team. > > http://software.intel.com/en-us/forums/intel-math-kernel-library/topic/69104/ > > Is it possible to get it installed? > Thanks, > Sangamesh >> >> The current version is 1.3.3! >> >> Jody >> >> On Tue, Oct 20, 2009 at 2:48 PM, Sangamesh B wrote: >> > Hi, >> > >> > It's required here to install Open MPI 1.2 on an HPC cluster with >> > CentOS 5.2 Linux, Mellanox IB card, switch and OFED-1.4. >> > But the configure is failing with: >> > >> > [root@master openmpi-1.2]# ./configure --prefix=/opt/mpi/openmpi/1.2/intel --with-openib=/usr >> > .. >> > ... >> > >> > --- MCA component btl:openib (m4 configuration macro) >> > checking for MCA component btl:openib compile mode... dso >> > checking for sysfs_open_class in -lsysfs... no >> > configure: error: OpenIB support requested but required sysfs not found. >> > Aborting >> > >> > even though the required rpms are available: >> > >> > # rpm -qa | grep sysfs >> > sysfsutils-2.0.0-6 >> > libsysfs-2.0.0-6 >> > libsysfs-2.0.0-6 >> > >> > >> > What should I do to get OMPI-1.2 specifically installed? >> > >> > Thanks
Re: [OMPI users] OMPI-1.2.0 is not getting installed
Hi Just curious: Is there a particular reason why you want version 1.2? The current version is 1.3.3! Jody On Tue, Oct 20, 2009 at 2:48 PM, Sangamesh B wrote: > Hi, > > It's required here to install Open MPI 1.2 on an HPC cluster with > CentOS 5.2 Linux, Mellanox IB card, switch and OFED-1.4. > But the configure is failing with: > > [root@master openmpi-1.2]# ./configure --prefix=/opt/mpi/openmpi/1.2/intel --with-openib=/usr > .. > ... > > --- MCA component btl:openib (m4 configuration macro) > checking for MCA component btl:openib compile mode... dso > checking for sysfs_open_class in -lsysfs... no > configure: error: OpenIB support requested but required sysfs not found. > Aborting > > even though the required rpms are available: > > # rpm -qa | grep sysfs > sysfsutils-2.0.0-6 > libsysfs-2.0.0-6 > libsysfs-2.0.0-6 > > > What should I do to get OMPI-1.2 specifically installed? > > Thanks
Re: [OMPI users] how to set up the cluster of 5 nodes in openmpi
Hi All of your questions are answered in the FAQ... If you have a TCP/IP connection between your machines so that each machine can reach every other one, that will be ok. First make sure you can get access from each machine to every other one using ssh without a password. See the FAQ: http://www.open-mpi.org/faq/?category=rsh Make sure to set PATH and LD_LIBRARY_PATH as described in the FAQ: http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path Next, make sure your application is accessible by all of your machines. I use an nfs directory shared by all my machines, and that is where i put the application. To start your application, follow the instructions in the FAQ: http://www.open-mpi.org/faq/?category=running If you want to use host files, read about how to use them in the FAQ: http://www.open-mpi.org/faq/?category=running#mpirun-host Hope that helps Jody On Wed, Sep 30, 2009 at 11:00 AM, ankur pachauri wrote: > Dear all, > > I have been able to install open mpi on two independent machines having FC > 10. The simple hello world programs are running fine on the independent > machines. But can anyone please help me by letting me know how to connect > the two machines and run a common program between the two? How do we do > a lamboot -v lamhosts in the case of openmpi? > How do we get open mpi running on the two computers simultaneously and > execute a common program on the two machines? > > Thanks in advance > > > On Wed, Sep 30, 2009 at 12:24 PM, jody wrote: >> >> Hi >> Have a look at the Open MPI FAQ: >> >> http://www.open-mpi.org/faq/ >> >> It gives you all the information you need to start working with your >> cluster. >> >> Jody >> >> >> On Wed, Sep 30, 2009 at 8:25 AM, ankur pachauri >> wrote: >> > dear all, >> > >> > i am new to openmpi, all that i need is to set up the cluster of around >> > 5 >> > nodes in my lab, i am using fedora 7 in the lab. 
so i'll be thankfull to >> > you >> > if let me know the steps or the procedure to setup the cluster(as in >> > case of >> > lam/mpi- passwordless ssh or nfs mount and ...). >> > >> > regards, >> > >> > -- >> > Ankur Pachauri. >> > 09927590910 >> > >> > Research Scholar, >> > software engineering. >> > Department of Mathematics >> > Dayalbagh Educational Institute >> > Dayalbagh, >> > AGRA >> > >> > ___ >> > users mailing list >> > us...@open-mpi.org >> > http://www.open-mpi.org/mailman/listinfo.cgi/users >> > >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > -- > Ankur Pachauri. > 09927590910 > > Research Scholar, > software engineering. > Department of Mathematics > Dayalbagh Educational Institute > Dayalbagh, > AGRA > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
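On the lamboot question quoted above: Open MPI has no separate boot step; mpirun reads the host list itself and starts the remote processes over ssh. A hostfile is just one line per machine. The sketch below writes such lines in the "host slots=N" form that appears in hostfiles elsewhere in this archive. The helper name hostfile_text and the node names node01 and node02 are invented for illustration.

```python
def hostfile_text(slots_by_host):
    """Render one "host slots=N" line per machine, in the order given."""
    return "".join(
        f"{host} slots={slots}\n" for host, slots in slots_by_host.items()
    )


if __name__ == "__main__":
    # Made-up node names; replace them with names every machine can resolve.
    print(hostfile_text({"node01": 2, "node02": 1}), end="")
```

With the result saved as, say, myhosts (again a made-up name), a launch along the lines of "mpirun -np 3 --hostfile myhosts ./my_app" would replace the lamboot-then-mpirun sequence known from LAM/MPI, assuming passwordless ssh and identical paths on all nodes as described in the reply.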
Re: [OMPI users] how to set up the cluster of 5 nodes in openmpi
Hi Have a look at the Open MPI FAQ: http://www.open-mpi.org/faq/ It gives you all the information you need to start working with your cluster. Jody On Wed, Sep 30, 2009 at 8:25 AM, ankur pachauri wrote: > dear all, > > i am new to openmpi, all that i need is to set up the cluster of around 5 > nodes in my lab, i am using fedora 7 in the lab. so i'll be thankful to you > if you let me know the steps or the procedure to set up the cluster (as in the case of > lam/mpi: passwordless ssh or nfs mount and ...). > > regards, > > -- > Ankur Pachauri. > 09927590910 > > Research Scholar, > software engineering. > Department of Mathematics > Dayalbagh Educational Institute > Dayalbagh, > AGRA
Re: [OMPI users] MPI_Irecv segmentation fault
Did you also change the "&buffer" to buffer in your MPI_Send call? Jody On Tue, Sep 22, 2009 at 1:38 PM, Everette Clemmer wrote: > Hmm, tried changing MPI_Irecv( &buffer) to MPI_Irecv( buffer...) > and still no luck. Stack trace follows if that's helpful: > > prompt$ mpirun -np 2 ./display_test_debug > Sending 'q' from node 0 to node 1 > [COMPUTER:50898] *** Process received signal *** > [COMPUTER:50898] Signal: Segmentation fault (11) > [COMPUTER:50898] Signal code: (0) > [COMPUTER:50898] Failing at address: 0x0 > [COMPUTER:50898] [ 0] 2 libSystem.B.dylib > 0x7fff87e280aa _sigtramp + 26 > [COMPUTER:50898] [ 1] 3 ??? > 0x 0x0 + 0 > [COMPUTER:50898] [ 2] 4 GLUT > 0x000100024a21 glutMainLoop + 261 > [COMPUTER:50898] [ 3] 5 display_test_debug > 0x00011444 xsMainLoop + 67 > [COMPUTER:50898] [ 4] 6 display_test_debug > 0x00011335 main + 59 > [COMPUTER:50898] [ 5] 7 display_test_debug > 0x00010d9c start + 52 > [COMPUTER:50898] [ 6] 8 ??? > 0x0001 0x0 + 1 > [COMPUTER:50898] *** End of error message *** > mpirun noticed that job rank 0 with PID 50897 on node COMPUTER.local > exited on signal 15 (Terminated). > 1 additional process aborted (not shown) > > Thanks, > Everette > > > On Tue, Sep 22, 2009 at 2:28 AM, Ake Sandgren > wrote: >> On Mon, 2009-09-21 at 19:26 -0400, Everette Clemmer wrote: >>> Hey all, >>> >>> I'm getting a segmentation fault when I attempt to receive a single >>> character via MPI_Irecv. Code follows: >>> >>> void recv_func() { >>> if( !MASTER ) { >>> char buffer[ 1 ]; >>> int flag; >>> MPI_Request request; >>> MPI_Status status; >>> >>> MPI_Irecv( &buffer, 1, MPI_CHAR, 0, MPI_ANY_TAG, >>> MPI_COMM_WORLD, &request); >> >> It should be MPI_Irecv(buffer, 1, ...) >> >>> The segfault disappears if I comment out the MPI_Irecv call in >>> recv_func so I'm assuming that there's something wrong with the >>> parameters that I'm passing to it. Thoughts? 
>> >> -- >> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden >> Internet: a...@hpc2n.umu.se Phone: +46 90 7866134 Fax: +46 90 7866126 >> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> > > > > -- > - Everette > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
Re: [OMPI users] Timers
Hi I'm not sure if i completely understand your requirements, but have you tried MPI_Wtime? Jody On Fri, Sep 11, 2009 at 7:54 AM, amjad ali wrote: > Hi all, > I want to get the elapsed time from start to end of my parallel program > (OPENMPI based). It should always give the same time for the same problem, > irrespective of whether the nodes are running some other programs or they are > running only that program. How to do this? > > Regards.
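A note on what MPI_Wtime measures: it returns wall-clock seconds, so the usual pattern is to take the difference of two readings, and that difference does grow when the node is busy with other work. The requirement in the question (the same number regardless of load) is closer to CPU time than to wall-clock time. The difference can be illustrated without MPI; the sketch below uses Python's standard timers as stand-ins (time.perf_counter for wall clock, time.process_time for CPU time). It is an analogy under that assumption, not Open MPI code, and the helper name measure is made up.

```python
import time


def measure(work):
    """Run work() and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall clock, like two MPI_Wtime readings
    cpu_start = time.process_time()    # CPU time used by this process only
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    # Sleeping stands in for time spent waiting (other jobs, the network, ...).
    wall, cpu = measure(lambda: time.sleep(0.2))
    print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

Waiting shows up in the wall-clock figure but hardly at all in the CPU figure. Even CPU time is not perfectly reproducible on a shared node (caches and memory bandwidth are shared too), so fully load-independent timings generally need otherwise idle nodes.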
Re: [OMPI users] Problem with linking on OS X
Hi Tomek, Did you try mpicc --showme? I get: gcc -D_REENTRANT -I/Network/Xgrid/openmpi/include -L/Network/Xgrid/openmpi/lib -lmpi -lopen-rte -lopen-pal -lutil If your -L isn't correct in there, then I would guess your configuration found the wrong library somehow when you compiled mpicc and friends... Cheers, Jody On Aug 19, 2009, at 15:57 PM, tomek wrote: OK - I have fixed it by including -L/opt/openmpi/lib at the very beginning of mpicc ... -L/opt/openmpi/lib -o app.exe the rest ... But something is wrong with dyld anyhow. On 19 Aug 2009, at 21:04, Jody Klymak wrote: Hi Tomek, I'm using 10.5.7, and just went through a painful process that we thought was library related (but it wasn't), so I'll give my less-than-learned response, and if you still have difficulties hopefully others will chime in: What is the result of "which mpicc" (or whatever you are using for your compiling/linking)? I'm pretty sure that's where the library paths get set, and if you are calling /usr/bin/mpicc you will get the wrong library paths in the executable. On Aug 19, 2009, at 10:57 AM, tomek wrote: -- Jody Klymak http://web.uvic.ca/~jklymak/
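Checking the -L entries in the wrapper's output by eye is easy for one machine; for several it may help to pull them out programmatically. The following sketch takes the text printed by "mpicc --showme" and returns the library directories in order. The function name library_dirs is hypothetical, and the sample line is the one quoted in the message above.

```python
import shlex


def library_dirs(showme_output):
    """Return the directories given as -L<dir> on a compiler command line.

    Only the attached form (-L/some/dir) is handled; a detached "-L dir"
    pair is ignored by this sketch.
    """
    return [
        token[2:]
        for token in shlex.split(showme_output)
        if token.startswith("-L") and len(token) > 2
    ]


if __name__ == "__main__":
    line = ("gcc -D_REENTRANT -I/Network/Xgrid/openmpi/include "
            "-L/Network/Xgrid/openmpi/lib -lmpi -lopen-rte -lopen-pal -lutil")
    print(library_dirs(line))
```

If the first directory listed is not the lib directory of the intended Open MPI installation, the wrapper being called is probably not the one that was just built, which is what the "which mpicc" question in the quoted reply is getting at.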
Re: [OMPI users] Problem with linking on OS X
Hi Tomek, I'm using 10.5.7, and just went through a painful process that we thought was library related (but it wasn't), so I'll give my less-than-learned response, and if you still have difficulties hopefully others will chime in: What is the result of "which mpicc" (or whatever you are using for your compiling/linking)? I'm pretty sure that's where the library paths get set, and if you are calling /usr/bin/mpicc you will get the wrong library paths in the executable. On Aug 19, 2009, at 10:57 AM, tomek wrote: 1. Using DYLD_LIBRARY_PATH 2. passing some ./configure --with-wrapper-ldflags="-L/opt/openmpi/lib" 3. passing some ./configure --with-wrapper-ldflags="-rpath /opt/openmpi/lib" 4. hand compilation with cc -L/opt/openmpi/lib -lmpi 2 and 3 did not work (ld error=22) With 1 and 2 my code still gets linked with /usr/lib/libmpi... Note that the /opt/openmpi/bin path is properly set and ompi_info does output the right info. You do not need to set DYLD_LIBRARY_PATH. I don't have it set and my mpi applications run fine. Did 4 work? Cheers, Jody -- Jody Klymak http://web.uvic.ca/~jklymak/
Re: [OMPI users] --rankfile
Hi I had a similar problem. Following a suggestion from Lenny, i removed the "max-slots" entries from my hostfile and it worked. It seems that there still are some minor bugs in the rankfile mechanism. See the post http://www.open-mpi.org/community/lists/users/2009/08/10384.php Jody On Tue, Aug 18, 2009 at 10:53 PM, Nulik Nol wrote: > Hi, > i get this error when i use --rankfile, > "There are not enough slots available in the system to satisfy the 2 slots" > what could be the problem? I have tried using '*' for 'slot' param and > many other configs without any luck. Without --rankfile everything > works fine. I would appreciate any help. > > master waver # cat neat.hostfile > n64 max-slots=1 slots=1 > master max-slots=1 slots=1 > master waver # cat neat.rankfile > rank 0=n64 slot=0 > rank 1=master slot=0 > master waver # mpirun --rankfile neat.rankfile --hostfile > neat.hostfile -n 2 /tmp/neat > -- > There are not enough slots available in the system to satisfy the 2 slots > that were requested by the application: > /tmp/neat > > Either request fewer slots for your application, or make more slots available > for use. > > -- > -- > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to > launch so we are aborting. > > There may be more information reported by the environment (see above). > > This may be because the daemon was unable to find all the needed shared > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the > location of the shared libraries on the remote nodes and this will > automatically be forwarded to the remote nodes. > -- > -- > mpirun noticed that the job aborted, but has no info as to the process > that caused that situation. > -- > mpirun: clean termination accomplished > > master waver # mpirun --hostfile neat.hostfile -n 2 /tmp/neat > entering master main loop > recieved msg from 1 > unknown message 0 > ^Cmpirun: killing job... 
> > -- > mpirun noticed that process rank 1 with PID 13064 on node master > exited on signal 0 (Unknown signal 0). > -- > 2 total processes killed (some possibly by mpirun during cleanup) > mpirun: clean termination accomplished > > master waver # > > > -- > == > The power of zero is infinite > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users >
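Before suspecting the rankfile mapper, it can be worth confirming that the two files agree with each other. The sketch below is a plain consistency check, not a reimplementation of Open MPI's mapping logic: it verifies that every host named in the rankfile appears under exactly the same name in the hostfile, and that no host is given more ranks than its declared slots. The function names are invented, the file formats are inferred only from the examples in this thread, and a missing "slots=" entry is assumed to mean one slot.

```python
import re
from collections import Counter


def parse_hostfile(text):
    """Map host name -> declared slots (assumed 1 when "slots=" is absent)."""
    slots = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        host, *options = line.split()
        declared = 1
        for option in options:
            key, _, value = option.partition("=")
            if key == "slots":  # "max-slots" is deliberately ignored here
                declared = int(value)
        slots[host] = declared
    return slots


def parse_rankfile(text):
    """Map rank -> host for lines of the form "rank N=host slot=K"."""
    ranks = {}
    for line in text.splitlines():
        match = re.match(r"\s*rank\s+(\d+)\s*=\s*(\S+)\s+slot=", line)
        if match:
            ranks[int(match.group(1))] = match.group(2)
    return ranks


def check(hostfile_text, rankfile_text):
    """Return a list of human-readable problems (empty when none are found)."""
    slots = parse_hostfile(hostfile_text)
    ranks = parse_rankfile(rankfile_text)
    problems = []
    for rank, host in sorted(ranks.items()):
        if host not in slots:
            problems.append(f"rank {rank}: host {host} is not in the hostfile")
    for host, used in Counter(ranks.values()).items():
        if host in slots and used > slots[host]:
            problems.append(f"{host}: {used} ranks but only {slots[host]} slots")
    return problems


if __name__ == "__main__":
    hostfile = "n64 max-slots=1 slots=1\nmaster max-slots=1 slots=1\n"
    rankfile = "rank 0=n64 slot=0\nrank 1=master slot=0\n"
    print(check(hostfile, rankfile))
```

For the neat.hostfile and neat.rankfile shown in the quoted message the check finds nothing wrong, which fits the conclusion in the reply that the "not enough slots" error came from the rankfile handling of max-slots rather than from the files themselves.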
Re: [OMPI users] rank file error: Rankfile claimed...
Hi Lenny After removing the max-slots entries, i could do mpirun -np 4 -hostfile th_02 -rf rf_02 ./HelloMPI without any errors. But can you explain what the meaning of the max-slots entry is? I checked the FAQs http://www.open-mpi.org/faq/?category=running#simple-spmd-run http://www.open-mpi.org/faq/?category=running#mpirun-scheduling but i couldn't find any explanation. (furthermore, in the FAQ it says "max-slots" in one place, but "max_slots" in the other one) Thank You Jody On Mon, Aug 17, 2009 at 3:29 PM, Lenny Verkhovsky wrote: > can you try not specifying "max-slots" in the hostfile. > if you are the only user of the nodes, there will be no oversubscribing of > the processors. > This one definitely looks like a bug, > but as Ralph said there is a current discussion and work on this > component. > Lenny. > On Mon, Aug 17, 2009 at 2:37 PM, Ralph Castain wrote: >>> >>> Is there an explanation for this? >> >> I believe the word is "bug". :-) >> >> The rank_file mapper has been substantially revised lately - we are >> discussing now how much of that revision to bring to 1.3.4 versus the next >> major release. >> >> Ralph >> >> On Aug 17, 2009, at 4:45 AM, jody wrote: >> >>> Hi Lenny >>> >>>> I think it has something to do with your environment, /etc/hosts, IT >>>> setup, >>>> hostname function return value e.t.c >>>> I am not sure if it has something to do with Open MPI at all. >>> >>> OK. I just thought this was Open MPI related because i was able to use >>> the >>> aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in >>> the host file... >>> >>> However, I encountered a new problem: >>> if the rankfile lists all the entries which occur in the host file >>> there is an error message. 
>>> In the following example, the hostfile is >>> [jody@plankton neander]$ cat th_02 >>> nano_00.uzh.ch slots=2 max-slots=2 >>> nano_02.uzh.ch slots=2 max-slots=2 >>> >>> and the rankfile is: >>> [jody@plankton neander]$ cat rf_02 >>> rank 0=nano_00.uzh.ch slot=0 >>> rank 2=nano_00.uzh.ch slot=1 >>> rank 1=nano_02.uzh.ch slot=0 >>> rank 3=nano_02.uzh.ch slot=1 >>> >>> Here is the error: >>> [jody@plankton neander]$ mpirun -np 4 -hostfile th_02 -rf rf_02 >>> ./HelloMPI >>> >>> -- >>> There are not enough slots available in the system to satisfy the 4 slots >>> that were requested by the application: >>> ./HelloMPI >>> >>> Either request fewer slots for your application, or make more slots >>> available >>> for use. >>> >>> >>> -- >>> >>> -- >>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to >>> launch so we are aborting. >>> >>> There may be more information reported by the environment (see above). >>> >>> This may be because the daemon was unable to find all the needed shared >>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have >>> the >>> location of the shared libraries on the remote nodes and this will >>> automatically be forwarded to the remote nodes. >>> >>> -- >>> >>> -- >>> mpirun noticed that the job aborted, but has no info as to the process >>> that caused that situation. >>> >>> -- >>> mpirun: clean termination accomplished >>> >>> If i use a hostfile with one more entry >>> [jody@aim-plankton neander]$ cat th_021 >>> aim-nano_00.uzh.ch slots=2 max-slots=2 >>> aim-nano_02.uzh.ch slots=2 max-slots=2 >>> aim-nano_01.uzh.ch slots=1 max-slots=1 >>> >>> Then this works fine: >>> [jody@aim-plankton neander]$ mpirun -np 4 -hostfile th_021 -rf rf_02 >>> ./HelloMPI >>> >>> Is there an explanation for this? >>> >>> Thank You >>> Jody >>> >>>> Lenny. >>>> On Mon, Aug 17, 2009 at 12:59 PM, jody wrote: >>
Re: [OMPI users] rank file error: Rankfile claimed...
Hi Lenny

> I think it has something to do with your environment, /etc/hosts, IT setup,
> hostname function return value e.t.c
> I am not sure if it has something to do with Open MPI at all.

OK. I just thought this was Open MPI related because i was able to use the
aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in the host file...

However, I encountered a new problem:
if the rankfile lists all the entries which occur in the host file
there is an error message.
In the following example, the hostfile is
[jody@plankton neander]$ cat th_02
nano_00.uzh.ch slots=2 max-slots=2
nano_02.uzh.ch slots=2 max-slots=2

and the rankfile is:
[jody@plankton neander]$ cat rf_02
rank 0=nano_00.uzh.ch slot=0
rank 2=nano_00.uzh.ch slot=1
rank 1=nano_02.uzh.ch slot=0
rank 3=nano_02.uzh.ch slot=1

Here is the error:
[jody@plankton neander]$ mpirun -np 4 -hostfile th_02 -rf rf_02 ./HelloMPI
--
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
./HelloMPI

Either request fewer slots for your application, or make more slots available
for use.
--
--
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished

If i use a hostfile with one more entry
[jody@aim-plankton neander]$ cat th_021
aim-nano_00.uzh.ch slots=2 max-slots=2
aim-nano_02.uzh.ch slots=2 max-slots=2
aim-nano_01.uzh.ch slots=1 max-slots=1

Then this works fine:
[jody@aim-plankton neander]$ mpirun -np 4 -hostfile th_021 -rf rf_02 ./HelloMPI

Is there an explanation for this?

Thank You
Jody

> Lenny.
> On Mon, Aug 17, 2009 at 12:59 PM, jody wrote:
>>
>> Hi Lenny
>>
>> Thanks - using the full names makes it work!
>> Is there a reason why the rankfile option treats
>> host names differently than the hostfile option?
>>
>> Thanks
>> Jody
>>
>>
>>
>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny
>> Verkhovsky wrote:
>> > Hi
>> > This message means
>> > that you are trying to use host "plankton", that was not allocated via
>> > hostfile or hostlist.
>> > But according to the files and command line, everything seems fine.
>> > Can you try using "plankton.uzh.ch" hostname instead of "plankton".
>> > thanks
>> > Lenny.
>> > On Mon, Aug 17, 2009 at 10:36 AM, jody wrote:
>> >>
>> >> Hi
>> >>
>> >> When i use a rankfile, i get an error message which i don't understand:
>> >>
>> >> [jody@plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts
>> >> ./HelloMPI
>> >>
>> >> --
>> >> Rankfile claimed host plankton that was not allocated or
>> >> oversubscribed it's slots:
>> >>
>> >>
>> >> --
>> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >> file rmaps_rank_file.c at line 108
>> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >> file base/rmaps_base_map_job.c at line 87
>> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >> file base/plm_base_launch_support.c at line 77
>> >> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >> file plm_rsh_module.c at line 990
>> >>
>> >> --
>> >> A daemon (pid unknown) died unexpectedly on signal 1 while attempting
>> >> to
>> >> launch so we are aborting.
>> >>
>> >> There may be more information reported by the envir