Re: [OMPI users] users Digest, Vol 1052, Issue 10
Sent by: users-bounces@open-mpi.org
To: Open MPI Users <us...@open-mpi.org>
Date: 10/31/2008 03:38
Subject: [OMPI users] problem running Open MPI on Cells
Please respond to: Open MPI Users <users@open-mpi.org>

[Hahn Kim's posting "problem running Open MPI on Cells" quoted in full; trimmed here -- the original appears under its own subject at the end of this digest.]

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

End of users Digest, Vol 1052, Issue 10
Re: [OMPI users] problem running Open MPI on Cells
Where did you put the environment variables related to the MCF license file and the MCF shared libraries? What is your default shell?

Does your test indicate the following? Suppose you have 4 nodes; on node 1,

    mpirun -np 4 --host node1,node2,node3,node4 hostname

works, but

    mpirun -np 4 --host node1,node2,node3,node4 foocbe

does not, where foocbe is an executable generated with MCF.

Is it possible that the MCF license is limited to a small number of concurrent uses? E.g., if the license is limited to 4 concurrent uses, an MPI application will fail on 8 nodes.

Regards,
Mi

On 10/31/2008 03:38, Hahn Kim wrote:
> [original message quoted in full; trimmed here -- the original posting appears later in this digest.]
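Mi's first two questions (where the MCF variables are set, and what the default shell is) matter because the remote shells that mpirun's ssh launcher uses are non-interactive and skip some startup files. A quick check to run on each CAB -- a sketch only: the "MCF" pattern and the startup-file list are guesses; adjust them to your shell and to whatever variables MCF's documentation actually names:

```shell
#!/bin/sh
# Report the default shell and which common startup files mention MCF.
# Non-interactive remote shells (as used by ssh-based launchers) read
# different files than login shells, so a license variable set only in
# ~/.bash_profile could be missing when the job is started remotely.
echo "default shell: $SHELL"
grep -l 'MCF' "$HOME/.bashrc" "$HOME/.bash_profile" "$HOME/.profile" 2>/dev/null \
  || echo "no MCF settings found in common startup files"
```

Running this both interactively and via "ssh cab1 sh checkshell.sh" (script name illustrative) would show whether the license settings survive a non-interactive login.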
Re: [OMPI users] Problem with openmpi version 1.3b1 beta1
I see you are using IPv6. From what I can tell, we do enable that support by default if the underlying system supports it. My best guess is that either that support is broken (we never test it, since none of us use IPv6), or our configure system isn't properly detecting that it exists. Can you attach a copy of your config.log? It will tell us what the system thinks it should be building.

Thanks
Ralph

On Oct 31, 2008, at 4:54 PM, Allan Menezes wrote:
> [Allan's message quoted in full; trimmed here -- it appears under "[OMPI users] Problem with openmpi version 1.3b1 beta1" below.]
[OMPI users] Problem with openmpi version 1.3b1 beta1
List-Post: users@lists.open-mpi.org

Date: Fri, 31 Oct 2008 09:34:52 -0600
From: Ralph Castain
Subject: Re: [OMPI users] users Digest, Vol 1052, Issue 1
To: Open MPI Users
Message-ID: <0cf28492-b13e-4f82-ac43-c1580f079...@lanl.gov>

It looks like the daemon isn't seeing the other interface address on host x2. Can you ssh to x2 and send the contents of ifconfig -a?

Ralph

On Oct 31, 2008, at 9:18 AM, Allan Menezes wrote:

> users-requ...@open-mpi.org wrote:
>
> Today's Topics:
>
>    1. Openmpi ver1.3beta1 (Allan Menezes)
>    2. Re: Openmpi ver1.3beta1 (Ralph Castain)
>    3. Re: Equivalent .h files (Benjamin Lamptey)
>    4. Re: Equivalent .h files (Jeff Squyres)
>    5. ompi-checkpoint is hanging (Matthias Hovestadt)
>    6. unsubscibe (Bertrand P. S. Russell)
>    7. Re: ompi-checkpoint is hanging (Tim Mattox)
>
> Message: 1
> Date: Fri, 31 Oct 2008 02:06:09 -0400
> From: Allan Menezes
> Subject: [OMPI users] Openmpi ver1.3beta1
> To: us...@open-mpi.org
>
> Hi,
>
> I built Open MPI version 1.3b1 with the following configure command:
>
>     ./configure --prefix=/opt/openmpi13b1 --enable-mpi-threads --with-threads=posix --disable-ipv6
>
> I have six nodes, x1..x6, and I distributed /opt/openmpi13b1 to all the other nodes from the head node with scp. When I run the following command:
>
>     mpirun --prefix /opt/openmpi13b1 --host x1 hostname
>
> it works on x1, printing out the hostname of x1. But when I type
>
>     mpirun --prefix /opt/openmpi13b1 --host x2 hostname
>
> it hangs and does not give me any output. I have a 6-node Intel quad-core cluster with OSCAR and PCI Express gigabit Ethernet for eth0. Can somebody advise? Thank you very much.
>
> Allan Menezes
>
> Message: 2
> Date: Fri, 31 Oct 2008 02:41:59 -0600
> From: Ralph Castain
> Subject: Re: [OMPI users] Openmpi ver1.3beta1
> To: Open MPI Users
>
> When you typed the --host x1 command, were you sitting on x1? Likewise, when you typed the --host x2 command, were you not on host x2?
>
> If the answer to both questions is "yes", then my guess is that something is preventing you from launching a daemon on host x2. Try adding --leave-session-attached to your cmd line and see if any error messages appear. And check the FAQ for tips on how to set up for ssh launch (I'm assuming that is what you are using): http://www.open-mpi.org/faq/?category=rsh
>
> Ralph
>
> On Oct 31, 2008, at 12:06 AM, Allan Menezes wrote:
>
> Hi Ralph,
>
> Yes, that is true. I tried both commands on x1, and version 1.2.8 works on the same setup without a problem. Here is the output with the added --leave-session-attached:
>
> [allan@x1 ~]$ mpiexec --prefix /opt/openmpi13b2 --leave-session-attached -host x2 hostname
> [x2.brampton.net:02236] [[1354,0],1]-[[1354,0],0] mca_oob_tcp_peer_try_connect: connect to 192.168.0.198:0 failed: Network is unreachable (101)
> [x2.brampton.net:02236] [[1354,0],1]-[[1354,0],0] mca_oob_tcp_peer_try_connect: connect to 192.168.122.1:0 failed: Network is unreachable (101)
> [x2.brampton.net:02236] [[1354,0],1] routed:binomial: Connection to lifeline [[1354,0],0] lost
> --------------------------------------------------------------------------
> A daemon (pid 7665) died unexpectedly with status 1 while attempting to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> mpiexec noticed that the job aborted, but has no info as to the process that caused that situation.
> --------------------------------------------------------------------------
> mpiexec: clean termination accomplished
> [allan@x1 ~]$
>
> However, my main eth0 IP is 192.168.1.1 and my internet gateway is 192.168.0.1. Any solutions?
>
> Allan Menezes
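For the unreachable-callback errors above, one thing worth trying (a sketch, not a verified fix): 192.168.122.1 is typically a libvirt virtual bridge, so restricting Open MPI's out-of-band wire-up and TCP BTL to the real cluster NIC may let the daemon on x2 reach its lifeline over the 192.168.1.x network. The MCA parameter names below are assumptions for the 1.3 series (earlier releases spelled the OOB one oob_tcp_include); confirm them with ompi_info before relying on this.

```shell
# Keep ORTE's out-of-band channel and the TCP BTL off virtual interfaces
# so the remote daemon calls back on eth0 (adjust prefix/host as needed).
mpirun --mca oob_tcp_if_include eth0 \
       --mca btl_tcp_if_include eth0 \
       --prefix /opt/openmpi13b1 --host x2 hostname
```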
Re: [OMPI users] Working with a CellBlade cluster
OK, thanks to Mi and Jeff for their useful replies anyway.

Gilbert.

On Fri, 31 Oct 2008, Jeff Squyres wrote:
> [Jeff's reply, together with the earlier exchange between Gilbert, Mi and Lenny, quoted in full; trimmed here -- the thread appears under "Re: [OMPI users] Working with a CellBlade cluster" below.]
Re: [OMPI users] problem running Open MPI on Cells
Hi,

To monitor the environment from inside the application, it could be useful to issue a system("printenv") call at the very beginning of the main program, even before (and also after, btw) the MPI_Init call, when running in serial-job mode with a single CAB using mpirun.

HTH,

Gilbert.

On Fri, 31 Oct 2008, Hahn Kim wrote:
> [original message quoted in full; trimmed here -- the original posting appears at the end of this digest.]

--
Gilbert Grosdidier               gilbert.grosdid...@in2p3.fr
LAL / IN2P3 / CNRS               Phone : +33 1 6446 8909
Faculté des Sciences, Bat. 200   Fax   : +33 1 6446 8546
B.P. 34, F-91898 Orsay Cedex (FRANCE)
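Gilbert's system("printenv") idea can also be driven from outside the application: launch a tiny script through mpirun before the real program and filter for license-related entries. A sketch -- the "MCF"/"LICENSE" name patterns are guesses; substitute whatever variables MCF's license lookup actually reads:

```shell
#!/bin/sh
# Print environment entries that look license-related; emit a marker line
# when none are present, so empty environments are visible in mpirun output.
printenv | sort | grep -i -e 'MCF' -e 'LICENSE' \
  || echo "no MCF/LICENSE-related variables set"
```

Running it both ways -- "ssh cab1 ./checkenv.sh" versus "mpirun -np 1 --host cab1 ./checkenv.sh" (hostname and script name illustrative) -- and diffing the two outputs would show whether the mpirun launch path is dropping anything MCF needs.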
Re: [OMPI users] Fwd: Problems installing in Cygwin
On Oct 31, 2008, at 3:20 PM, Gustavo Seabra wrote:

>> As Jeff mentioned, this component is not required on Windows. You can disable it completely in Open MPI and everything will continue to work correctly. Please add --enable-mca-no-build=memory_mallopt, or maybe the more generic --enable-mca-no-build=memory (as there is no need for any memory manager on Windows).
>
> Tried, doesn't quite work. If I configure with "--enable-mca-no-build=memory", the config dies with:
>
> *** Final output
> configure: error: conditional "OMPI_WANT_EXTERNAL_PTMALLOC2" was never defined.
> Usually this means the macro was only invoked conditionally.

Ew, yoinks. That's definitely a bug -- looks like we used an AM_CONDITIONAL inside the main configure.m4 for ptmalloc2; whoops (it needs to be inside MCA_memory_ptmalloc2_POST_CONFIG, not MCA_memory_ptmalloc2_CONFIG). You're building up quite the bug list -- thanks for your patience! It's unfortunately probably not that surprising, though, since we don't test on Cygwin at all... :-\

> Now, if I try with "--enable-mca-no-build=memory_mallopt", the configuration script runs just fine, but the compilation dies when compiling "mca/paffinity/windows":
>
> libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT paffinity_windows_module.lo -MD -MP -MF .deps/paffinity_windows_module.Tpo -c paffinity_windows_module.c -DDLL_EXPORT -DPIC -o .libs/paffinity_windows_module.o
> paffinity_windows_module.c:44: error: parse error before "sys_info"
> [... and then a bunch of messages after that, all related to paffinity_windows_module.c, which I think are all related to this first one...]

I do the build system stuff in OMPI, but this part is all George / the Windows guys... Perhaps this is a difference between compiling on "normal" Windows and on Cygwin...?

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Fwd: Problems installing in Cygwin
Gustavo,

I guess that if you disable the vt contrib package, this will take you one step further :) Hopefully to the end of the compile stage ... and to the beginning of troubles with running the Cygwin parallel applications ...

Meanwhile, there is a special option to disable contrib packages. Please add --enable-contrib-no-build=vt to your configure line and this should do the trick.

george.

On Oct 31, 2008, at 3:20 PM, Gustavo Seabra wrote:

> On Thu, Oct 30, 2008 at 9:04 AM, George Bosilca wrote:
>
> Hi George,
>
> I'm sorry for taking so long to respond. As you mentioned, config takes a veeery long time in Cygwin, and then the install itself takes many times that :-(
>
> [the --enable-mca-no-build=memory and --enable-mca-no-build=memory_mallopt failures quoted in full; trimmed here -- see Jeff Squyres' reply above for the details.]
>
> Finally, I thought that I could live without processor affinity or even memory affinity, so I tried "--enable-mca-no-build=memory_mallopt,maffinity,paffinity", and the configuration went all smoothly. The compilation... you guessed it, died again. But this time it was something that had bit me before: RTLD_NEXT, which is required by one contributed package (vt). (See my previous message to Jeff and the list.) My next attempt will be to remove this package and see how far I can get... But I'm getting there :-)
>
>> It is possible to have a native version of Open MPI on Windows. There are two ways to achieve this. First, install SFU and compile there. It worked last time I checked, but it's not the solution I prefer. Second, you can install the express version of Microsoft Visual Studio (which is free), set your PATH, LIB and INCLUDE correctly to point to the installation, and then use the cl compiler to build Open MPI even on Windows.
>
> That is true, but it seems more complicated for the regular user than installing Open MPI (assuming I can figure out the correct combination of options). Also, our program is actually made for Unix; as a convenience it *can* be installed in Cygwin, but I'm not sure how it would work with a native Windows Open MPI. Anyways... I feel like I'm getting closer. Will keep trying during the weekend. Thanks a lot for all the help! (That goes to Jeff too.)
>
> Cheers,
> Gustavo.
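Pulling together the flags discussed in this thread, the Cygwin configure line being converged on would look something like the following. This is a sketch of the options named above only, not a verified recipe (the prefix is illustrative, and per the messages above the vt and paffinity issues were still being worked through):

```shell
# Skip the components known to break under Cygwin: the mallopt/affinity
# MCA components and the VampirTrace (vt) contrib package.
./configure --prefix=$HOME/local/openmpi-1.3b1 \
    --enable-mca-no-build=memory_mallopt,maffinity,paffinity \
    --enable-contrib-no-build=vt
```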
Re: [OMPI users] Fwd: Problems installing in Cygwin
Ok, I'll CC the VT guys on the ticket and let them know. They'll likely slurp in whatever fix we do for OMPI into VT.

FWIW: you can disable the VT package with: --enable-contrib-no-build=vt

On Oct 31, 2008, at 3:02 PM, Gustavo Seabra wrote:

> As I keep trying to install OpenMPI in Cygwin, I found another instance where RTLD_NEXT is assumed to be present. Will keep trying...
>
> Gustavo.
>
> =====
> Making all in vtlib
> make[5]: Entering directory `/home/seabra/local/openmpi-1.3b1/ompi/contrib/vt/vt/vtlib'
> gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib -D_REENTRANT -DBINDIR=\"/home/seabra/local/openmpi-1.3b1/bin\" -DDATADIR=\"/home/seabra/local/openmpi-1.3b1/share/vampirtrace\" -DRFG -DVT_BFD -DVT_IOWRAP -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT vt_comp_gnu.o -MD -MP -MF .deps/vt_comp_gnu.Tpo -c -o vt_comp_gnu.o vt_comp_gnu.c
> mv -f .deps/vt_comp_gnu.Tpo .deps/vt_comp_gnu.Po
> gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib -D_REENTRANT -DBINDIR=\"/home/seabra/local/openmpi-1.3b1/bin\" -DDATADIR=\"/home/seabra/local/openmpi-1.3b1/share/vampirtrace\" -DRFG -DVT_BFD -DVT_IOWRAP -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT vt_iowrap.o -MD -MP -MF .deps/vt_iowrap.Tpo -c -o vt_iowrap.o vt_iowrap.c
> vt_iowrap.c: In function `vt_iowrap_init':
> vt_iowrap.c:105: error: `RTLD_NEXT' undeclared (first use in this function)
> vt_iowrap.c:105: error: (Each undeclared identifier is reported only once
> vt_iowrap.c:105: error: for each function it appears in.)
> vt_iowrap.c: In function `open':
> vt_iowrap.c:188: error: `RTLD_NEXT' undeclared (first use in this function)
> [...and a bunch of messages just like those last 2 lines...]

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Working with a CellBlade cluster
AFAIK, there are no parameters available to monitor IB message passing. The majority of it is processed in hardware, and Linux is unaware of it. We have not added any extra instrumentation into the openib BTL to provide auditing information, because, among other reasons, that is the performance-critical code path and we didn't want to add any latency in there. The best you may be able to do is with a PMPI-based library to audit MPI function call invocations. On Oct 31, 2008, at 4:07 PM, Mi Yan wrote: Gilbert, I did not know the MCA parameters that can monitor the message passing. I have tried a few MCA verbose parameters and did not identify anyone helpful. One way to check if the message goes via IB or SM maybe to check the counters in /sys/class/infiniband. Regards, Mi Gilbert GrosdidierGilbert Grosdidier Sent by: users-boun...@open-mpi.org 10/29/2008 12:36 PM Please respond to Open MPI Users To Open MPI Users cc Subject Re: [OMPI users] Working with a CellBlade cluster Thank you very much Mi and Lenny for your detailed replies. I believe I can summarize the infos to allow for 'Working with a QS22 CellBlade cluster' like this: - Yes, messages are efficiently handled with "-mca btl openib,sm,self" - Better to go to the OMPI-1.3 version ASAP - It is currently more efficient/easy to use numactl to control processor affinity on a QS22. So far so good. One question remains: how could I monitor in details message passing thru IB (on one side) and thru SM (on the other side) thru the use of mca parameters, please ? Additionnal info about the verbosity level of this monitoring will be highly appreciated ... A lengthy travel inside the list of such parameters provided by ompi_info did not enlighten me (there are so many xxx_sm_yyy type params that I don't know which could be the right one ;-) Thanks in advance for your hints, Best Regards, Gilbert. On Thu, 23 Oct 2008, Mi Yan wrote: > > 1. 
MCA BTL parameters > With "-mca btl openib,self", both message between two Cell processors on > one QS22 and messages between two QS22s go through IB. > > With "-mca btl openib,sm,slef", message on one QS22 go through shared > memory, message between QS22 go through IB, > > Depending on the message size and other MCA parameters, it does not > guarantee message passing on shared memory is faster than on IB. E.g. > the bandwidth for 64KB message is 959MB/s on shared-memory and is 694MB/s > on IB; the bandwidth for 4MB message is 539 MB/s and 1092 MB/s on IB. > The bandwidth of 4MB message on shared memory may be higher if you tune > some MCA parameter. > > 2. mpi_paffinity_alone > "mpi_paffinity_alone =1" is not a good choice for QS22. There are two > sockets with two physical Cell/B.E. on one QS22. Each Cell/B.E. has two > SMT threads. So there are four logical CPUs on one QS22. CBE Linux > kernel maps logical cpu 0 and 1 to socket1 and maps logical cpu 1 and 2 to > socket 2.If mpi_paffinity_alone is set to 1, the two MPI instances > will be assigned to logical cpu 0 and cpu 1 on socket 1. I believe this is > not what you want. > > A temporaily solution to force the affinity on QS22 is to use > "numactl", E.g. assuming the hostname is "qs22" and the executable is > "foo". the following command can be used > mpirun -np 1 -H qs22 numactl -c0 -m0 foo : -np 1 -H qs22 > numactl -c1 -m1 foo > >In the long run, I wish CBE kernel export CPU topology in / sys and > use PLPA to force the processor affinity. > > Best Regards, > Mi > > > > > "Lenny > Verkhovsky" > > @gmail.com> "Open MPI Users" > Sent by: > users- bounces@ope cc > n-mpi.org > Subject >Re: [OMPI users] Working with a > 10/23/2008 05:48 CellBlade cluster > AM > > > Please respond to > Open MPI Users > rg> > > > > > > > Hi, > > > If I understand you correctly the most suitable way to do it is by > paffinity that we have in Open MPI 1.3 and the trank. 
> However, usually the OS distributes processes evenly between sockets by itself.
>
> There is still no formal FAQ, due to multiple reasons, but you can read how to use it in the attached draft (there were a few name changes of the params, so check with ompi_info).
>
> Shared memory is used between processes that share the same machine, and openib is used between different machines (hostnames); no special MCA params are needed.
>
> Best Regards
[OMPI users] problem running Open MPI on Cells
Hello, I'm having problems using Open MPI on a cluster of Mercury Computer's Cell Accelerator Boards (CABs). We have an MPI application that runs on multiple CABs. The application uses Mercury's MultiCore Framework (MCF) to use the Cell's SPEs.

Here's the basic problem. I can log into each CAB and run the application serially directly from the command line (i.e. without using mpirun) without a problem. I can also launch a serial job onto each CAB from another machine using mpirun without a problem. The problem occurs when I try to launch onto multiple CABs using mpirun.

MCF requires a license file. After the application initializes MPI, it tries to initialize MCF on each node. The initialization routine loads the MCF license file and checks for valid license keys. If the keys are valid, it continues to initialize MCF. If not, it throws an error. When I run on multiple CABs, most of the time several of the CABs throw an error saying MCF cannot find a valid license key. The strange thing is that this behavior doesn't appear when I launch serial jobs using MCF, only on multiple CABs. Additionally, the errors are inconsistent. Not all the CABs throw an error: sometimes a few of them error out, sometimes all of them, sometimes none.

I've talked with the Mercury folks and they're just as stumped as I am. The only thing we can think of is that Open MPI is somehow modifying the environment and interfering with MCF, but we can't think of any reason why. Any ideas out there? Thanks.

Hahn
-- Hahn Kim, h...@ll.mit.edu MIT Lincoln Laboratory 244 Wood St., Lexington, MA 02420 Tel: 781-981-0940, Fax: 781-981-5255
Re: [OMPI users] Working with a CellBlade cluster
Gilbert, I do not know of any MCA parameters that can monitor the message passing. I have tried a few MCA verbose parameters and did not find any of them helpful. One way to check whether a message goes via IB or SM may be to check the counters in /sys/class/infiniband. Regards, Mi

On 10/29/2008 12:36 PM, Gilbert Grosdidier wrote:

Thank you very much Mi and Lenny for your detailed replies. I believe I can summarize the info for 'Working with a QS22 CellBlade cluster' like this:
- Yes, messages are efficiently handled with "-mca btl openib,sm,self"
- Better to go to the OMPI 1.3 version ASAP
- It is currently more efficient/easy to use numactl to control processor affinity on a QS22.
So far so good. One question remains: how could I monitor in detail message passing through IB (on one side) and through SM (on the other side) by use of MCA parameters, please? Additional info about the verbosity level of this monitoring would be highly appreciated. A lengthy trip through the list of such parameters provided by ompi_info did not enlighten me (there are so many xxx_sm_yyy type params that I don't know which could be the right one ;-) Thanks in advance for your hints, Best Regards, Gilbert.

On Thu, 23 Oct 2008, Mi Yan wrote:
> 1. MCA BTL parameters
> With "-mca btl openib,self", both messages between two Cell processors on one QS22 and messages between two QS22s go through IB.
>
> With "-mca btl openib,sm,self", messages on one QS22 go through shared memory, and messages between QS22s go through IB.
>
> Depending on the message size and other MCA parameters, it is not guaranteed that message passing over shared memory is faster than over IB. E.g. the bandwidth for a 64KB message is 959 MB/s on shared memory and 694 MB/s on IB; the bandwidth for a 4MB message is 539 MB/s on shared memory and 1092 MB/s on IB.
> The bandwidth of a 4MB message on shared memory may be higher if you tune some MCA parameters.
>
> 2. mpi_paffinity_alone
> "mpi_paffinity_alone = 1" is not a good choice for QS22. There are two sockets with two physical Cell/B.E. on one QS22. Each Cell/B.E. has two SMT threads, so there are four logical CPUs on one QS22. The CBE Linux kernel maps logical cpu 0 and 1 to socket 1 and logical cpu 2 and 3 to socket 2. If mpi_paffinity_alone is set to 1, the two MPI instances will be assigned to logical cpu 0 and cpu 1 on socket 1. I believe this is not what you want.
>
> A temporary solution to force the affinity on QS22 is to use "numactl". E.g. assuming the hostname is "qs22" and the executable is "foo", the following command can be used:
> mpirun -np 1 -H qs22 numactl -c0 -m0 foo : -np 1 -H qs22 numactl -c1 -m1 foo
>
> In the long run, I wish the CBE kernel exported the CPU topology in /sys so PLPA could be used to force the processor affinity.
>
> Best Regards,
> Mi

On 10/23/2008 05:48 AM, Lenny Verkhovsky wrote:

> Hi,
> If I understand you correctly, the most suitable way to do it is by the paffinity that we have in Open MPI 1.3 and the trunk.
Re: [OMPI users] MPI + Mixed language coding(Fortran90 + C++)
On Fri, Oct 31, 2008 at 3:07 PM, Rajesh Ramaya wrote:
> Actually I am not writing any MPI code inside; it's the executable (third-party software) that does that part.

What are you using for this? We too use Fortran and C routines combined with no problem at all. I would think that whatever "third-party" software you are using here is not doing its job right.
-- Gustavo Seabra, Postdoctoral Associate, Quantum Theory Project - University of Florida, Gainesville - Florida - USA
Re: [OMPI users] Fwd: Problems installing in Cygwin
On Thu, Oct 30, 2008 at 9:04 AM, George Bosilca wrote:

Hi George, I'm sorry for taking so long to respond. As you mentioned, configure takes a veeery long time in Cygwin, and then the install itself takes many times that :-(

> As Jeff mentioned this component is not required on Windows. You can disable it completely in Open MPI and everything will continue to work correctly. Please add --enable-mca-no-build=memory_mallopt or maybe the more generic --enable-mca-no-build=memory (as there is no need for any memory manager on Windows).

Tried, doesn't quite work: If I configure with "--enable-mca-no-build=memory", the config dies with:

*** Final output configure: error: conditional "OMPI_WANT_EXTERNAL_PTMALLOC2" was never defined. Usually this means the macro was only invoked conditionally.

Now, if I try with "--enable-mca-no-build=memory_mallopt", the configuration script runs just fine, but the compilation dies when compiling "mca/paffinity/windows":

libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT paffinity_windows_module.lo -MD -MP -MF .deps/paffinity_windows_module.Tpo -c paffinity_windows_module.c -DDLL_EXPORT -DPIC -o .libs/paffinity_windows_module.o
paffinity_windows_module.c:44: error: parse error before "sys_info"
[... and then a bunch of messages after that, all related to paffinity_windows_module.c, which...]
[... I think are all related to this first one...]

Finally, I thought that I could live without processor affinity or even memory affinity, so I tried using "--enable-mca-no-build=memory_mallopt,maffinity,paffinity", and the configuration went all smoothly. The compilation... You guessed it, died again. But this time it was something that had bitten me before: RTLD_NEXT, which is required by one contributed package (vt).
(See my previous message to Jeff and the list.) My next attempt will be to remove this package and see how far I can get... But I'm getting there :-)

> It is possible to have a native version of Open MPI on Windows. There are two ways to achieve this. First, install SFU, and compile there. It worked last time I checked, but it's not the solution I prefer. Second, you can install the express version of Microsoft Visual Studio (which is free), set your PATH, LIB and INCLUDE correctly to point to the installation, and then use the cl compiler to build Open MPI even on Windows.

That is true, but it seems more complicated for the regular user than installing Open MPI in Cygwin (assuming I can figure out the correct combination of options). Also, our program is actually made for Unix, and as a convenience it *can* be installed in Cygwin, but I'm not sure how it would work with a native Windows Open MPI. Anyways... I feel like I'm getting closer. Will keep trying during the weekend. Thanks a lot for all the help! (That goes to Jeff too.) Cheers, Gustavo.
Re: [OMPI users] MPI + Mixed language coding(Fortran90 + C++)
Hello Jeff Squyres, Thank you very much for the immediate reply. I am able to successfully access the data from the common block, but the values are zero. In my algorithm I even update a common block, but the update made by the shared library is not taken into account by the executable. Can you please be very specific about how to make the parallel algorithm aware of the data? Actually I am not writing any MPI code inside; it's the executable (third-party software) that does that part. All that I am doing is compiling my code with the MPI C compiler and adding it to the LD_LIBRARY_PATH. In fact I did a simple test by creating a shared library using a FORTRAN code, and the update made to the common block is taken into account by the executable. Is there any flag or pragma that needs to be activated for mixed-language MPI? Thank you once again for the reply. Rajesh

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: Friday, 31 October 2008 18:53
To: Open MPI Users
Subject: Re: [OMPI users] MPI + Mixed language coding(Fortran90 + C++)

On Oct 31, 2008, at 11:57 AM, Rajesh Ramaya wrote:
> I am completely new to MPI. I have a basic question concerning MPI and mixed language coding. I hope any of you could help me out. Is it possible to access FORTRAN common blocks in C++ in an MPI-compiled code? It works without MPI, but as soon as I switch to MPI the access of the common block does not work anymore.
> I have a Linux MPI executable which loads a shared library at runtime and resolves all undefined symbols etc. The shared library is written in C++ and the MPI executable is written in FORTRAN. Some of the input that the shared library is looking for is in the Fortran common blocks. As I access those common blocks during runtime the values are not initialized. I would like to know if what I am doing is possible? I hope that my problem is clear.
Generally, MPI should not get in the way of sharing common blocks between Fortran and C/C++. Indeed, in Open MPI itself, we share a few common blocks between Fortran and the main C Open MPI implementation. What is the exact symptom that you are seeing? Is the application failing to resolve symbols at run-time, possibly indicating that something hasn't instantiated a common block? Or are you able to successfully access the data from the common block, but it doesn't have the values you expect (e.g., perhaps you're seeing all zeros)? If the former, you might want to check your build procedure. You *should* be able to simply replace your C++ / F90 compilers with mpicxx and mpif90, respectively, and be able to build an MPI version of your app. If the latter, you might need to make your parallel algorithm aware of what data is available in which MPI process -- perhaps not all the data is filled in on each MPI process...? -- Jeff Squyres Cisco Systems ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Fwd: Problems installing in Cygwin
As I keep trying to install Open MPI in Cygwin, I found another instance where RTLD_NEXT is assumed to be present. Will keep trying... Gustavo.
=
Making all in vtlib
make[5]: Entering directory `/home/seabra/local/openmpi-1.3b1/ompi/contrib/vt/vt/vtlib'
gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib -I../extlib/otf/otflib -D_REENTRANT -DBINDIR=\"/home/seabra/local/openmpi-1.3b1/bin\" -DDATADIR=\"/home/seabra/local/openmpi-1.3b1/share/vampirtrace\" -DRFG -DVT_BFD -DVT_IOWRAP -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT vt_comp_gnu.o -MD -MP -MF .deps/vt_comp_gnu.Tpo -c -o vt_comp_gnu.o vt_comp_gnu.c
mv -f .deps/vt_comp_gnu.Tpo .deps/vt_comp_gnu.Po
gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib -I../extlib/otf/otflib -D_REENTRANT -DBINDIR=\"/home/seabra/local/openmpi-1.3b1/bin\" -DDATADIR=\"/home/seabra/local/openmpi-1.3b1/share/vampirtrace\" -DRFG -DVT_BFD -DVT_IOWRAP -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT vt_iowrap.o -MD -MP -MF .deps/vt_iowrap.Tpo -c -o vt_iowrap.o vt_iowrap.c
vt_iowrap.c: In function `vt_iowrap_init':
vt_iowrap.c:105: error: `RTLD_NEXT' undeclared (first use in this function)
vt_iowrap.c:105: error: (Each undeclared identifier is reported only once
vt_iowrap.c:105: error: for each function it appears in.)
vt_iowrap.c: In function `open':
vt_iowrap.c:188: error: `RTLD_NEXT' undeclared (first use in this function)
[...and a bunch of messages just like those last 2 lines...]
Re: [OMPI users] MPI + Mixed language coding(Fortran90 + C++)
On Oct 31, 2008, at 11:57 AM, Rajesh Ramaya wrote:

I am completely new to MPI. I have a basic question concerning MPI and mixed language coding. I hope any of you could help me out. Is it possible to access FORTRAN common blocks in C++ in an MPI-compiled code? It works without MPI, but as soon as I switch to MPI the access of the common block does not work anymore. I have a Linux MPI executable which loads a shared library at runtime and resolves all undefined symbols etc. The shared library is written in C++ and the MPI executable is written in FORTRAN. Some of the input that the shared library is looking for is in the Fortran common blocks. As I access those common blocks during runtime the values are not initialized. I would like to know if what I am doing is possible? I hope that my problem is clear.

Generally, MPI should not get in the way of sharing common blocks between Fortran and C/C++. Indeed, in Open MPI itself, we share a few common blocks between Fortran and the main C Open MPI implementation. What is the exact symptom that you are seeing? Is the application failing to resolve symbols at run-time, possibly indicating that something hasn't instantiated a common block? Or are you able to successfully access the data from the common block, but it doesn't have the values you expect (e.g., perhaps you're seeing all zeros)? If the former, you might want to check your build procedure. You *should* be able to simply replace your C++ / F90 compilers with mpicxx and mpif90, respectively, and be able to build an MPI version of your app. If the latter, you might need to make your parallel algorithm aware of what data is available in which MPI process -- perhaps not all the data is filled in on each MPI process...?
-- Jeff Squyres, Cisco Systems
Re: [OMPI users] ompi-checkpoint is hanging
After some additional testing I believe that I have been able to reproduce the problem. I suspect that there is a bug in the coordination protocol that is causing an occasional hang in the system. Since it only happens occasionally (though slightly more often on a fully loaded machine), that is probably how I missed it in my testing. I'll work on a patch and let you know when it is ready. Unfortunately it probably won't be for a couple weeks. :(

You can increase the verbose level for all of the fault tolerance frameworks and components through MCA parameters. They are referenced in the FT C/R User Doc on the Open MPI wiki, and you can access them through 'ompi_info'. You will look for the following frameworks/components:
- crs/blcr
- snapc/full
- crcp/bkmrk
- opal_cr_verbose
- orte_cr_verbose
- ompi_cr_verbose

Thanks for the bug report. I filed a ticket in our bug tracker and CC'ed you on it. The ticket is: http://svn.open-mpi.org/trac/ompi/ticket/1619

Cheers, Josh

On Oct 31, 2008, at 10:51 AM, Matthias Hovestadt wrote:

Hi Tim! First of all: thanks a lot for answering! :-)

Could you try running your two MPI jobs with fewer procs each, say 2 or 3 each instead of 4, so that there are a few extra cores available.

This problem occurs with any number of procs.

Also, what happens to the checkpointing of one MPI job if you kill the other MPI job after the first "hangs"?

Nothing, it keeps hanging.

> (It may not be a true hang, but very very slow progress that you are observing.)

I already waited for more than 12 hours, but the ompi-checkpoint did not return. So if it's slow, it must be very slow.
I continued testing and just observed a case where the problem occurred with only one job running on the compute node:
---
ccs@grid-demo-1:~$ ps auxww | grep mpirun | grep -v grep
ccs 7706 0.4 0.2 63864 2640 ? S 15:35 0:00 mpirun -np 1 -am ft-enable-cr -np 6 /home/ccs/XN-OMPI/testdrive/loop-1/remotedir/mpi-x-povray +I planet.pov -w1600 -h1200 +SP1 +O planet.tga
ccs@grid-demo-1:~$
---
The resource management system tried to checkpoint this job using the command "ompi-checkpoint -v --term 7706". This is the output of that command:
---
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08178] PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08178] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Pending (Termination) - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Running - Global Snapshot Reference: (null)
---
If I look at the activity on the node, I see that the processes are still computing:
---
 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7710 ccs 25 0 327m 6936 4052 R 102 0.7 4:14.17 mpi-x-povray
7712 ccs 25 0 327m 6884 4000 R 102 0.7 3:34.06 mpi-x-povray
7708 ccs 25 0 327m 6896 4012 R 66 0.7 2:42.10 mpi-x-povray
7707 ccs 25 0 331m 10m 3736 R 54 1.0 3:08.62 mpi-x-povray
7709 ccs 25 0 327m 6940 4056 R 48 0.7 1:48.24 mpi-x-povray
7711 ccs 25 0 327m 6724 4032 R 36 0.7 1:29.34 mpi-x-povray
---
Now I killed the hanging ompi-checkpoint operation and tried to execute a checkpoint manually:
---
ccs@grid-demo-1:~$ ompi-checkpoint -v --term 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08224] PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08224] Terminating after checkpoint
[OMPI users] MPI + Mixed language coding(Fortran90 + C++)
Hello MPI Users, I am completely new to MPI. I have a basic question concerning MPI and mixed language coding. I hope any of you could help me out. Is it possible to access FORTRAN common blocks in C++ in an MPI-compiled code? It works without MPI, but as soon as I switch to MPI the access of the common block does not work anymore. I have a Linux MPI executable which loads a shared library at runtime and resolves all undefined symbols etc. The shared library is written in C++ and the MPI executable is written in FORTRAN. Some of the input that the shared library is looking for is in the Fortran common blocks. As I access those common blocks during runtime the values are not initialized. I would like to know if what I am doing is possible? I hope that my problem is clear. Your valuable suggestions are welcome !!! Thank you, Rajesh
Re: [OMPI users] users Digest, Vol 1052, Issue 1
It looks like the daemon isn't seeing the other interface address on host x2. Can you ssh to x2 and send the contents of ifconfig -a? Ralph

On Oct 31, 2008, at 9:18 AM, Allan Menezes wrote:

users-requ...@open-mpi.org wrote:

Send users mailing list submissions to us...@open-mpi.org
To subscribe or unsubscribe via the World Wide Web, visit http://www.open-mpi.org/mailman/listinfo.cgi/users or, via email, send a message with subject or body 'help' to users-requ...@open-mpi.org
You can reach the person managing the list at users-ow...@open-mpi.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of users digest..."

Today's Topics:
1. Openmpi ver1.3beta1 (Allan Menezes)
2. Re: Openmpi ver1.3beta1 (Ralph Castain)
3. Re: Equivalent .h files (Benjamin Lamptey)
4. Re: Equivalent .h files (Jeff Squyres)
5. ompi-checkpoint is hanging (Matthias Hovestadt)
6. unsubscibe (Bertrand P. S. Russell)
7. Re: ompi-checkpoint is hanging (Tim Mattox)

--
Message: 1
Date: Fri, 31 Oct 2008 02:06:09 -0400
From: Allan Menezes
Subject: [OMPI users] Openmpi ver1.3beta1
To: us...@open-mpi.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Hi, I built Open MPI version 1.3b1 with the following configure command:
./configure --prefix=/opt/openmpi13b1 --enable-mpi-threads --with-threads=posix --disable-ipv6
I have six nodes x1..6. I distributed /opt/openmpi13b1 with scp to all the other nodes from the head node. When I run the following command:
mpirun --prefix /opt/openmpi13b1 --host x1 hostname
it works on x1, printing out the hostname of x1. But when I type
mpirun --prefix /opt/openmpi13b1 --host x2 hostname
it hangs and does not give me any output. I have a 6-node Intel quad core cluster with OSCAR and PCI Express gigabit ethernet for eth0. Can somebody advise? Thank you very much.
Allan Menezes

--
Message: 2
Date: Fri, 31 Oct 2008 02:41:59 -0600
From: Ralph Castain
Subject: Re: [OMPI users] Openmpi ver1.3beta1
To: Open MPI Users
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

When you typed the --host x1 command, were you sitting on x1? Likewise, when you typed the --host x2 command, were you not on host x2? If the answer to both questions is "yes", then my guess is that something is preventing you from launching a daemon on host x2. Try adding --leave-session-attached to your cmd line and see if any error messages appear. And check the FAQ for tips on how to set up for ssh launch (I'm assuming that is what you are using). http://www.open-mpi.org/faq/?category=rsh Ralph

On Oct 31, 2008, at 12:06 AM, Allan Menezes wrote:

Hi Ralph, Yes that is true. I tried both commands on x1, and ver 1.2.8 works on the same setup without a problem. Here is the output with the added --leave-session-attached:
[allan@x1 ~]$ mpiexec --prefix /opt/openmpi13b2 --leave-session-attached -host x2 hostname
[x2.brampton.net:02236] [[1354,0],1]-[[1354,0],0] mca_oob_tcp_peer_try_connect: connect to 192.168.0.198:0 failed: Network is unreachable (101)
[x2.brampton.net:02236] [[1354,0],1]-[[1354,0],0] mca_oob_tcp_peer_try_connect: connect to 192.168.122.1:0 failed: Network is unreachable (101)
[x2.brampton.net:02236] [[1354,0],1] routed:binomial: Connection to lifeline [[1354,0],0] lost
--
A daemon (pid 7665) died unexpectedly with status 1 while attempting to launch so we are aborting. There may be more information reported by the environment (see above). This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--
mpiexec noticed that the job aborted, but has no info as to the process that caused that situation.
--
mpiexec: clean termination accomplished
[allan@x1 ~]$

However my main eth0 IP is 192.168.1.1 and the internet gateway is 192.168.0.1. Any solutions? Allan Menezes

Hi, I built Open MPI version 1.3b1 with the following configure command:
./configure --prefix=/opt/openmpi13b1 --enable-mpi-threads --with-threads=posix --disable-ipv6
I have six nodes x1..6. I distributed /opt/openmpi13b1 with scp to all the other nodes from the head node. When I run the following command:
mpirun --prefix /opt/openmpi13b1 --host x1
[OMPI users] MPI_Type_create_darray causes MPI_File_set_view to crash when ndims=2, array_of_gsizes[0]>array_of_gsizes[1]
Hi again,

The problem in a nutshell: it looks like, when I use MPI_Type_create_darray with an argument array_of_gsizes where array_of_gsizes[0] > array_of_gsizes[1], the datatype returned goes through MPI_Type_commit() just fine, but then it causes MPI_File_set_view to crash!! Any idea as to why this is happening?

A

Antonio Molins, PhD Candidate
Medical Engineering and Medical Physics
Harvard - MIT Division of Health Sciences and Technology
--
"When a traveler reaches a fork in the road, the ℓ1-norm tells him to take either one way or the other, but the ℓ2-norm instructs him to head off into the bushes." John F. Claerbout and Francis Muir, 1973

*** glibc detected *** double free or corruption (!prev): 0x00cf4130 ***
[login4:26709] *** Process received signal ***
[login4:26708] *** Process received signal ***
[login4:26708] Signal: Aborted (6)
[login4:26708] Signal code: (-6)
[login4:26709] Signal: Segmentation fault (11)
[login4:26709] Signal code: Address not mapped (1)
[login4:26709] Failing at address: 0x18
[login4:26708] [ 0] /lib64/tls/libpthread.so.0 [0x36ff10c5b0]
[login4:26708] [ 1] /lib64/tls/libc.so.6(gsignal+0x3d) [0x36fe62e26d]
[login4:26708] [ 2] /lib64/tls/libc.so.6(abort+0xfe) [0x36fe62fa6e]
[login4:26708] [ 3] /lib64/tls/libc.so.6 [0x36fe6635f1]
[login4:26708] [ 4] /lib64/tls/libc.so.6 [0x36fe6691fe]
[login4:26708] [ 5] /lib64/tls/libc.so.6(__libc_free+0x76) [0x36fe669596]
[login4:26708] [ 6] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.0 [0x2a962cc4ae]
[login4:26708] [ 7] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.0(ompi_ddt_destroy+0x65) [0x2a962cd31d]
[login4:26708] [ 8] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.
0(MPI_Type_free+0x5b) [0x2a962f654f]
[login4:26708] [ 9] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so(ADIOI_Flatten+0x1804) [0x2aa4603612]
[login4:26708] [10] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so(ADIOI_Flatten_datatype+0xe7) [0x2aa46017fd]
[login4:26708] [11] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so(ADIO_Set_view+0x14f) [0x2aa45ecb57]
[login4:26708] [12] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_set_view+0x1dd) [0x2aa46088a9]
[login4:26708] [13] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so [0x2aa45ec288]
[login4:26708] [14] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.0(MPI_File_set_view+0x53) [0x2a963002ff]
[login4:26708] [15] ./bin/test2(_ZN14pMatCollection3getEiP7pMatrix+0xc3) [0x42a50b]
[login4:26708] [16] ./bin/test2(main+0xc2e) [0x43014a]
[login4:26708] [17] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x36fe61c40b]
[login4:26708] [18] ./bin/test2(_ZNSt8ios_base4InitD1Ev+0x42) [0x41563a]
[login4:26708] *** End of error message ***
[login4:26709] [ 0] /lib64/tls/libpthread.so.0 [0x36ff10c5b0]
[login4:26709] [ 1] /lib64/tls/libc.so.6 [0x36fe66882b]
[login4:26709] [ 2] /lib64/tls/libc.so.6 [0x36fe668f8d]
[login4:26709] [ 3] /lib64/tls/libc.so.6(__libc_free+0x76) [0x36fe669596]
[login4:26709] [ 4] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.0 [0x2a962cc4ae]
[login4:26709] [ 5] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.0(ompi_ddt_release_args+0x93) [0x2a962d5641]
[login4:26709] [ 6] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.0 [0x2a962cc514]
[login4:26709] [ 7] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.0(ompi_ddt_release_args+0x93) [0x2a962d5641]
[login4:26709] [ 8] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.0 [0x2a962cc514]
[login4:26709] [ 9] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.0(ompi_ddt_destroy+0x65) [0x2a962cd31d]
[login4:26709] [10] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.
0(MPI_Type_free+0x5b) [0x2a962f654f]
[login4:26709] [11] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so(ADIOI_Flatten+0x147) [0x2aa4601f55]
[login4:26709] [12] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so(ADIOI_Flatten+0x1569) [0x2aa4603377]
[login4:26709] [13] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so(ADIOI_Flatten_datatype+0xe7) [0x2aa46017fd]
[login4:26709] [14] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so(ADIO_Set_view+0x14f) [0x2aa45ecb57]
[login4:26709] [15] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_set_view+0x1dd) [0x2aa46088a9]
[login4:26709] [16] /opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_io_romio.so [0x2aa45ec288]
[login4:26709] [17] /opt/apps/intel10_1/openmpi/1.3/lib/libmpi.so.0(MPI_File_set_view+0x53) [0x2a963002ff]
[login4:26709] [18] ./bin/test2(_ZN14pMatCollection3getEiP7pMatrix+0xc3) [0x42a50b]
[login4:26709] [19] ./bin/test2(main+0xc2e) [0x43014a]
[login4:26709] [20] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x36fe61c40b]
[login4:26709]
Re: [OMPI users] users Digest, Vol 1052, Issue 1
users-requ...@open-mpi.org wrote: Send users mailing list submissions to us...@open-mpi.org To subscribe or unsubscribe via the World Wide Web, visit http://www.open-mpi.org/mailman/listinfo.cgi/users or, via email, send a message with subject or body 'help' to users-requ...@open-mpi.org You can reach the person managing the list at users-ow...@open-mpi.org When replying, please edit your Subject line so it is more specific than "Re: Contents of users digest..." Today's Topics: 1. Openmpi ver1.3beta1 (Allan Menezes) 2. Re: Openmpi ver1.3beta1 (Ralph Castain) 3. Re: Equivalent .h files (Benjamin Lamptey) 4. Re: Equivalent .h files (Jeff Squyres) 5. ompi-checkpoint is hanging (Matthias Hovestadt) 6. unsubscibe (Bertrand P. S. Russell) 7. Re: ompi-checkpoint is hanging (Tim Mattox) -- Message: 1 Date: Fri, 31 Oct 2008 02:06:09 -0400 From: Allan MenezesSubject: [OMPI users] Openmpi ver1.3beta1 To: us...@open-mpi.org Message-ID: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi, I built open mpi version 1.3b1 withe following cofigure command: ./configure --prefix=/opt/openmpi13b1 --enable-mpi-threads --with-threads=posix --disable-ipv6 I have six nodes x1..6 I distributed the /opt/openmpi13b1 with scp to all other nodes from the head node When i run the following command: mpirun --prefix /opt/openmpi13b1 --host x1 hostname it works on x1 printing out the hostname of x1 But when i type mpirun --prefix /opt/openmpi13b1 --host x2 hostname it hangs and does not give me any output I have a 6 node intel quad core cluster with OSCAR and pci express gigabit ethernet for eth0 Can somebody advise? Thank you very much. Allan Menezes -- Message: 2 Date: Fri, 31 Oct 2008 02:41:59 -0600 From: Ralph Castain Subject: Re: [OMPI users] Openmpi ver1.3beta1 To: Open MPI Users Message-ID: Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes When you typed the --host x1 command, were you sitting on x1? 
Likewise, when you typed the --host x2 command, were you not on host x2? If the answer to both questions is "yes", then my guess is that something is preventing you from launching a daemon on host x2. Try adding --leave-session-attached to your command line and see if any error messages appear. And check the FAQ for tips on how to set up for ssh launch (I'm assuming that is what you are using):

  http://www.open-mpi.org/faq/?category=rsh

Ralph

On Oct 31, 2008, at 12:06 AM, Allan Menezes wrote:

Hi Ralph,

Yes, that is true: I tried both commands on x1, and version 1.2.8 works on the same setup without a problem. Here is the output with the added --leave-session-attached:

[allan@x1 ~]$ mpiexec --prefix /opt/openmpi13b2 --leave-session-attached -host x2 hostname
[x2.brampton.net:02236] [[1354,0],1]-[[1354,0],0] mca_oob_tcp_peer_try_connect: connect to 192.168.0.198:0 failed: Network is unreachable (101)
[x2.brampton.net:02236] [[1354,0],1]-[[1354,0],0] mca_oob_tcp_peer_try_connect: connect to 192.168.122.1:0 failed: Network is unreachable (101)
[x2.brampton.net:02236] [[1354,0],1] routed:binomial: Connection to lifeline [[1354,0],0] lost
--------------------------------------------------------------------------
A daemon (pid 7665) died unexpectedly with status 1 while attempting to launch, so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes, and this will automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
mpiexec noticed that the job aborted, but has no info as to the process that caused that situation.
mpiexec: clean termination accomplished
[allan@x1 ~]$

However, my main eth0 IP is 192.168.1.1 and my internet gateway is 192.168.0.1. Any solutions?
Allan Menezes
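[Editor's note: the "Network is unreachable" messages above show the daemon on x2 trying to call back to the head node via 192.168.0.198 and 192.168.122.1 (the latter looks like a libvirt/virtual bridge address) rather than over the 192.168.1.x network on eth0. One thing worth trying, as a sketch only, is restricting Open MPI's out-of-band and TCP traffic to the interface that actually connects the nodes, assuming that interface is eth0 on every node:

```
mpirun --prefix /opt/openmpi13b1 \
       --mca oob_tcp_if_include eth0 \
       --mca btl_tcp_if_include eth0 \
       --host x2 hostname
```

If the daemon then reaches the head node on 192.168.1.1, the original hang was interface selection, not ssh or library paths.]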
Re: [OMPI users] ompi-checkpoint is hanging
Hi Tim!

First of all: thanks a lot for answering! :-)

> Could you try running your two MPI jobs with fewer procs each, say 2
> or 3 each instead of 4, so that there are a few extra cores available.

This problem occurs with any number of procs.

> Also, what happens to the checkpointing of one MPI job if you kill
> the other MPI job after the first "hangs"?

Nothing, it keeps hanging.

> (It may not be a true hang, but very very slow progress that you
> are observing.)

I already waited for more than 12 hours, but the ompi-checkpoint did not return. So if it's slow, it must be very slow.

I continued testing and just observed a case where the problem occurred with only one job running on the compute node:

---
ccs@grid-demo-1:~$ ps auxww | grep mpirun | grep -v grep
ccs 7706 0.4 0.2 63864 2640 ? S 15:35 0:00 mpirun -np 1 -am ft-enable-cr -np 6 /home/ccs/XN-OMPI/testdrive/loop-1/remotedir/mpi-x-povray +I planet.pov -w1600 -h1200 +SP1 +O planet.tga
ccs@grid-demo-1:~$
---

The resource management system tried to checkpoint this job using the command "ompi-checkpoint -v --term 7706". This is the output of that command:

---
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08178] PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08178] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Pending (Termination) - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Running - Global Snapshot Reference: (null)
---

If I look at the activity on the node, I see that the processes are still computing:

---
 PID USER PR NI VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
7710 ccs  25  0 327m 6936 4052 R  102  0.7 4:14.17 mpi-x-povray
7712 ccs  25  0 327m 6884 4000 R  102  0.7 3:34.06 mpi-x-povray
7708 ccs  25  0 327m 6896 4012 R   66  0.7 2:42.10 mpi-x-povray
7707 ccs  25  0 331m  10m 3736 R   54  1.0 3:08.62 mpi-x-povray
7709 ccs  25  0 327m 6940 4056 R   48  0.7 1:48.24 mpi-x-povray
7711 ccs  25  0 327m 6724 4032 R   36  0.7 1:29.34 mpi-x-povray
---

Now I killed the hanging ompi-checkpoint operation and tried to execute a checkpoint manually:

---
ccs@grid-demo-1:~$ ompi-checkpoint -v --term 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08224] PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08224] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
---

Is there perhaps a way of increasing the level of debug output? Please let me know if I can support you in any way...

Best,
Matthias
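[Editor's note: regarding the question about more debug output, each Open MPI framework generally exposes a `<framework>_base_verbose` MCA parameter, so the checkpoint-related frameworks (snapc, crs) can be made chattier. The parameter names below are assumptions for a 1.3-era checkpoint/restart build; verify what your build actually supports with ompi_info before relying on them:

```
# list the C/R-related MCA parameters this build knows about
ompi_info --param snapc all
ompi_info --param crs all

# re-run the job with verbose checkpoint frameworks (names assumed; see above)
mpirun -np 4 -am ft-enable-cr \
       -mca snapc_base_verbose 20 -mca crs_base_verbose 20 \
       mpi-x-povray +I planet.pov -w1200 -h1000 +SP1 +O planet.tga
```

The extra output should show which daemon or app process the snapshot coordinator is still waiting on when ompi-checkpoint hangs.]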
Re: [OMPI users] Issues with MPI_Type_create_darray
Hi again,

Using MPI_Type_get_true_extent(), I changed the way of reporting type size and extent to:

int typesize;
MPI_Aint typeextent, typelb;
MPI_Type_size(this->datatype, &typesize);
MPI_Type_get_true_extent(this->datatype, &typelb, &typeextent);
//MPI_Type_lb(this->datatype, &typelb);
//MPI_Type_extent(this->datatype, &typeextent);
printf("\ntype size for process rank (%d,%d) is %d doubles, type extent is %d doubles (up to %d), range is [%d, %d].\n",
       pr, pc, typesize/(int)sizeof(double), (int)(typeextent/sizeof(double)), nx*ny,
       (int)(typelb/sizeof(double)), (int)((typelb+typeextent)/sizeof(double)));

which now gives me the correct answers for both situations. For the first one (which works):

type size for process rank (1,0) is 20 doubles, type extent is 60 doubles (up to 91), range is [28, 88].
type size for process rank (0,0) is 32 doubles, type extent is 81 doubles (up to 91), range is [0, 81].
type size for process rank (0,1) is 24 doubles, type extent is 80 doubles (up to 91), range is [4, 84].
type size for process rank (1,1) is 15 doubles, type extent is 59 doubles (up to 91), range is [32, 91].

For the second one (before getting the same double-free error with MPI_File_set_view):

type size for process rank (1,0) is 20 doubles, type extent is 48 doubles (up to 91), range is [4, 52].
type size for process rank (0,0) is 32 doubles, type extent is 51 doubles (up to 91), range is [0, 51].
type size for process rank (0,1) is 24 doubles, type extent is 38 doubles (up to 91), range is [52, 90].
type size for process rank (1,1) is 15 doubles, type extent is 35 doubles (up to 91), range is [56, 91].

Can anybody give me a hint here? Is there a bug in MPI_Type_create_darray I should be aware of?

Best,
A

On Oct 30, 2008, at 5:21 PM, Antonio Molins wrote:

Hi all,

I am having some trouble with this function.
I want to map data to a 2x2 block-cyclic configuration in C, using the code:

MPI_Barrier(blacs_comm);
// size of each matrix
int *array_of_gsizes = new int[2];
array_of_gsizes[0] = this->nx;
array_of_gsizes[1] = this->ny;
// block-cyclic distribution used by ScaLAPACK
int *array_of_distrs = new int[2];
array_of_distrs[0] = MPI_DISTRIBUTE_CYCLIC;
array_of_distrs[1] = MPI_DISTRIBUTE_CYCLIC;
int *array_of_dargs = new int[2];
array_of_dargs[0] = BLOCK_SIZE;
array_of_dargs[1] = BLOCK_SIZE;
int *array_of_psizes = new int[2];
array_of_psizes[0] = Pr;
array_of_psizes[1] = Pc;
int rank = pc + pr*Pc;
MPI_Type_create_darray(Pr*Pc, rank, 2, array_of_gsizes, array_of_distrs,
                       array_of_dargs, array_of_psizes, MPI_ORDER_C,
                       MPI_DOUBLE, &this->datatype);
MPI_Type_commit(&this->datatype);
int typesize;
MPI_Aint typeextent;
MPI_Type_size(this->datatype, &typesize);
MPI_Type_extent(this->datatype, &typeextent);
printf("type size for process rank (%d,%d) is %d doubles, type extent is %d doubles (up to %d).",
       pr, pc, typesize/(int)sizeof(double), (int)(typeextent/sizeof(double)), nx*ny);
MPI_File_open(blacs_comm, (char*)filename, MPI_MODE_RDWR, MPI_INFO_NULL, &this->fid);
MPI_File_set_view(this->fid, this->offset + i*nx*ny*sizeof(double), MPI_DOUBLE,
                  this->datatype, "native", MPI_INFO_NULL);

This works well when used like this, but the problem is that the matrix itself is written to disk in column-major fashion, so I would want to use the code as if I were reading it transposed, that is:

MPI_Barrier(blacs_comm);
// size of each matrix (transposed)
int *array_of_gsizes = new int[2];
array_of_gsizes[0] = this->ny;
array_of_gsizes[1] = this->nx;
// block-cyclic distribution used by ScaLAPACK
int *array_of_distrs = new int[2];
array_of_distrs[0] = MPI_DISTRIBUTE_CYCLIC;
array_of_distrs[1] = MPI_DISTRIBUTE_CYCLIC;
int *array_of_dargs = new int[2];
array_of_dargs[0] = BLOCK_SIZE;
array_of_dargs[1] = BLOCK_SIZE;
int *array_of_psizes = new int[2];
array_of_psizes[0] = Pr;
array_of_psizes[1] = Pc;
int rank = pr + pc*Pr;
MPI_Type_create_darray(Pr*Pc, rank, 2, array_of_gsizes, array_of_distrs,
                       array_of_dargs, array_of_psizes, MPI_ORDER_C,
                       MPI_DOUBLE, &this->datatype);
MPI_Type_commit(&this->datatype);
MPI_Type_size(this->datatype, &typesize);
MPI_Type_extent(this->datatype, &typeextent);
printf("type size for process rank (%d,%d) is %d doubles, type extent is %d doubles (up to %d).",
       pr, pc, typesize/(int)sizeof(double), (int)(typeextent/sizeof(double)), nx*ny);
MPI_File_open(blacs_comm, (char*)filename, MPI_MODE_RDWR, MPI_INFO_NULL, &this->fid);
MPI_File_set_view(this->fid, this->offset + i*nx*ny*sizeof(double), MPI_DOUBLE,
                  this->datatype, "native", MPI_INFO_NULL);

To my surprise, this code crashes while calling MPI_File_set_view()!!! And before you ask, I did try switching
Re: [OMPI users] ompi-checkpoint is hanging
Hello Matthias,

Hopefully Josh will chime in shortly. But I have one suggestion to help diagnose this. Could you try running your two MPI jobs with fewer procs each, say 2 or 3 each instead of 4, so that there are a few extra cores available? I know that isn't a solution, but it may help us diagnose what is going on. (It may not be a true hang, but very very slow progress that you are observing.) Also, what happens to the checkpointing of one MPI job if you kill the other MPI job after the first "hangs"?

On Fri, Oct 31, 2008 at 8:18 AM, Matthias Hovestadt wrote:
> Hi!
>
> I'm using the development version of OMPI from SVN (rev. 19857)
> for executing MPI jobs on my cluster system. I'm particularly using
> the checkpoint and restart feature, based on the most current version
> of BLCR.
>
> The checkpointing works fine as long as I only execute a single job
> on a node. If more than one MPI application is executing on a system,
> ompi-checkpoint sometimes does not return, hanging forever.
>
> Example: checkpointing with a single running application
>
> I'm using the MPI-enabled flavor of Povray as a demo application. So I'm
> starting it on a node using the following command:
>
> mpirun -np 4 mpi-x-povray +I planet.pov -w1200 -h1000 +SP1 \
>        +O planet.tga
>
> This gives me 4 MPI processes, all running on the local node.
> Checkpointing it with
>
> ompi-checkpoint -v --term 7022
>
> (where 7022 is the PID of the mpirun process) gives me a checkpoint
> dataset ompi_global_snapshot_7022.ckpt that can be used for restarting
> the job.
>
> The ompi-checkpoint command gives the following output:
>
> ---
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: Checkpointing...
> [grid-demo-1.cit.tu-berlin.de:07480] PID 7022
> [grid-demo-1.cit.tu-berlin.de:07480] Connected to Mpirun [[2899,0],0]
> [grid-demo-1.cit.tu-berlin.de:07480] Terminating after checkpoint
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7022
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:07480] Requested - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:07480] Pending (Termination) - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:07480] Running - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:07480] File Transfer - Global Snapshot Reference: (null)
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
> [grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
> [grid-demo-1.cit.tu-berlin.de:07480] Finished - Global Snapshot Reference: ompi_global_snapshot_7022.ckpt
> Snapshot Ref.: 0 ompi_global_snapshot_7022.ckpt
> ---
>
> Example: checkpointing with two running applications
>
> Similar to the first example, I'm again using the MPI-enabled flavor
> of Povray as a demo application. But now I'm starting not just a single
> Povray computation but a second one in parallel. This gives me 8 MPI
> processes (4 processes for each MPI job), so that the 8 cores of my
> system are fully utilized.
>
> Without checkpointing, these two jobs execute without any problem,
> each resulting in a Povray image. However, if I use the ompi-checkpoint
> command to checkpoint one of these two jobs, ompi-checkpoint is in
> danger of not returning. Again I'm executing
>
> ompi-checkpoint -v --term 13572
>
> (where 13572 is the PID of the mpirun process). This command gives
> the following output, not returning back to the user:
>
> ---
> [grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: Checkpointing...
> [grid-demo-1.cit.tu-berlin.de:14252] PID 13572
> [grid-demo-1.cit.tu-berlin.de:14252] Connected to Mpirun [[9529,0],0]
> [grid-demo-1.cit.tu-berlin.de:14252] Terminating after checkpoint
> [grid-demo-1.cit.tu-berlin.de:14252]
[OMPI users] unsubscibe
--
There is much pleasure to be gained from useless knowledge.

Bertrand P. S. Russell
TROSY-NMR Lab, Singapore.
[OMPI users] ompi-checkpoint is hanging
Hi!

I'm using the development version of OMPI from SVN (rev. 19857) for executing MPI jobs on my cluster system. I'm particularly using the checkpoint and restart feature, based on the most current version of BLCR.

The checkpointing works fine as long as I only execute a single job on a node. If more than one MPI application is executing on a system, ompi-checkpoint sometimes does not return, hanging forever.

Example: checkpointing with a single running application

I'm using the MPI-enabled flavor of Povray as a demo application. So I'm starting it on a node using the following command:

mpirun -np 4 mpi-x-povray +I planet.pov -w1200 -h1000 +SP1 \
       +O planet.tga

This gives me 4 MPI processes, all running on the local node. Checkpointing it with

ompi-checkpoint -v --term 7022

(where 7022 is the PID of the mpirun process) gives me a checkpoint dataset ompi_global_snapshot_7022.ckpt that can be used for restarting the job. The ompi-checkpoint command gives the following output:

---
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:07480] PID 7022
[grid-demo-1.cit.tu-berlin.de:07480] Connected to Mpirun [[2899,0],0]
[grid-demo-1.cit.tu-berlin.de:07480] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7022
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Pending (Termination) - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Running - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] File Transfer - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Finished - Global Snapshot Reference: ompi_global_snapshot_7022.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_7022.ckpt
---

Example: checkpointing with two running applications

Similar to the first example, I'm again using the MPI-enabled flavor of Povray as a demo application. But now I'm starting not just a single Povray computation but a second one in parallel. This gives me 8 MPI processes (4 processes for each MPI job), so that the 8 cores of my system are fully utilized.

Without checkpointing, these two jobs execute without any problem, each resulting in a Povray image. However, if I use the ompi-checkpoint command to checkpoint one of these two jobs, ompi-checkpoint is in danger of not returning. Again I'm executing

ompi-checkpoint -v --term 13572

(where 13572 is the PID of the mpirun process). This command gives the following output, not returning back to the user:

---
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:14252] PID 13572
[grid-demo-1.cit.tu-berlin.de:14252] Connected to Mpirun [[9529,0],0]
[grid-demo-1.cit.tu-berlin.de:14252] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13572
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:14252] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:14252] Pending (Termination) - Global Snapshot Reference: (null)
Re: [OMPI users] Equivalent .h files
The Open MPI that ships with Leopard does not include Fortran support because OS X does not ship with a Fortran compiler (this was Apple's decision, not ours). If you have Fortran MPI applications, you'll need to a) download and install your own Fortran compiler (e.g., http://hpc.sf.net/), and b) install your own copy of Open MPI that includes Fortran support (e.g., install it to /opt/openmpi or somesuch -- I do not recommend installing it over the system-installed Open MPI). Once you do this, mpif90 should work as expected, and statements like "use mpi" or "include 'mpif.h'" should function properly.

On Oct 31, 2008, at 5:48 AM, Benjamin Lamptey wrote:

Hello again,

I have to be more specific with my problem.

1) I am using the Mac OS X (Leopard) operating system. When I do uname -a, I get Darwin Kernel Version 9.5.0.

2) My code is Fortran 90.

3) I tried using the mpif90 wrapper and I got the following message:

mpif90 -c -O3 /Users/lamptey/projectb/src/blag_real_burnmpi.f90
--------------------------------------------------------------------------
Unfortunately, this installation of Open MPI was not compiled with
Fortran 90 support. As such, the mpif90 compiler is non-functional.
--------------------------------------------------------------------------
make: *** [blag_real_burnmpi.o] Error 1

4) I have the g95 compiler installed. So when I try using g95 (with include "mpif.h" or 'mpif.h'), I get the following message:

g95 -fno-pic -c -O3 /Users/lamptey/projectb/src/blag_real_burnmpi.f90
Error: Can't open included file 'mpif.h'
make: *** [blag_real_burnmpi.o] Error 1

5) What are people's experiences in this case?

Thanks,
Ben

On Thu, Oct 30, 2008 at 2:33 PM, Benjamin Lamptey wrote:

Hello,
I am new to using Open MPI and would like to know something basic. What is the equivalent of "mpif.h" in Open MPI, which is normally "included" at the beginning of MPI codes (Fortran in this case)? I would appreciate that for cpp as well.

Thanks,
Ben

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
Cisco Systems
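[Editor's note: as a sketch of step (b) above, building a personal Open MPI with g95 might look like the following. The install prefix and source-tree location are assumptions; Open MPI's configure honors the standard F77/FC variables for selecting the Fortran compilers:

```
# from an Open MPI source tree, build into a private prefix with g95
./configure --prefix=/opt/openmpi F77=g95 FC=g95
make all install

# use the new wrapper compilers, which know where mpif.h lives
export PATH=/opt/openmpi/bin:$PATH
mpif90 -c -O3 blag_real_burnmpi.f90
```

Using the mpif90 wrapper rather than invoking g95 directly also avoids the "Can't open included file 'mpif.h'" error, since the wrapper adds the right include path automatically.]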
Re: [OMPI users] Equivalent .h files
Hello again,

I have to be more specific with my problem.

1) I am using the Mac OS X (Leopard) operating system. When I do uname -a, I get Darwin Kernel Version 9.5.0.

2) My code is Fortran 90.

3) I tried using the mpif90 wrapper and I got the following message:

mpif90 -c -O3 /Users/lamptey/projectb/src/blag_real_burnmpi.f90
--------------------------------------------------------------------------
Unfortunately, this installation of Open MPI was not compiled with
Fortran 90 support. As such, the mpif90 compiler is non-functional.
--------------------------------------------------------------------------
make: *** [blag_real_burnmpi.o] Error 1

4) I have the g95 compiler installed. So when I try using g95 (with include "mpif.h" or 'mpif.h'), I get the following message:

g95 -fno-pic -c -O3 /Users/lamptey/projectb/src/blag_real_burnmpi.f90
Error: Can't open included file 'mpif.h'
make: *** [blag_real_burnmpi.o] Error 1

5) What are people's experiences in this case?

Thanks,
Ben

On Thu, Oct 30, 2008 at 2:33 PM, Benjamin Lamptey wrote:
> Hello,
> I am new to using Open MPI and would like to know something basic.
>
> What is the equivalent of "mpif.h" in Open MPI, which is normally
> "included" at the beginning of MPI codes (Fortran in this case)?
>
> I would appreciate that for cpp as well.
>
> Thanks
> Ben