Re: [OMPI users] shared memory (sm) module not working properly?
Dunno. Do lower np values succeed? If so, at what value of np does the job no longer start?

Perhaps it's having a hard time creating the shared-memory backing file in /tmp. I think this is a 64-Mbyte file. If this is the case, try reducing the size of the shared area per this FAQ item: http://www.open-mpi.org/faq/?category=sm#decrease-sm Most notably, reduce mpool_sm_min_size below 67108864.

Also note Trac ticket 2043, which describes problems with the sm BTL exposed by GCC 4.4.x compilers. You need to get a sufficiently recent build to solve this. But, those problems don't occur until you start passing messages, and here you're not even starting up.

Nicolas Bock wrote:

Sorry, I forgot to give more details on what versions I am using:

OpenMPI 1.4
Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1

On Fri, Jan 15, 2010 at 15:47, Nicolas Bock wrote:

Hello list,

I am running a job on a 4 quadcore AMD Opteron. This machine has 16 cores, which I can verify by looking at /proc/cpuinfo. However, when I run a job with

mpirun -np 16 -mca btl self,sm job

I get this error:

--
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

Process 1 ([[56972,2],0]) is on host: rust
Process 2 ([[56972,1],0]) is on host: rust
BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--

By adding the tcp btl I can run the job. I don't understand why openmpi claims that a pair of processes can not reach each other, all processor cores should have access to all memory after all. Do I need to set some other btl limit?
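If the backing file in /tmp does turn out to be the problem, the FAQ's workaround can also be put in $HOME/.openmpi/mca-params.conf instead of on the command line. A sketch only — the 16 MB value below is an illustrative guess, not a tested recommendation; size it for your /tmp per the FAQ item:

```
# $HOME/.openmpi/mca-params.conf -- illustrative sketch only
# Shrink the sm backing file below the 64 MB default (67108864 bytes).
mpool_sm_min_size=16777216
```

The equivalent command-line form would be `-mca mpool_sm_min_size 16777216` added to the mpirun invocation shown above.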
Re: [OMPI users] shared memory (sm) module not working properly?
Sorry, I forgot to give more details on what versions I am using:

OpenMPI 1.4
Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1

On Fri, Jan 15, 2010 at 15:47, Nicolas Bock wrote:
> Hello list,
>
> I am running a job on a 4 quadcore AMD Opteron. This machine has 16 cores,
> which I can verify by looking at /proc/cpuinfo. However, when I run a job
> with
>
> mpirun -np 16 -mca btl self,sm job
>
> I get this error:
>
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[56972,2],0]) is on host: rust
> Process 2 ([[56972,1],0]) is on host: rust
> BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --
>
> By adding the tcp btl I can run the job. I don't understand why openmpi
> claims that a pair of processes can not reach each other, all processor
> cores should have access to all memory after all. Do I need to set some
> other btl limit?
>
> nick
[OMPI users] shared memory (sm) module not working properly?
Hello list,

I am running a job on a 4 quadcore AMD Opteron. This machine has 16 cores, which I can verify by looking at /proc/cpuinfo. However, when I run a job with

mpirun -np 16 -mca btl self,sm job

I get this error:

--
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

Process 1 ([[56972,2],0]) is on host: rust
Process 2 ([[56972,1],0]) is on host: rust
BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--

By adding the tcp btl I can run the job. I don't understand why openmpi claims that a pair of processes can not reach each other, all processor cores should have access to all memory after all. Do I need to set some other btl limit?

nick
Re: [OMPI users] dynamic rules
> I tried this and it still crashes with openmpi-1.4. Is it supposed to work with openmpi-1.4 or do I need to compile openmpi-1.4.1?

Terribly sorry, I should have checked my own notes thoroughly before giving others advice. One needs to give the dynamic rules file location on the command line:

mpirun -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_dynamic_rules_filename /home/.openmpi/dynamic_rules_file

That works for me with openmpi 1.4. I have not tried 1.4.1 yet.

Daniel
Re: [OMPI users] Checkpoint/Restart error
It's almost midnight here, so I left home, but I will try it tomorrow. There were some directories left after "make uninstall". I will give more details tomorrow.

Thanks Jeff,
Andreea

On Fri, Jan 15, 2010 at 11:30 PM, Jeff Squyres wrote:
> On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:
>
> > - I wanted to update to version 1.4.1 and I uninstalled the previous version
> like this: make uninstall, and then manually deleted all the left over
> files. The directory where I installed was /usr/local
>
> I'll let Josh answer your CR questions, but I did want to ask about this
> point. AFAIK, "make uninstall" removes *all* Open MPI files. For example:
>
> -
> [7:25] $ cd /path/to/my/OMPI/tree
> [7:25] $ make install > /dev/null
> [7:26] $ find /tmp/bogus/ -type f | wc
>     646     646   28082
> [7:26] $ make uninstall > /dev/null
> [7:27] $ find /tmp/bogus/ -type f | wc
>       0       0       0
> [7:27] $
> -
>
> I realize that some *directories* are left in $prefix, but there should be
> no *files* left. Are you seeing something different?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Checkpoint/Restart error
On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:

> - I wanted to update to version 1.4.1 and I uninstalled the previous version like this: make uninstall, and then manually deleted all the left over files. The directory where I installed was /usr/local

I'll let Josh answer your CR questions, but I did want to ask about this point. AFAIK, "make uninstall" removes *all* Open MPI files. For example:

-
[7:25] $ cd /path/to/my/OMPI/tree
[7:25] $ make install > /dev/null
[7:26] $ find /tmp/bogus/ -type f | wc
    646     646   28082
[7:26] $ make uninstall > /dev/null
[7:27] $ find /tmp/bogus/ -type f | wc
      0       0       0
[7:27] $
-

I realize that some *directories* are left in $prefix, but there should be no *files* left. Are you seeing something different?

--
Jeff Squyres
jsquy...@cisco.com
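As a self-contained illustration of the check Jeff is doing (not Open MPI itself — a throwaway temporary directory stands in for the real install prefix), one can verify that a prefix still contains directories but no regular files:

```shell
# Stand-in for the real --prefix: after a clean "make uninstall" the
# prefix may still contain empty directories, but no regular files.
prefix=$(mktemp -d)
mkdir -p "$prefix/lib/openmpi" "$prefix/share/openmpi"   # leftover dirs
nfiles=$(find "$prefix" -type f | wc -l | tr -d ' ')
echo "files left under prefix: $nfiles"   # prints 0 for a clean uninstall
rm -r "$prefix"
```

Any nonzero count here would mean files survived the uninstall, which is the situation Jeff is asking about.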
Re: [OMPI users] dynamic rules
> I have done this according to suggestion on this list, until a fix comes
> that makes it possible to change via command line:
>
> To choose bruck for all message sizes / mpi sizes with openmpi-1.4
>
> File $HOME/.openmpi/mca-params.conf (replace /homeX) so it points to
> the correct file:
> coll_tuned_use_dynamic_rules=1
> coll_tuned_dynamic_rules_filename="/home/.openmpi/dynamic_rules_file"
> ...

I tried this and it still crashes with openmpi-1.4. Is it supposed to work with openmpi-1.4 or do I need to compile openmpi-1.4.1?

Best regards
Roman
Re: [OMPI users] Checkpoint/Restart error
I don't know what else I should try... because it worked on 1.3.3 doing exactly the same steps. I tried to install it both with an active eth interface and an inactive one. I am running on a virtual machine that has CentOS as OS. Any suggestions?

Thanks,
Andreea

On Fri, Jan 15, 2010 at 9:07 PM, Andreea Costea wrote:
> I tried the new version, that was uploaded today. I still have that error,
> just that now it is at line 405 instead of 399.
>
> Maybe if I give more details:
> - I first had OpenMPI version 1.3.3 with BLCR installed: mpirun,
> ompi-checkpoint and ompi-restart worked with that version.
> - I wanted to update to version 1.4.1 and I uninstalled the previous version
> like this: make uninstall, and then manually deleted all the left over
> files. The directory where I installed was /usr/local
> - I installed 1.4.1 in the same directory: /usr/local. Paths set
> correctly to /usr/local/bin and /usr/local/lib
> - mpirun works, ompi-checkpoint gives the following error:
> [[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
> HNP with PID 7899 Not found!
>
> I would appreciate any help,
> Andreea
>
> On Fri, Jan 15, 2010 at 1:15 PM, Andreea Costea wrote:
>> Hi...
>> still not working. Though I uninstalled OpenMPI with make uninstall and I
>> manually deleted all other files, I still have the same error when
>> checkpointing.
>>
>> Any idea?
>>
>> Thanks,
>> Andreea
>>
>> On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote:
>>> On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
>>>
>>> > Hi,
>>> >
>>> > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
>>> downloaded today. When I want to checkpoint I am having the following error
>>> message:
>>> > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399
>>> > HNP with PID 2337 Not found!
>>>
>>> This looks like an error coming from the 1.3.3 install. In 1.4.1 there is
>>> no error at line 399, in 1.3.3 there is.
Check your installation of Open MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected problems.
>>>
>>> Try a clean installation of 1.4.1 and double check that 1.3.3 is not in
>>> your path/lib_path any longer.
>>>
>>> -- Josh
>>>
>>> > I tried the same thing with version 1.3.3 and it works perfectly.
>>> >
>>> > Any idea why?
>>> >
>>> > thanks,
>>> > Andreea
Re: [OMPI users] dynamic rules
I have done this according to a suggestion on this list, until a fix comes that makes it possible to change via the command line.

To choose bruck for all message sizes / mpi sizes with openmpi-1.4:

File $HOME/.openmpi/mca-params.conf (replace /homeX) so it points to the correct file:

coll_tuned_use_dynamic_rules=1
coll_tuned_dynamic_rules_filename="/home/.openmpi/dynamic_rules_file"

File $HOME/.openmpi/dynamic_rules_file:

1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of com sizes
0 # comm size
1 # number of msg sizes
0 3 0 0 # for message size 0, bruck, topo 0, 0 segmentation
# end of collective rule

Change the number 3 to something else for other algorithms (they can be found with ompi_info -a, for example):

MCA coll: information "coll_tuned_alltoall_algorithm_count" (value: "4")
          Number of alltoall algorithms available
MCA coll: parameter "coll_tuned_alltoall_algorithm" (current value: "0")
          Which alltoall algorithm is used. Can be locked down to choice of: 0 ignore, 1 basic linear, 2 pairwise, 3: modified bruck, 4: two proc only.

HTH
Daniel Spångberg

On 2010-01-15 13:54:33, Roman Martonak wrote:

On my machine I need to use dynamic rules to enforce the bruck or pairwise algorithm for alltoall, since unfortunately the default basic linear algorithm performs quite poorly on my Infiniband network. Few months ago I noticed that in case of VASP, however, the use of dynamic rules via --mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_dynamic_rules_filename dyn_rules has no effect at all. Later it was identified that there was a bug causing the dynamic rules to apply only to the MPI_COMM_WORLD but not to other communicators. As far as I understand, the bug was fixed in openmpi-1.3.4. I tried now the openmpi-1.4 version and expected that tuning of alltoall via dynamic rules would work, but there is still no effect at all.
Even worse, now it is not even possible to use static rules (which worked previously) such as -mca coll_tuned_alltoall_algorithm 3, because the code would crash (as discussed in the list recently). When running with --mca coll_base_verbose 1000, I get messages like

[compute-0-0.local:08011] coll:sm:comm_query (0/MPI_COMM_WORLD): intercomm, comm is too small, or not all peers local; disqualifying myself
[compute-0-0.local:08011] coll:base:comm_select: component not available: sm
[compute-0-0.local:08011] coll:base:comm_select: component available: sync, priority: 50
[compute-0-3.local:26116] coll:base:comm_select: component available: self, priority: 75
[compute-0-3.local:26116] coll:sm:comm_query (1/MPI_COMM_SELF): intercomm, comm is too small, or not all peers local; disqualifying myself
[compute-0-3.local:26116] coll:base:comm_select: component not available: sm
[compute-0-3.local:26116] coll:base:comm_select: component available: sync, priority: 50
[compute-0-3.local:26116] coll:base:comm_select: component not available: tuned
[compute-0-0.local:08011] coll:base:comm_select: component available: tuned, priority: 30

Is there now a way to use other alltoall algorithms instead of the basic linear algorithm in openmpi-1.4.x?

Thanks in advance for any suggestion.

Best regards
Roman Martonak

--
Daniel Spångberg
Materialkemi
Uppsala Universitet
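To make Daniel's point about swapping the algorithm ID concrete: only the second field of the rule line changes. A hypothetical, untested variant of the same rules file that selects pairwise (ID 2 in the ompi_info listing) instead of bruck would look like this, keeping the layout unchanged:

```
1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of com sizes
0 # comm size
1 # number of msg sizes
0 2 0 0 # for message size 0, pairwise, topo 0, 0 segmentation
# end of collective rule
```

The first "3" stays the same because it identifies the alltoall collective itself; only the algorithm field on the message-size line changes.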
Re: [OMPI users] Rapid I/O support
On Jan 14, 2010, at 3:08 PM, Jeff Squyres wrote:
> On Jan 14, 2010, at 1:59 PM, TONY BASIL wrote:
>> I am doing a project with an HPC set up on a multicore Power PC. Nodes will be connected using Rapid I/O instead of Gigabit Ethernet. I would like to know if OpenMPI supports Rapid I/O...
>
> I'm afraid not. Before your post, I had never heard of Rapid IO.

Likewise. Does it support Ethernet encapsulation over it? If so, try Open-MX.

Scott
Re: [OMPI users] More NetBSD fixes
On Thu, 14 Jan 2010 21:55:06 -0500, Jeff Squyres wrote:
> That being said, you could sign up on it and then set your membership to receive no mail...?

This is especially dangerous because the Open MPI lists munge the Reply-To header, which is a bad thing: http://www.unicom.com/pw/reply-to-harmful.html But lots of mailers have poor default handling of mailing lists, so it's complicated.

With munging, a mailer's "reply-to-sender" function will send mail *only* to the list, and "reply-to-all" will send it to the list and any other recipients, but *not* the sender (unless the mailer does special detection of munged Reply-To headers). This makes it rather difficult to participate in a discussion without receiving mail from the list, or even to reliably filter list traffic (you have to write filter rules that walk the References tree to find whether it is something that would be interesting to you, and then you get false positives from people who reply to an existing thread when they wanted to make a new thread).

Jed
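For readers who want to try the References-walking filter Jed describes, here is a minimal sketch in Python using only the standard library's email module. The helper name and the idea of keeping a set of "interesting" Message-IDs are illustrative assumptions, not anything the list software provides:

```python
import email

def in_interesting_thread(raw_message, interesting_ids):
    """Return True if the message's References/In-Reply-To chain
    mentions any Message-ID we have marked as interesting."""
    msg = email.message_from_bytes(raw_message)
    # Both headers hold whitespace-separated ancestor Message-IDs.
    refs = msg.get("References", "") + " " + msg.get("In-Reply-To", "")
    return any(mid in interesting_ids for mid in refs.split())

# A toy reply whose thread root is <a@list.example>.
raw = (b"Message-ID: <c@list.example>\r\n"
       b"In-Reply-To: <b@list.example>\r\n"
       b"References: <a@list.example> <b@list.example>\r\n"
       b"Subject: Re: dynamic rules\r\n"
       b"\r\n"
       b"body\r\n")

print(in_interesting_thread(raw, {"<a@list.example>"}))  # True
print(in_interesting_thread(raw, {"<z@list.example>"}))  # False
```

Note this exhibits exactly the false positive Jed mentions: a message that "replies" to an old thread but starts a new topic still carries the old References chain, so it matches.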
Re: [OMPI users] Checkpoint/Restart error
I tried the new version, that was uploaded today. I still have that error, just that now it is at line 405 instead of 399.

Maybe if I give more details:
- I first had OpenMPI version 1.3.3 with BLCR installed: mpirun, ompi-checkpoint and ompi-restart worked with that version.
- I wanted to update to version 1.4.1 and I uninstalled the previous version like this: make uninstall, and then manually deleted all the left over files. The directory where I installed was /usr/local
- I installed 1.4.1 in the same directory: /usr/local. Paths set correctly to /usr/local/bin and /usr/local/lib
- mpirun works, ompi-checkpoint gives the following error:
[[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
HNP with PID 7899 Not found!

I would appreciate any help,
Andreea

On Fri, Jan 15, 2010 at 1:15 PM, Andreea Costea wrote:
> Hi...
> still not working. Though I uninstalled OpenMPI with make uninstall and I
> manually deleted all other files, I still have the same error when
> checkpointing.
>
> Any idea?
>
> Thanks,
> Andreea
>
> On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote:
>> On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
>>
>> > Hi,
>> >
>> > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
>> downloaded today. When I want to checkpoint I am having the following error
>> message:
>> > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at
>> line 399
>> > HNP with PID 2337 Not found!
>>
>> This looks like an error coming from the 1.3.3 install. In 1.4.1 there is
>> no error at line 399, in 1.3.3 there is. Check your installation of Open
>> MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected
>> problems.
>>
>> Try a clean installation of 1.4.1 and double check that 1.3.3 is not in
>> your path/lib_path any longer.
>>
>> -- Josh
>>
>> > I tried the same thing with version 1.3.3 and it works perfectly.
>> >
>> > Any idea why?
>> > thanks,
>> > Andreea
[OMPI users] dynamic rules
On my machine I need to use dynamic rules to enforce the bruck or pairwise algorithm for alltoall, since unfortunately the default basic linear algorithm performs quite poorly on my Infiniband network. Few months ago I noticed that in case of VASP, however, the use of dynamic rules via --mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_dynamic_rules_filename dyn_rules has no effect at all. Later it was identified that there was a bug causing the dynamic rules to apply only to the MPI_COMM_WORLD but not to other communicators. As far as I understand, the bug was fixed in openmpi-1.3.4. I tried now the openmpi-1.4 version and expected that tuning of alltoall via dynamic rules would work, but there is still no effect at all.

Even worse, now it is not even possible to use static rules (which worked previously) such as -mca coll_tuned_alltoall_algorithm 3, because the code would crash (as discussed in the list recently). When running with --mca coll_base_verbose 1000, I get messages like

[compute-0-0.local:08011] coll:sm:comm_query (0/MPI_COMM_WORLD): intercomm, comm is too small, or not all peers local; disqualifying myself
[compute-0-0.local:08011] coll:base:comm_select: component not available: sm
[compute-0-0.local:08011] coll:base:comm_select: component available: sync, priority: 50
[compute-0-3.local:26116] coll:base:comm_select: component available: self, priority: 75
[compute-0-3.local:26116] coll:sm:comm_query (1/MPI_COMM_SELF): intercomm, comm is too small, or not all peers local; disqualifying myself
[compute-0-3.local:26116] coll:base:comm_select: component not available: sm
[compute-0-3.local:26116] coll:base:comm_select: component available: sync, priority: 50
[compute-0-3.local:26116] coll:base:comm_select: component not available: tuned
[compute-0-0.local:08011] coll:base:comm_select: component available: tuned, priority: 30

Is there now a way to use other alltoall algorithms instead of the basic linear algorithm in openmpi-1.4.x?
Thanks in advance for any suggestion.

Best regards
Roman Martonak
Re: [OMPI users] Windows CMake build problems ... (cont.)
Hi Charlie,

Glad to hear that you compiled it successfully. The error you got with 1.3.4 is a bug that the CMake script didn't set the SVN information correctly, and it has been fixed in 1.4 and later.

Thanks,
Shiqing

cjohn...@valverdecomputing.com wrote:

Yes that was it. A much improved result now from CMake 2.6.4, no errors from compiling openmpi-1.4:

1>libopen-pal - 0 error(s), 9 warning(s)
2>libopen-rte - 0 error(s), 7 warning(s)
3>opal-restart - 0 error(s), 0 warning(s)
4>opal-wrapper - 0 error(s), 0 warning(s)
5>libmpi - 0 error(s), 42 warning(s)
6>orte-checkpoint - 0 error(s), 0 warning(s)
7>orte-ps - 0 error(s), 0 warning(s)
8>orted - 0 error(s), 0 warning(s)
9>orte-clean - 0 error(s), 0 warning(s)
10>orterun - 0 error(s), 3 warning(s)
11>ompi_info - 0 error(s), 0 warning(s)
12>ompi-server - 0 error(s), 0 warning(s)
13>libmpi_cxx - 0 error(s), 61 warning(s)
== Build: 13 succeeded, 0 failed, 1 up-to-date, 0 skipped ==

And only one failure from compiling openmpi-1.3.4 (the ompi_info project):

> 1>libopen-pal - 0 error(s), 9 warning(s)
> 2>libopen-rte - 0 error(s), 7 warning(s)
> 3>opal-restart - 0 error(s), 0 warning(s)
> 4>opal-wrapper - 0 error(s), 0 warning(s)
> 5>orte-checkpoint - 0 error(s), 0 warning(s)
> 6>libmpi - 0 error(s), 42 warning(s)
> 7>orte-ps - 0 error(s), 0 warning(s)
> 8>orted - 0 error(s), 0 warning(s)
> 9>orte-clean - 0 error(s), 0 warning(s)
> 10>orterun - 0 error(s), 3 warning(s)
> 11>ompi_info - 3 error(s), 0 warning(s)
> 12>ompi-server - 0 error(s), 0 warning(s)
> 13>libmpi_cxx - 0 error(s), 61 warning(s)
> == Rebuild All: 13 succeeded, 1 failed, 0 skipped ==

Here's the listing from the non-linking project:

11>-- Rebuild All started: Project: ompi_info, Configuration: Debug Win32 --
11>Deleting intermediate and output files for project 'ompi_info', configuration 'Debug|Win32'
11>Compiling...
11>version.cc
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(136) : error C2059: syntax error : ','
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(147) : error C2059: syntax error : ','
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(158) : error C2059: syntax error : ','
11>param.cc
11>output.cc
11>ompi_info.cc
11>components.cc
11>Generating Code...
11>Build log was saved at "file://c:\prog\mon\ompi\tools\ompi_info\ompi_info.dir\Debug\BuildLog.htm"
11>ompi_info - 3 error(s), 0 warning(s)

Thank you Shiqing !

Charlie ...

 Original Message 
Subject: Re: [OMPI users] Windows CMake build problems ... (cont.)
From: Shiqing Fan
Date: Thu, January 14, 2010 11:20 am
To: Open MPI Users , cjohn...@valverdecomputing.com

Hi Charlie,

The problem turns out to be the different behavior of one CMake macro in different version of CMake. And it's fixed in Open MPI trunk with r22405. I also created a ticket to move the fix over to 1.4 branch, see #2169: https://svn.open-mpi.org/trac/ompi/ticket/2169 . So you could either switch to use OMPI trunk or use CMake 2.6 to solve the problem.

Thanks a lot.

Best Regards,
Shiqing

cjohn...@valverdecomputing.com wrote:
> The OpenMPI build problem I'm having occurs in both OpenMPI 1.4 and 1.3.4.
>
> I am on a Windows 7 (US) Enterprise (x86) OS on an HP system with
> Intel core 2 extreme x9000 (4GB RAM), using the 2005 Visual Studio for
> S/W Architects (release 8.0.50727.867).
>
> [That release has everything the platform SDK would have.]
>
> I'm using CMake 2.8 to generate code, I used it correctly, pointing at
> the root directory where the makelists are located for the source side
> and to an empty directory for the build side: did configure, _*I did
> not click debug this time as suggested by Shiqing*_, configure again,
> generate and opened the OpenMPI.sln file created by CMake. Then I
> right-clicked on the "ALL_BUILD" project and selected "build".
Then
> did one "rebuild", just in case build order might get one more success
> (which it seemed to, but I could not find).
>
> 2 projects built, 12 did not. I have the build listing. [I'm afraid of
> what the mailing list server would do if I attached it to this email.]
>
> All the compiles were successful (warnings at most.) All the errors
> were from linking the VC projects:
>
> *1>libopen-pal - 0 error(s), 9 warning(s)*
> 3>opal-restart - 32 error(s), 0 warning(s)
> 4>opal-wrapper - 21 error(s), 0 warning(s)
> 2>libopen-rte - 749 error(s), 7 warning(s)
> 5>orte-checkpoint - 32 error(s), 0 warning(s)
> 7>orte-ps - 28 error(s), 0 warning(s)
> 8>orted - 2 error(s), 0 warning(s)
> 9>orte-clean - 13 error(s), 0 warning(s)
> 10>orterun - 100 error(s), 3 warning(s)
> 6>libmpi - 2133 error(s), 42 warning(s)
> 12>ompi-server - 27 error(s), 0 war
Re: [OMPI users] MPI debugger
On 11 Jan 2010, at 06:20, Jed Brown wrote:
> On Sun, 10 Jan 2010 19:29:18 +, Ashley Pittman wrote:
>> It'll show you parallel stack traces but won't let you single step for example.
>
> Two lightweight options if you want stepping, breakpoints, watchpoints, etc.
>
> * Use serial debuggers on some interesting processes, for example with
>
>   mpiexec -n 1 xterm -e gdb --args ./trouble args : -n 2 ./trouble args : -n 1 xterm -e gdb --args ./trouble args
>
> to put an xterm on rank 0 and 3 of a four process job (there are lots of other ways to get here).

You can also achieve something similar with padb by starting the job normally and then using padb to launch xterms in a similar manner, although it's been pointed out to me that this only works with one process per node right now.

> * MPICH2 has a poor-man's parallel debugger, mpiexec.mpd -gdb allows you to send the same gdb commands to each process and collate the output.

True, I'd forgotten about that; the MPICH2 people are moving away from mpd though, so I don't know how much longer that will be an option.

Ashley,
[OMPI users] Open MPI v1.4.1 released
The Open MPI Team, representing a consortium of research, academic, and industry partners, is pleased to announce the release of Open MPI version 1.4.1. This release is strictly a bug fix release over the v1.4 release.

Version 1.4.1 can be downloaded from the main Open MPI web site or any of its mirrors (mirrors will be updating shortly).

Here is a list of changes in v1.4.1 as compared to v1.4:

- Update to PLPA v1.3.2, addressing a licensing issue identified by the Fedora project. See https://svn.open-mpi.org/trac/plpa/changeset/262 for details.
- Add check for malformed checkpoint metadata files (Ticket #2141).
- Fix error path in ompi-checkpoint when not able to checkpoint (Ticket #2138).
- Cleanup component release logic when selecting checkpoint/restart enabled components (Ticket #2135).
- Fixed VT node name detection for Cray XT platforms, and fixed some broken VT documentation files.
- Fix a possible race condition in tearing down RDMA CM-based connections.
- Relax error checking on MPI_GRAPH_CREATE. Thanks to David Singleton for pointing out the issue.
- Fix a shared memory "hang" problem that occurred on x86/x86_64 platforms when used with the GNU >=4.4.x compiler series.
- Add fix for Libtool 2.2.6b's problems with the PGI 10.x compiler suite. Inspired directly from the upstream Libtool patches that fix the issue (but we need something working before the next Libtool release).
Re: [OMPI users] Windows CMake build problems ... (cont.)
Yes that was it. A much improved result now from CMake 2.6.4, no errors from compiling openmpi-1.4:

1>libopen-pal - 0 error(s), 9 warning(s)
2>libopen-rte - 0 error(s), 7 warning(s)
3>opal-restart - 0 error(s), 0 warning(s)
4>opal-wrapper - 0 error(s), 0 warning(s)
5>libmpi - 0 error(s), 42 warning(s)
6>orte-checkpoint - 0 error(s), 0 warning(s)
7>orte-ps - 0 error(s), 0 warning(s)
8>orted - 0 error(s), 0 warning(s)
9>orte-clean - 0 error(s), 0 warning(s)
10>orterun - 0 error(s), 3 warning(s)
11>ompi_info - 0 error(s), 0 warning(s)
12>ompi-server - 0 error(s), 0 warning(s)
13>libmpi_cxx - 0 error(s), 61 warning(s)
== Build: 13 succeeded, 0 failed, 1 up-to-date, 0 skipped ==

And only one failure from compiling openmpi-1.3.4 (the ompi_info project):

> 1>libopen-pal - 0 error(s), 9 warning(s)
> 2>libopen-rte - 0 error(s), 7 warning(s)
> 3>opal-restart - 0 error(s), 0 warning(s)
> 4>opal-wrapper - 0 error(s), 0 warning(s)
> 5>orte-checkpoint - 0 error(s), 0 warning(s)
> 6>libmpi - 0 error(s), 42 warning(s)
> 7>orte-ps - 0 error(s), 0 warning(s)
> 8>orted - 0 error(s), 0 warning(s)
> 9>orte-clean - 0 error(s), 0 warning(s)
> 10>orterun - 0 error(s), 3 warning(s)
> 11>ompi_info - 3 error(s), 0 warning(s)
> 12>ompi-server - 0 error(s), 0 warning(s)
> 13>libmpi_cxx - 0 error(s), 61 warning(s)
> == Rebuild All: 13 succeeded, 1 failed, 0 skipped ==

Here's the listing from the non-linking project:

11>-- Rebuild All started: Project: ompi_info, Configuration: Debug Win32 --
11>Deleting intermediate and output files for project 'ompi_info', configuration 'Debug|Win32'
11>Compiling...
11>version.cc
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(136) : error C2059: syntax error : ','
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(147) : error C2059: syntax error : ','
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(158) : error C2059: syntax error : ','
11>param.cc
11>output.cc
11>ompi_info.cc
11>components.cc
11>Generating Code...
11>Build log was saved at
"file://c:\prog\mon\ompi\tools\ompi_info\ompi_info.dir\Debug\BuildLog.htm"
11>ompi_info - 3 error(s), 0 warning(s)

Thank you Shiqing !

Charlie ...

 Original Message 
Subject: Re: [OMPI users] Windows CMake build problems ... (cont.)
From: Shiqing Fan
Date: Thu, January 14, 2010 11:20 am
To: Open MPI Users , cjohn...@valverdecomputing.com

Hi Charlie,

The problem turns out to be the different behavior of one CMake macro in different version of CMake. And it's fixed in Open MPI trunk with r22405. I also created a ticket to move the fix over to 1.4 branch, see #2169: https://svn.open-mpi.org/trac/ompi/ticket/2169 . So you could either switch to use OMPI trunk or use CMake 2.6 to solve the problem.

Thanks a lot.

Best Regards,
Shiqing

cjohn...@valverdecomputing.com wrote:
> The OpenMPI build problem I'm having occurs in both OpenMPI 1.4 and 1.3.4.
>
> I am on a Windows 7 (US) Enterprise (x86) OS on an HP system with
> Intel core 2 extreme x9000 (4GB RAM), using the 2005 Visual Studio for
> S/W Architects (release 8.0.50727.867).
>
> [That release has everything the platform SDK would have.]
>
> I'm using CMake 2.8 to generate code, I used it correctly, pointing at
> the root directory where the makelists are located for the source side
> and to an empty directory for the build side: did configure, _*I did
> not click debug this time as suggested by Shiqing*_, configure again,
> generate and opened the OpenMPI.sln file created by CMake. Then I
> right-clicked on the "ALL_BUILD" project and selected "build". Then
> did one "rebuild", just in case build order might get one more success
> (which it seemed to, but I could not find).
>
> 2 projects built, 12 did not. I have the build listing. [I'm afraid of
> what the mailing list server would do if I attached it to this email.]
>
> All the compiles were successful (warnings at most.)
All the errors
> were from linking the VC projects:
>
> *1>libopen-pal - 0 error(s), 9 warning(s)*
> 3>opal-restart - 32 error(s), 0 warning(s)
> 4>opal-wrapper - 21 error(s), 0 warning(s)
> 2>libopen-rte - 749 error(s), 7 warning(s)
> 5>orte-checkpoint - 32 error(s), 0 warning(s)
> 7>orte-ps - 28 error(s), 0 warning(s)
> 8>orted - 2 error(s), 0 warning(s)
> 9>orte-clean - 13 error(s), 0 warning(s)
> 10>orterun - 100 error(s), 3 warning(s)
> 6>libmpi - 2133 error(s), 42 warning(s)
> 12>ompi-server - 27 error(s), 0 warning(s)
> 11>ompi_info - 146 error(s), 0 warning(s)
> 13>libmpi_cxx - 456 error(s), 61 warning(s)
> == Rebuild All: 2 succeeded, 12 failed, 0 skipped ==
>
> It said that 2 succeeded, I could not find the second build success in
> the listing.
>
> *However, everything did compile, and thank you Shiqing !*
>
> Here is the listing for the first failed link, on "opal-restart":
>
> 3>-- Rebuild All started: Project: opal-restart, Configuration:
> Debug Win32 --
> 3>Deleting intermediate and output fil
Re: [OMPI users] Checkpoint/Restart error
Hi... still not working. Though I uninstalled OpenMPI with make uninstall and I manually deleted all other files, I still have the same error when checkpointing.

Any idea?

Thanks,
Andreea

On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote:
> On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
>
> > Hi,
> >
> > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
> downloaded today. When I want to checkpoint I am having the following error
> message:
> > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399
> > HNP with PID 2337 Not found!
>
> This looks like an error coming from the 1.3.3 install. In 1.4.1 there is
> no error at line 399, in 1.3.3 there is. Check your installation of Open
> MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected
> problems.
>
> Try a clean installation of 1.4.1 and double check that 1.3.3 is not in
> your path/lib_path any longer.
>
> -- Josh
>
> > I tried the same thing with version 1.3.3 and it works perfectly.
> >
> > Any idea why?
> >
> > thanks,
> > Andreea