Re: [OMPI users] import/export issues on Windows
Hi Ben,

Sorry for the late response. The preprocessor problem is solved now; I don't know why the Intel compiler doesn't accept that kind of definition. But if you use the latest trunk, it should work.

I'm working on the import/export problem and trying to fix it with a better mechanism. I'll let you know when it's ready.

Thanks,
Shiqing

On 2010-4-19 11:00 AM, ben.kupp...@shell.com wrote:

Shiqing,

I am having more import/export issues once I start using the openmpi binaries that I built with the Microsoft compiler. I get unresolved symbol errors for MPI::Comm::Comm and for MPI::Datatype::Free when I link our own program. The C functions MPI_Comm_create and MPI_Type_free are exported but the C++ equivalents apparently are not. Our source code builds and runs without issues with the Linux version of openmpi.

Do you have any suggestions?

-Ben

From: Shiqing Fan [mailto:f...@hlrs.de]
Sent: Friday, April 16, 2010 10:59 AM
To: Open MPI Users
Cc: Kuppers, Ben SIEP-PTT/SDRM
Subject: Re: [OMPI users] import/export issues on Windows

Hi Ben,

> I believe changing OMPI_DECLSPEC to __declspec(dllexport) inside functions.h
> will allow the cxx module to build (and export the function) but will break
> any client using (and thus trying to import) it. OMPI_DECLSPEC should only be
> defined as __declspec(dllexport) while compiling the cxx module (say when
> libmpi_cxx_EXPORTS is defined).

Yes, as long as there are more functions to export, they have to be defined in that way. I don't see any option for the Intel compiler to manage this automatically.

> BTW, I also noticed that the Intel compiler has issues with the preprocessor
> definitions for ompi_info "OMPI_CONFIGURE_DATE=\"03:18 PM Wed 04/14/2010 \""
> and "OMPI_BUILD_DATE=\"03:18 PM Wed 04/14/2010 \"". The quotes around the
> definitions throw it off completely. Is that something that CMake does or do
> you instruct CMake to do this? Both the Intel and Microsoft compiler work
> correctly without them.

In which project did you see those preprocessor definitions? I don't see them here. Actually, they are not used as preprocessor definitions anywhere in the solution; they are only cached variables in CMake. Could you please try a clean configuration with CMake and see if they still exist?

Thanks,
Shiqing

> Thanks,
> Ben

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Shiqing Fan                      http://www.hlrs.de/people/fan
High Performance Computing       Tel.: +49 711 685 87234
Center Stuttgart (HLRS)          Fax.: +49 711 685 65832
Address: Allmandring 30          email: f...@hlrs.de
70569 Stuttgart
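
The export/import switch discussed in this thread (export while building the library, import when a client includes the header) is commonly written as a conditional macro. The following is only a minimal sketch of that general pattern, not Open MPI's actual header: the names DEMO_DECLSPEC, demo_EXPORTS and demo_add are hypothetical stand-ins for OMPI_DECLSPEC, libmpi_cxx_EXPORTS and a library function.

```c
/* Hypothetical stand-in: defined only while compiling the library itself.
 * A client program that merely includes the header would NOT define it. */
#define demo_EXPORTS 1

#if defined(_WIN32)
#  if defined(demo_EXPORTS)
#    define DEMO_DECLSPEC __declspec(dllexport)  /* building the DLL */
#  else
#    define DEMO_DECLSPEC __declspec(dllimport)  /* linking against the DLL */
#  endif
#else
#  define DEMO_DECLSPEC                          /* expands to nothing elsewhere */
#endif

/* Declaration as it would appear in the shared header (hypothetical). */
DEMO_DECLSPEC int demo_add(int a, int b);

/* Definition, compiled only into the library. */
int demo_add(int a, int b)
{
    return a + b;
}
```

With this arrangement the same header serves both sides: the build system defines the export macro only for the library target, so clients get the import form without editing the header.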
Re: [OMPI users] import/export issues on Windows
Thank you Shiqing.

-Ben

From: Shiqing Fan [mailto:f...@hlrs.de]
Sent: Wednesday, April 21, 2010 11:30 AM
To: Open MPI Users
Cc: Kuppers, Ben SIEP-PTT/SDRM
Subject: Re: [OMPI users] import/export issues on Windows

Hi Ben,

Sorry for the late response. The preprocessor problem is solved now; I don't know why the Intel compiler doesn't accept that kind of definition. But if you use the latest trunk, it should work.

I'm working on the import/export problem and trying to fix it with a better mechanism. I'll let you know when it's ready.

Thanks,
Shiqing

On 2010-4-19 11:00 AM, ben.kupp...@shell.com wrote:

Shiqing,

I am having more import/export issues once I start using the openmpi binaries that I built with the Microsoft compiler. I get unresolved symbol errors for MPI::Comm::Comm and for MPI::Datatype::Free when I link our own program. The C functions MPI_Comm_create and MPI_Type_free are exported but the C++ equivalents apparently are not. Our source code builds and runs without issues with the Linux version of openmpi.

Do you have any suggestions?

-Ben
[OMPI users] Totalview ( tvscript ) & Open MPI problem with memory debugging
( For information only )

I reported a bug which stops Totalview saving .mdbg files using tvscript.

Versions:

  mpirun -V
  mpirun (Open MPI) 1.4

  totalview -v
  Linux x86 TotalView 8.8.0-0

This is logged ( by Totalview ) as CR 12192 - Tvscript fails to save mdbg file when specified with line number action point during OpenMPI debugging.

Running this script ( under Sun grid engine ) -

  tvscript -verbosity=info -memory_debugging \
    -create_actionpoint "mpiPi.F#92=>save_memory_debugging_file" \
    -stdin ../_pi_ -stdout mpi_pi.log \
    -mpi "Open MPI" -np $NSLOTS $XE/mpiPi

- gives an error ( for each process ):

  Thread 1.1 hit breakpoint 3 at line 92 in "MAIN"
  ERROR: Failed call to action handler with command 'memory_action_save_memory_debugging_file {options {} actionpoint_id 3 thread_id 1.1 event actionpoint actionpoint_source_loc_expr 92 process_id 1}' (expected integer but got "1190@145.239.47.142")

( & no .mdbg files.. )

Non-memory options ( eg traceback ) appear to work OK.

With Totalview 8.6 & OpenMPI 1.3.3 I got

  ERROR: Unknown event while trying to notify handlers
  Killed slave process 4, named "mpiPi"

when trying to launch the processes, so 8.8/1.4 is a definite improvement..

Jim Conboy ( Culham Centre for Fusion Energy )
Re: [OMPI users] OS X - Can't find the absoft directory
On Apr 20, 2010, at 7:03 PM, Paul Cizmas wrote:

> Is it possible to have two openmpi-s on the same computer?

Yes. As an OMPI developer, I have dozens of different OMPI installs on my cluster (for various stages of development and testing, etc.). The only real issue is to ensure that you set your PATH (and possibly LD_LIBRARY_PATH) correctly to point to the one that you want. If you use the --enable-mpirun-prefix-by-default configure option, then you don't need to worry about paths on the remote nodes.

> I have openmpi 1.3.2 working fine with gfortran but I cannot build openmpi
> 1.4.1 with Absoft - I get this message from libtool:
>
> /bin/sh ../../../libtool --mode=compile /Applications/Absoft11.0/bin/
> f90 -I../../../ompi/include -I../../../ompi/include -p. -I. -I../../../
> ompi/mpi/f90 -lU77 -c -o mpi.lo mpi.f90
> libtool: compile: /Applications/Absoft11.0/bin/f90 -I../../../ompi/
> include -I../../../ompi/include -p. -I. -I../../../ompi/mpi/f90 -lU77 -
> c mpi.f90 -o .libs/mpi.o
> Can't find the absoft directory.
> Please set the ABSOFT environment variable and try again.
> make[4]: *** [mpi.lo] Error 1
>
> Note that ABSOFT is properly set as in fact shown above on the first
> line. In addition, the absolute address of the f90 (/Applications/
> Absoft11.0/bin/f90) is correct.
>
> To recreate the problem I went to folder openmpi-1.4.1/ompi/mpi/f90,
> checked again ABSOFT variable and called libtool. The result is
> obviously the same:
>
> sudo /bin/sh ../../../libtool --mode=compile /Applications/
> Absoft11.0/bin/f90 -I../../../ompi/include -I../../../ompi/include -p.
> -I. -I../../../ompi/mpi/f90 -lU77 -c -o mpi.lo mpi.f90
> Password:

Why are you sudo'ing here?

We just had another user on the list have a problem compiling because they were doing "sudo make all" instead of just "make all". I'm not sure what exactly happened, but "sudo make all" apparently had some weird side effects whereas "make all" did not.

FWIW: Most users compile Open MPI as a non-privileged user and then only use sudo to "make install".

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
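
The thread does not establish why "sudo" changes the outcome. One thing that may be worth ruling out, purely as an assumption, is that ABSOFT is no longer visible in the environment of the process that actually runs the compiler (sudo is commonly configured to reset the environment). A minimal sketch of such a check follows; env_is_set is a hypothetical helper, not part of Open MPI or Absoft.

```c
#include <stdlib.h>

/* Hypothetical helper: returns 1 if the named environment variable is
 * set to a non-empty value in the current process, 0 otherwise. */
int env_is_set(const char *name)
{
    const char *value = getenv(name);
    return (value != NULL && value[0] != '\0') ? 1 : 0;
}
```

Calling env_is_set("ABSOFT") from a small program run once directly and once under sudo would show whether the variable survives in both cases.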
Re: [OMPI users] OS X - Can't find the absoft directory
Jeff:

Thank you!

Paul

On Apr 21, 2010, at 7:41 AM, Jeff Squyres wrote:

> On Apr 20, 2010, at 7:03 PM, Paul Cizmas wrote:
>
>> Is it possible to have two openmpi-s on the same computer?
>
> Yes. As an OMPI developer, I have dozens of different OMPI installs on my
> cluster (for various stages of development and testing, etc.). The only real
> issue is to ensure that you set your PATH (and possibly LD_LIBRARY_PATH)
> correctly to point to the one that you want. If you use the
> --enable-mpirun-prefix-by-default configure option, then you don't need to
> worry about paths on the remote nodes.
>
>> I have openmpi 1.3.2 working fine with gfortran but I cannot build openmpi
>> 1.4.1 with Absoft - I get this message from libtool:
>>
>> /bin/sh ../../../libtool --mode=compile /Applications/Absoft11.0/bin/
>> f90 -I../../../ompi/include -I../../../ompi/include -p. -I. -I../../../
>> ompi/mpi/f90 -lU77 -c -o mpi.lo mpi.f90
>> libtool: compile: /Applications/Absoft11.0/bin/f90 -I../../../ompi/
>> include -I../../../ompi/include -p. -I. -I../../../ompi/mpi/f90 -lU77 -
>> c mpi.f90 -o .libs/mpi.o
>> Can't find the absoft directory.
>> Please set the ABSOFT environment variable and try again.
>> make[4]: *** [mpi.lo] Error 1
>>
>> Note that ABSOFT is properly set as in fact shown above on the first
>> line. In addition, the absolute address of the f90 (/Applications/
>> Absoft11.0/bin/f90) is correct.
>>
>> To recreate the problem I went to folder openmpi-1.4.1/ompi/mpi/f90,
>> checked again ABSOFT variable and called libtool. The result is
>> obviously the same:
>>
>> sudo /bin/sh ../../../libtool --mode=compile /Applications/
>> Absoft11.0/bin/f90 -I../../../ompi/include -I../../../ompi/include -p.
>> -I. -I../../../ompi/mpi/f90 -lU77 -c -o mpi.lo mpi.f90
>> Password:
>
> Why are you sudo'ing here?
>
> We just had another user on the list have a problem compiling because they
> were doing "sudo make all" instead of just "make all". I'm not sure what
> exactly happened, but "sudo make all" apparently had some weird side effects
> whereas "make all" did not.
>
> FWIW: Most users compile Open MPI as a non-privileged user and then only use
> sudo to "make install".
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] 'readv failed: Connection timed out' issue
On Apr 20, 2010, at 8:55 AM, Jonathan Dursi wrote:

> We've got OpenMPI 1.4.1 and Intel MPI running on our 3000 node system. We
> like OpenMPI for large jobs, because the startup time is much faster (and
> startup is more reliable) than the current defaults with IntelMPI; but we're
> having some pretty serious problems when the jobs are actually running.
> When running medium- to large- sized jobs (say, anything over 500 cores) over
> ethernet using OpenMPI, several of our users, using a variety of very
> different sorts of codes, report errors like this:
>
> [gpc-f102n010][[30331,1],212][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

That's odd -- this error message indicates that the TCP BTL had previously successfully established the connection and was trying to receive an MPI message on the socket. But then reading from the socket timed out. Hmm.

> which sometimes hang the job, or sometimes kill it outright:
>
> [gpc-f114n073][[23186,1],109][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> [gpc-f114n075][[23186,1],125][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

This one is a little different -- it indicates that n075 saw its peer hang up the socket, which it determined to be a fatal error and therefore aborted. (sidenote: I'm not sure why we don't judge a connection timeout to be the same fatal error... hmmm...)

It's likely that n073 saw the timeout, closed the socket, and then n075 saw that hangup. It might not happen in all scenarios, because n075 might not see the hangup unless it's actively trying to read something from that peer's socket.

Regardless, the real question is: why is the socket timing out?

> Unfortunately, this only happens intermittently, and only with large jobs, so
> it is hard to track down. It seems to happen more reliably with larger
> numbers of processors, but I don't know if that tells us something real about
> the issue, or just that larger N -> better statistics. For one user's
> code, it definitely occurs during an MPI_Wait (this particular code has been
> run on a wide variety of machines with a wide variety of MPIs -- which isn't
> proof of correctness of course, but everything looks fine), for others it is
> less clear.

I think it's reasonable to see this in MPI_Wait -- it means that OMPI was notified that there was something to read on a particular socket file descriptor and was trying to read it (and then timed out). I'll bet that the others all died in some kind of communication with a specific peer (regardless of whether it was in a collective or point-to-point communication call).

> I don't know if it's an OpenMPI issue, or just represents a network issue
> which Intel's MPI happens to be more tolerant of with the default set of
> parameters. It's also unclear whether or not this issue occurred with
> earlier OpenMPI versions.
>
> Where should I start looking to find out what is going on? Are there
> parameters that can be adjusted to play with timeouts to see if the issue can
> be localized, or worked around?

Can you see if there are any kernel parameters to adjust how many fd's you can have open simultaneously, and the length of TCP socket timeouts?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
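
For the first half of that question (how many file descriptors a process may have open), the per-process limit can be read with the POSIX getrlimit call. The sketch below is only an illustration under that assumption; query_fd_limits is a hypothetical helper name, and it covers neither system-wide limits nor the TCP timeout settings, which are configured separately in the kernel.

```c
#include <sys/resource.h>

/* Hypothetical helper: reads the soft and hard per-process limits on
 * open file descriptors (RLIMIT_NOFILE).
 * Returns 0 on success, -1 if getrlimit() fails. */
int query_fd_limits(unsigned long long *soft, unsigned long long *hard)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }
    *soft = (unsigned long long)rl.rlim_cur;  /* limit currently enforced */
    *hard = (unsigned long long)rl.rlim_max;  /* ceiling for the soft limit */
    return 0;
}
```

Running such a check inside an MPI rank, rather than in a login shell, shows the limit the job actually inherits, which may differ from the interactive value.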
Re: [OMPI users] unresolved symbol mca_base_param_reg_int
On Tue, 2010-04-20 at 20:22 -0400, Jeff Squyres wrote:
> On Apr 20, 2010, at 6:16 PM, Nev wrote:
>
> > Hi Jeff,
> > I did the install to the "same place". I always use /opt/openmpi, the
> > procedure I use when building is
> > configure --prefix=/opt/openmpi ...
> > rm -r /opt/openmpi/*
> > make clean
> > make all
> > make install
> > is this sufficient to un-install previous version, or is more required.
>
> Yes, that should be sufficient. Is that what you did this time?
>
> If so, is there any way you can provide a small code example of the problem
> you're seeing?
>

OK, I will attempt to reduce it to a minimal code set, but will not be able to do so until the weekend.