Re: [OMPI users] import/export issues on Windows

2010-04-21 Thread Shiqing Fan


Hi Ben,

Sorry for the late response.

The preprocessor problem is solved now. I don't know why the Intel compiler 
doesn't accept that kind of definition, but if you use the latest 
trunk, it should work.


I'm working on the import/export problem, and trying to fix it with a 
better mechanism. I'll let you know when it's ready.



Thanks,
Shiqing

On 2010-4-19 11:00 AM, ben.kupp...@shell.com wrote:


Shiqing,

I am having more import/export issues once I start using the openmpi 
binaries that I built with the Microsoft compiler. I get unresolved 
symbol errors for MPI::Comm::Comm and for MPI::Datatype::Free when I 
link our own program. The C functions MPI_Comm_create and 
MPI_Type_free are exported but the C++ equivalents apparently are not. 
Our source code builds and runs without issues with the Linux version 
of openmpi.


Do you have any suggestions?

-Ben

*From:* Shiqing Fan [mailto:f...@hlrs.de]
*Sent:* Friday, April 16, 2010 10:59 AM
*To:* Open MPI Users
*Cc:* Kuppers, Ben SIEP-PTT/SDRM
*Subject:* Re: [OMPI users] import/export issues on Windows

Hi Ben,



I believe changing OMPI_DECLSPEC to __declspec(dllexport) inside 
functions.h will allow the cxx module to build (and export the 
function) but will break any client using (and thus trying to import) 
it. OMPI_DECLSPEC should only be defined as __declspec(dllexport) 
while compiling the cxx module (say when libmpi_cxx_EXPORTS is defined).


Yes, as long as there are more functions to export, they have to be 
defined in that way. I don't see any option for the Intel compiler to 
manage this automatically.



BTW, I also noticed that the Intel compiler has issues with the 
preprocessor definitions for ompi_info "OMPI_CONFIGURE_DATE=\"03:18 PM 
Wed 04/14/2010 \"" and "OMPI_BUILD_DATE=\"03:18 PM Wed 04/14/2010 \"". 
The quotes around the definitions throw it off completely. Is that 
something that CMake does or do you instruct CMake to do this? Both the 
Intel and Microsoft compilers work correctly without them.


In which project did you see those preprocessor definitions? I don't see 
them here. They are not actually used as preprocessor definitions anywhere 
in the solution; they are only cached variables in CMake. Could you please 
try a clean configuration with CMake and see if they still exist?



Thanks,
Shiqing


Thanks,

Ben


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Shiqing Fan                      http://www.hlrs.de/people/fan
High Performance Computing       Tel.: +49 711 685 87234
  Center Stuttgart (HLRS)        Fax.: +49 711 685 65832
Address: Allmandring 30          email: f...@hlrs.de
         70569 Stuttgart


Re: [OMPI users] import/export issues on Windows

2010-04-21 Thread Ben.Kuppers
Thank you Shiqing.

-Ben

 

From: Shiqing Fan [mailto:f...@hlrs.de] 
Sent: Wednesday, April 21, 2010 11:30 AM
To: Open MPI Users
Cc: Kuppers, Ben SIEP-PTT/SDRM
Subject: Re: [OMPI users] import/export issues on Windows

 


Hi Ben,

Sorry for the late response.

The preprocessor problem is solved now. I don't know why the Intel compiler
doesn't accept that kind of definition, but if you use the latest
trunk, it should work.

I'm working on the import/export problem, and trying to fix it with a
better mechanism. I'll let you know when it's ready.


Thanks,
Shiqing

On 2010-4-19 11:00 AM, ben.kupp...@shell.com wrote: 

Shiqing,

 

I am having more import/export issues once I start using the openmpi
binaries that I built with the Microsoft compiler. I get unresolved
symbol errors for MPI::Comm::Comm and for MPI::Datatype::Free when I
link our own program. The C functions MPI_Comm_create and MPI_Type_free
are exported but the C++ equivalents apparently are not. Our source code
builds and runs without issues with the Linux version of openmpi.

 

Do you have any suggestions?

 

-Ben



[OMPI users] Totalview ( tvscript ) & Open MPI problem with memory debugging

2010-04-21 Thread Conboy, James
( For information only )

I reported a bug that stops TotalView from saving .mdbg files when using
tvscript:

mpirun -V       mpirun (Open MPI) 1.4
totalview -v    Linux x86 TotalView 8.8.0-0

This is logged (by TotalView) as

CR 12192 - Tvscript fails to save mdbg file when specified with line
number action point during OpenMPI debugging

Running this script (under Sun Grid Engine):

 tvscript -verbosity=info -memory_debugging \
   -create_actionpoint "mpiPi.F#92=>save_memory_debugging_file" \
   -stdin ../_pi_ -stdout mpi_pi.log  \
   -mpi "Open MPI" -np $NSLOTS $XE/mpiPi

gives an error (for each process):

 Thread 1.1 hit breakpoint 3 at line 92 in "MAIN"

 ERROR:  Failed call to action handler with command 
 'memory_action_save_memory_debugging_file {options {} actionpoint_id 3 
 thread_id 1.1 event actionpoint actionpoint_source_loc_expr 92 
 process_id 1}' (expected integer but got "1190@145.239.47.142")

 (and no .mdbg files).  Non-memory options (e.g. traceback) appear to
work OK.


With Totalview 8.6 & OpenMPI 1.3.3 I got

ERROR:  Unknown event  while trying to notify handlers
Killed slave process 4, named "mpiPi"

when trying to launch the processes, so 8.8/1.4 is a definite
improvement..


Jim Conboy  ( Culham Centre for Fusion Energy )



Re: [OMPI users] OS X - Can't find the absoft directory

2010-04-21 Thread Jeff Squyres
On Apr 20, 2010, at 7:03 PM, Paul Cizmas wrote:

> Is it possible to have two openmpi-s on the same computer?  

Yes.  As an OMPI developer, I have dozens of different OMPI installs on my 
cluster (for various stages of development and testing, etc.).  The only real 
issue is to ensure that you set your PATH (and possibly LD_LIBRARY_PATH) 
correctly to point to the one that you want.  If you use the 
--enable-mpirun-prefix-by-default configure option, then you don't need to 
worry about paths on the remote nodes.
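
As a minimal sketch of that PATH switching, assuming two installs under made-up prefixes (the two /opt paths and the use_ompi helper below are hypothetical names, not anything Open MPI ships), something like this selects one install for the current shell:

```shell
# Hypothetical install prefixes; substitute wherever each build was installed.
OMPI_GFORTRAN=/opt/openmpi-1.3.2-gfortran
OMPI_ABSOFT=/opt/openmpi-1.4.1-absoft

# Put one install's bin/ and lib/ ahead of everything else in this shell.
# On OS X the dynamic loader reads DYLD_LIBRARY_PATH rather than
# LD_LIBRARY_PATH, so adjust the variable name there.
use_ompi() {
    prefix=$1
    PATH="$prefix/bin:$PATH"
    LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

use_ompi "$OMPI_ABSOFT"
```

After this, "command -v mpirun" should report the mpirun under the selected prefix, provided that install really exists there. Calling use_ompi again with the other prefix in a fresh shell switches installs.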

> I have 
> openmpi 1.3.2 working fine with gfortran but I cannot build openmpi 
> 1.4.1 with Absoft - I get this message from libtool:
> 
> /bin/sh ../../../libtool   --mode=compile /Applications/Absoft11.0/bin/
> f90 -I../../../ompi/include -I../../../ompi/include -p. -I. -I../../../
> ompi/mpi/f90  -lU77 -c -o mpi.lo mpi.f90
> libtool: compile:  /Applications/Absoft11.0/bin/f90 -I../../../ompi/
> include -I../../../ompi/include -p. -I. -I../../../ompi/mpi/f90 -lU77 -
> c mpi.f90  -o .libs/mpi.o
> Can't find the absoft directory.
> Please set the ABSOFT environment variable and try again.
> make[4]: *** [mpi.lo] Error 1
> 
> Note that ABSOFT is properly set as in fact shown above on the first 
> line.  In addition, the absolute address of the f90 (/Applications/
> Absoft11.0/bin/f90) is correct.
> 
> To recreate the problem I went to folder openmpi-1.4.1/ompi/mpi/f90, 
> checked again ABSOFT variable and called libtool.  The result is 
> obviously the same:
> 
> sudo /bin/sh ../../../libtool   --mode=compile /Applications/
> Absoft11.0/bin/f90 -I../../../ompi/include -I../../../ompi/include -p. 
> -I. -I../../../ompi/mpi/f90  -lU77 -c -o mpi.lo mpi.f90
> Password:

Why are you sudo'ing here?

We just had another user on the list have a problem compiling because they were 
doing "sudo make all" instead of just "make all".  I'm not sure what exactly 
happened, but "sudo make all" apparently had some weird side effects whereas 
"make all" did not.  

FWIW: Most users compile Open MPI as a non-privileged user and then only use 
sudo to "make install".
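
A rough sketch of that workflow, assuming a standard configure/make tree (build_and_install, run_step and DRY_RUN are hypothetical names used only for illustration):

```shell
# Run one build step, or just print it when DRY_RUN=1.
run_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Configure and compile as the invoking user; elevate only for the install.
# $1 is the install prefix.
build_and_install() {
    if [ "$(id -u)" -eq 0 ]; then
        echo "warning: building as root; prefer an unprivileged account" >&2
    fi
    run_step ./configure --prefix="$1" &&
    run_step make all &&
    run_step sudo make install
}
```

Running "DRY_RUN=1 build_and_install /opt/openmpi" from the top of the source tree prints the three commands without executing them; only the last one uses sudo.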

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] OS X - Can't find the absoft directory

2010-04-21 Thread Paul Cizmas
Jeff:

Thank you!

Paul

On Apr 21, 2010, at 7:41 AM, Jeff Squyres wrote:

> On Apr 20, 2010, at 7:03 PM, Paul Cizmas wrote:
> 
>> Is it possible to have two openmpi-s on the same computer?  
> 
> Yes.  As an OMPI developer, I have dozens of different OMPI installs on my 
> cluster (for various stages of development and testing, etc.).  The only real 
> issue is to ensure that you set your PATH (and possibly LD_LIBRARY_PATH) 
> correctly to point to the one that you want.  If you use the 
> --enable-mpirun-prefix-by-default configure option, then you don't need to 
> worry about paths on the remote nodes.
> 
>> I have 
>> openmpi 1.3.2 working fine with gfortran but I cannot build openmpi 
>> 1.4.1 with Absoft - I get this message from libtool:
>> 
>> /bin/sh ../../../libtool   --mode=compile /Applications/Absoft11.0/bin/
>> f90 -I../../../ompi/include -I../../../ompi/include -p. -I. -I../../../
>> ompi/mpi/f90  -lU77 -c -o mpi.lo mpi.f90
>> libtool: compile:  /Applications/Absoft11.0/bin/f90 -I../../../ompi/
>> include -I../../../ompi/include -p. -I. -I../../../ompi/mpi/f90 -lU77 -
>> c mpi.f90  -o .libs/mpi.o
>> Can't find the absoft directory.
>> Please set the ABSOFT environment variable and try again.
>> make[4]: *** [mpi.lo] Error 1
>> 
>> Note that ABSOFT is properly set as in fact shown above on the first 
>> line.  In addition, the absolute address of the f90 (/Applications/
>> Absoft11.0/bin/f90) is correct.
>> 
>> To recreate the problem I went to folder openmpi-1.4.1/ompi/mpi/f90, 
>> checked again ABSOFT variable and called libtool.  The result is 
>> obviously the same:
>> 
>> sudo /bin/sh ../../../libtool   --mode=compile /Applications/
>> Absoft11.0/bin/f90 -I../../../ompi/include -I../../../ompi/include -p. 
>> -I. -I../../../ompi/mpi/f90  -lU77 -c -o mpi.lo mpi.f90
>> Password:
> 
> Why are you sudo'ing here?
> 
> We just had another user on the list have a problem compiling because they 
> were doing "sudo make all" instead of just "make all".  I'm not sure what 
> exactly happened, but "sudo make all" apparently had some weird side effects 
> whereas "make all" did not.  
> 
> FWIW: Most users compile Open MPI as a non-privileged user and then only use 
> sudo to "make install".
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] 'readv failed: Connection timed out' issue

2010-04-21 Thread Jeff Squyres
On Apr 20, 2010, at 8:55 AM, Jonathan Dursi wrote:

> We've got OpenMPI 1.4.1 and Intel MPI running on our 3000 node system.   We 
> like OpenMPI for large jobs, because the startup time is much faster (and 
> startup is more reliable) than the current defaults with IntelMPI; but we're 
> having some pretty serious problems when the jobs are actually running.   
> When running medium- to large- sized jobs (say, anything over 500 cores) over 
> ethernet using OpenMPI, several of our users, using a variety of very 
> different sorts of codes, report errors like this:
> 
> [gpc-f102n010][[30331,1],212][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

That's odd -- this error message indicates that the TCP BTL had previously 
successfully established the connection and was trying to receive an MPI 
message on the socket.  But then reading from the socket timed out.  Hmm.

> which sometimes hang the job, or sometimes kill it outright:
> 
> [gpc-f114n073][[23186,1],109][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> [gpc-f114n075][[23186,1],125][btl_tcp_frag.c:214:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

This one is a little different -- it indicates that n075 saw its peer hang up 
the socket, which it determined to be a fatal error and therefore aborted.  
(sidenote: I'm not sure why we don't judge a connection timeout to be the same 
fatal error... hmmm...)

It's likely that n073 saw the timeout, closed the socket, and then n075 saw 
that hangup.  It might not happen in all scenarios, because n075 might not see 
the hangup unless it's actively trying to read something from that peer's 
socket.

Regardless, the real question is: why is the socket timing out?

> Unfortunately, this only happens intermittently, and only with large jobs, so 
> it is hard to track down.  It seems to happen more reliably with larger 
> numbers of processors, but I don't know if that tells us something real about 
> the issue, or just that larger N -> better statistics.  For one user's 
> code, it definitely occurs during an MPI_Wait (this particular code has been 
> run on a wide variety of machines with a wide variety of MPIs -- which isn't 
> proof of correctness of course, but everything looks fine), for others it is 
> less clear.  

I think it's reasonable to see this in MPI_Wait -- it means that OMPI was 
notified that there was something to read on a particular socket file 
descriptor and was trying to read it (and then timed out).  I'll bet that the 
others all died in some kind of communication with a specific peer (regardless 
of whether it was in a collective or point-to-point communication call).  

> I don't know if it's an OpenMPI issue, or just represents a network issue 
> which Intel's MPI happens to be more tolerant of with the default set of  
> parameters.   It's also unclear whether or not this issue occurred with 
> earlier OpenMPI versions.
> 
> Where should I start looking to find out what is going on?   Are there 
> parameters that can be adjusted to play with timeouts to see if the issue can 
> be localized, or worked around?

Can you check whether there are any kernel parameters that adjust how many 
fd's you can have open simultaneously, and the length of TCP socket timeouts?
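
As a starting point on Linux, a sketch like the following prints the current values. The /proc/sys entries are standard Linux sysctls, but which of them (if any) is behind this particular timeout is only a guess, and show_limit is a hypothetical helper name:

```shell
# Print one sysctl-style value, or a placeholder if the file is unreadable.
# $1 = label, $2 = file to read
show_limit() {
    if [ -r "$2" ]; then
        printf '%-30s %s\n' "$1" "$(cat "$2")"
    else
        printf '%-30s %s\n' "$1" "(not available)"
    fi
}

# Per-process file descriptor limit for this shell.
printf '%-30s %s\n' "open files (ulimit -n)" "$(ulimit -n)"

# System-wide descriptor limit and a few TCP timeout settings.
show_limit "fs.file-max"                 /proc/sys/fs/file-max
show_limit "net.ipv4.tcp_retries2"       /proc/sys/net/ipv4/tcp_retries2
show_limit "net.ipv4.tcp_keepalive_time" /proc/sys/net/ipv4/tcp_keepalive_time
show_limit "net.ipv4.tcp_fin_timeout"    /proc/sys/net/ipv4/tcp_fin_timeout
```

tcp_retries2 bounds how many times an established connection retransmits unacknowledged data before the kernel gives up, so it may be worth comparing across the affected nodes.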

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] unresolved symbol mca_base_param_reg_int

2010-04-21 Thread Nev
On Tue, 2010-04-20 at 20:22 -0400, Jeff Squyres wrote:
> On Apr 20, 2010, at 6:16 PM, Nev wrote:
> 
> > Hi Jeff,
> > I did the install to the "same place". I always use /opt/openmpi, the
> > procedure I use when building is
> > configure --prefix=/opt/openmpi ...
> > rm -r /opt/openmpi/*
> > make clean
> > make all
> > make install
> > Is this sufficient to uninstall the previous version, or is more required?
> 
> Yes, that should be sufficient.  Is that what you did this time?  
> 
> If so, is there any way you can provide a small code example of the problem 
> you're seeing?
> 
OK, I will try to reduce this to a minimal code example, but will not be
able to do so until the weekend.