It is always a good idea to have your application's sizeof(INTEGER) match the
MPI library's sizeof(INTEGER). A mismatch is a recipe for trouble.
Meaning: if you're compiling your app with -make-integer-be-8-bytes, then you
should configure/build Open MPI with that same flag.
I'm thinking that this should *only* affect the back-end behavior of
MPI_INTEGER; the size of address pointers and whatnot should not be affected
(unless -make-integer-be-8-bytes also changes the sizes of some other types).
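
For example, something along these lines (the prefix and file name are only
illustrative, assuming gfortran) keeps the two in sync:

  ./configure --prefix=$HOME/openmpi-i8 CC=gcc CXX=g++ F77=gfortran FC=gfortran \
      FFLAGS="-fdefault-integer-8" FCFLAGS="-fdefault-integer-8"
  make all install
  # then compile the application through the wrapper with the same flag
  $HOME/openmpi-i8/bin/mpif77 -fdefault-integer-8 -c your_app.f

That way the INTEGERs your code hands to MPI_SEND are the same size as the
INTEGERs the library's Fortran bindings were built to expect.
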
On Dec 5, 2010, at 9:01 PM, Gustavo Correa wrote:
> Hi Benjamin
>
> I guess you could compile OpenMPI with standard integer and real sizes.
> Then compile your application (DRAGON) with the flags to change to 8-byte
> integers and 8-byte reals.
> We have some programs here that use real*8 and are compiled this way,
> and run without a problem.
> I guess this is what Tim Prince was also telling you in his comments.
>
> You can pass those flags to the MPI compiler wrappers (mpif77 etc),
> which will relay them to gfortran when you compile DRAGON.
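>
> For instance (just a sketch; the file names are placeholders):
>
>   mpif77 -fdefault-integer-8 -fdefault-real-8 -c dragon_module.f
>   mpif77 -fdefault-integer-8 -fdefault-real-8 -o dragon dragon_module.o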
>
> I am not even sure if those flags would be accepted or ignored by OpenMPI
> when you build it.
> I guess they will be ignored.
> You could check this out by looking at the MPI type sizes in your header
> files in the include directory and subdirectories.
>
> Maybe an OpenMPI developer could shed some light here.
>
> Moreover, if I remember right,
> the MPI address type follows the machine architecture,
> i.e., 32 bits if your machine is 32-bit, 64 bits if the machine is 64-bit,
> and you don't need to force it to be 8 bytes with compilation flags.
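>
> For example (a sketch; buf and mpierr are placeholders, with mpif.h included),
> the address kind already matches the native pointer size:
>
>   integer(kind=MPI_ADDRESS_KIND) addr
>   call MPI_Get_address(buf, addr, mpierr)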
>
> Unfortunately, mixing pointers ("Cray pointers", I suppose)
> with integers is a common source of headaches, if DRAGON does this.
> It is yet another situation where negative integers could crop up
> and lead to a segmentation fault.
> At least one ocean circulation model we run here had
> many problems because of this mix of integers and (Cray) pointers
> spread all across the code.
>
> Gus Correa
>
> On Dec 5, 2010, at 7:17 PM, Benjamin Toueg wrote:
>
>> Unfortunately DRAGON is old FORTRAN77. Integers have been used instead of
>> pointers. If I compile it in 64-bit mode without -fdefault-integer-8, the
>> so-called pointers will remain 32 bits. Problems could also arise from its
>> data structure handlers.
>>
>> Therefore -fdefault-integer-8 is absolutely necessary.
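>>
>> To illustrate what I mean (a hypothetical sketch, not actual DRAGON code;
>> LOC is a common compiler extension that returns an address):
>>
>>       INTEGER IADDR
>>       DOUBLE PRECISION WORK(1000)
>> C     store the address of WORK in a plain INTEGER, as old F77 codes do
>>       IADDR = LOC(WORK)
>>
>> With default 4-byte INTEGERs on a 64-bit machine the upper half of that
>> address is lost; with -fdefault-integer-8 the full address fits.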
>>
>> Furthermore, MPI_SEND and MPI_RECV are called a dozen times in only one
>> source file (used for passing a data structure from one node to another) and
>> it has proved to work in every situation.
>>
>> Not knowing which line is causing my segfault is annoying.
>>
>> Regards,
>> Benjamin
>>
>> 2010/12/6 Gustavo Correa <[email protected]>
>> Hi Benjamin
>>
>> I would just rebuild OpenMPI withOUT the compiler flags that change the
>> standard sizes of "int" and "float" (do a "make distclean" first!), then
>> recompile your program, and see how it goes.
>> I don't think you are gaining anything by trying to change the standard
>> "int/integer" and "real/float" sizes, and most likely it is inviting
>> trouble and making things more confusing.
>> Worst case, you will at least be sure that the bug is somewhere else,
>> not in a mismatch of basic type sizes.
>>
>> If you need to pass 8-byte real buffers, use MPI_DOUBLE_PRECISION, or
>> MPI_REAL8
>> in your (Fortran) MPI calls, and declare them in the Fortran code accordingly
>> (double precision or real(kind=8)).
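>>
>> A minimal sketch of what I mean (count, dest and tag are just placeholders):
>>
>>       include 'mpif.h'
>>       double precision buf(1000)
>>       integer count, dest, tag, mpierr
>>       count = 1000
>>       dest = 1
>>       tag = 0
>>       call MPI_Send(buf, count, MPI_DOUBLE_PRECISION, dest, tag, MPI_COMM_WORLD, mpierr)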
>>
>> If I remember right, there is no 8-byte integer support in the Fortran MPI
>> bindings,
>> only in the C bindings, but some OpenMPI expert could clarify this.
>> Hence, if you are passing 8-byte integers in your MPI calls this may also be
>> problematic.
>>
>> My two cents,
>> Gus Correa
>>
>> On Dec 5, 2010, at 3:04 PM, Benjamin Toueg wrote:
>>
>>> Hi,
>>>
>>> First of all thanks for your insight !
>>>
>>> Do you get a corefile?
>>> I don't get a core file, but I get a file called _FIL001. It doesn't
>>> contain any debugging symbols. It's most likely a digested version of the
>>> input file given to the executable: ./myexec < inputfile.
>>>
>>> there's no line numbers printed in the stack trace
>>> I would love to see those, but even if I configure openmpi with
>>> --enable-debug --enable-mem-debug --enable-mem-profile, they don't show up.
>>> I recompiled my sources to be sure to properly link them to the newly
>>> debugged version of openmpi. I assumed I didn't need to compile my own
>>> sources with the -g option since it crashes in openmpi itself? I didn't try
>>> to run mpiexec via gdb either; I guess it won't help since I already get the trace.
>>>
>>> the -fdefault-integer-8 option ought to be highly dangerous
>>> Thanks for pointing that out. Indeed I had some issues with this option. For
>>> instance I have to declare some arguments as INTEGER*4, like RANK, SIZE, IERR in:
>>> CALL MPI_COMM_RANK(MPI_COMM_WORLD,RANK,IERR)
>>> CALL MPI_COMM_SIZE(MPI_COMM_WORLD,SIZE,IERR)
>>> In your example "call MPI_Send(buf, count, MPI_INTEGER, dest, tag,
>>> MPI_COMM_WORLD, mpierr)" I checked that count is never bigger than 2000 (as
>>> you mentioned it could flip to negative). However I haven't declared it
>>> as INTEGER*4 and I think I should.
>>> When I said "I had to raise the number of data structures to be sent", I
>>> meant that I had to call MPI_SEND many more times, not that buffers were
>>> bigger than before.
>>>
>>> I'll get back to you with more info when I'm able to fix my connection
>>> problem to the cluster...
>>>
>>> Thanks,
>>> Benjamin
>>>
>>> 2010/12/3 Martin Siegert <[email protected]>
>>> Hi All,
>>>
>>> just to expand on this guess ...
>>>
>>> On Thu, Dec 02, 2010 at 05:40:53PM -0500, Gus Correa wrote:
>>>> Hi All
>>>>
>>>> I wonder if configuring OpenMPI while
>>>> forcing the default types to non-default values
>>>> (-fdefault-integer-8 -fdefault-real-8) might have
>>>> something to do with the segmentation fault.
>>>> Would this be effective, i.e., actually make the
>>>> sizes of MPI_INTEGER/MPI_INT and MPI_REAL/MPI_FLOAT bigger,
>>>> or just elusive?
>>>
>>> I believe what happens is that this mostly affects the Fortran
>>> wrapper routines and the way Fortran variables are mapped to C:
>>>
>>> MPI_INTEGER -> MPI_LONG
>>> MPI_REAL -> MPI_DOUBLE
>>> MPI_DOUBLE_PRECISION -> MPI_DOUBLE
>>>
>>> In that respect I believe that the -fdefault-real-8 option is harmless,
>>> i.e., it does the expected thing.
>>> But the -fdefault-integer-8 option ought to be highly dangerous:
>>> It works for integer variables that are used as "buffer" arguments
>>> in MPI statements, but I would assume that this does not work for
>>> "count" and similar arguments.
>>> Example:
>>>
>>> include 'mpif.h'
>>> integer, allocatable :: buf(:,:)
>>> integer i, i2, count, dest, tag, mpierr
>>>
>>> i = 32768
>>> i2 = 2*i
>>> allocate(buf(i,i2))
>>> count = i*i2              ! 32768 * 65536 = 2**31
>>> buf = 1
>>> dest = 1
>>> tag = 0
>>> call MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, mpierr)
>>>
>>> Now count is 2^31, which overflows a 32-bit integer.
>>> The MPI standard requires that count is a 32-bit integer, correct?
>>> Thus while buf gets the type MPI_LONG, count remains an int.
>>> Is this interpretation correct? If it is, then you are calling
>>> MPI_Send with a count argument of -2147483648,
>>> which could result in a segmentation fault.
>>>
>>> Cheers,
>>> Martin
>>>
>>> --
>>> Martin Siegert
>>> Head, Research Computing
>>> WestGrid/ComputeCanada Site Lead
>>> IT Services phone: 778 782-4691
>>> Simon Fraser University fax: 778 782-4242
>>> Burnaby, British Columbia email: [email protected]
>>> Canada V5A 1S6
>>>
>>>> There were some recent discussions here about MPI
>>>> limiting counts to MPI_INTEGER.
>>>> Since Benjamin said he "had to raise the number of data structures",
>>>> which eventually led to the error,
>>>> I wonder if he is inadvertently flipping to the negative integer
>>>> side of the 32-bit universe (i.e. >= 2**31), as was reported here by
>>>> other list subscribers a few times.
>>>>
>>>> Anyway, segmentation fault can come from many different places,
>>>> this is just a guess.
>>>>
>>>> Gus Correa
>>>>
>>>> Jeff Squyres wrote:
>>>>> Do you get a corefile?
>>>>>
>>>>> It looks like you're calling MPI_RECV in Fortran and then it segv's.
>>>>> This is *likely* because you're either passing a bad parameter or your
>>>>> buffer isn't big enough. Can you double check all your parameters?
>>>>>
>>>>> Unfortunately, there's no line numbers printed in the stack trace, so
>>>>> it's not possible to tell exactly where in the ob1 PML it's dying (i.e.,
>>>>> so we can't see exactly what it's doing to cause the segv).
>>>>>
>>>>>
>>>>>
>>>>> On Dec 2, 2010, at 9:36 AM, Benjamin Toueg wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am using DRAGON, a neutronics simulation code in FORTRAN77 that has its
>>>>>> own data structures. I added a module to send these data structures
>>>>>> via MPI_SEND / MPI_RECV, and everything worked perfectly for a while.
>>>>>>
>>>>>> Then I had to raise the number of data structures to be sent, up to a
>>>>>> point where my cluster hits this bug:
>>>>>> *** Process received signal ***
>>>>>> Signal: Segmentation fault (11)
>>>>>> Signal code: Address not mapped (1)
>>>>>> Failing at address: 0x2c2579fc0
>>>>>> [ 0] /lib/libpthread.so.0 [0x7f52d2930410]
>>>>>> [ 1] /home/toueg/openmpi/lib/openmpi/mca_pml_ob1.so [0x7f52d153fe03]
>>>>>> [ 2] /home/toueg/openmpi/lib/libmpi.so.0(PMPI_Recv+0x2d2)
>>>>>> [0x7f52d3504a1e]
>>>>>> [ 3] /home/toueg/openmpi/lib/libmpi_f77.so.0(pmpi_recv_+0x10e)
>>>>>> [0x7f52d36cf9c6]
>>>>>>
>>>>>> How can I make this error more explicit?
>>>>>>
>>>>>> I use the following configuration of openmpi-1.4.3:
>>>>>> ./configure --enable-debug --prefix=/home/toueg/openmpi CXX=g++ CC=gcc
>>>>>> F77=gfortran FC=gfortran FLAGS="-m64 -fdefault-integer-8
>>>>>> -fdefault-real-8 -fdefault-double-8" FCFLAGS="-m64 -fdefault-integer-8
>>>>>> -fdefault-real-8 -fdefault-double-8" --disable-mpi-f90
>>>>>>
>>>>>> Here is the output of mpif77 -v:
>>>>>> mpif77 for 1.2.7 (release) of : 2005/11/04 11:54:51
>>>>>> Driving: f77 -L/usr/lib/mpich-mpd/lib -v -lmpich-p4mpd -lpthread -lrt
>>>>>> -lfrtbegin -lg2c -lm -shared-libgcc
>>>>>> Reading specs from /usr/lib/gcc/x86_64-linux-gnu/3.4.6/specs
>>>>>> Configured with: ../src/configure -v --enable-languages=c,c++,f77,pascal
>>>>>> --prefix=/usr --libexecdir=/usr/lib
>>>>>> --with-gxx-include-dir=/usr/include/c++/3.4 --enable-shared
>>>>>> --with-system-zlib --enable-nls --without-included-gettext
>>>>>> --program-suffix=-3.4 --enable-__cxa_atexit --enable-clocale=gnu
>>>>>> --enable-libstdcxx-debug x86_64-linux-gnu
>>>>>> Thread model: posix
>>>>>> gcc version 3.4.6 (Debian 3.4.6-5)
>>>>>> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/collect2 --eh-frame-hdr -m
>>>>>> elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2
>>>>>> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crt1.o
>>>>>> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crti.o
>>>>>> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/crtbegin.o -L/usr/lib/mpich-mpd/lib
>>>>>> -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6
>>>>>> -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6
>>>>>> -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib
>>>>>> -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../.. -L/lib/../lib
>>>>>> -L/usr/lib/../lib -lmpich-p4mpd -lpthread -lrt -lfrtbegin -lg2c -lm
>>>>>> -lgcc_s -lgcc -lc -lgcc_s -lgcc
>>>>>> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/crtend.o
>>>>>> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crtn.o
>>>>>> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/libfrtbegin.a(frtbegin.o):
>>>>>> In function `main':
>>>>>> (.text+0x1e): undefined reference to `MAIN__'
>>>>>> collect2: ld returned 1 exit status
>>>>>>
>>>>>> Thanks,
>>>>>> Benjamin
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>
>
--
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/