Re: [OMPI users] MPI_Send doesn't work if the data >= 2GB

2010-12-05 Thread 孟宪军
Hi,

On my computers (x86-64), sizeof(int) = 4, while sizeof(long) = sizeof(double) =
sizeof(size_t) = 8. When I checked my mpi.h file, the definitions related to int
looked correct; as far as I can tell, mpi.h was generated for my environment when
I compiled OpenMPI. So my code still doesn't work. :(

Furthermore, I found that collective routines (such as MPI_Allgatherv()), which
are implemented on top of the point-to-point layer, don't work either when the
data exceeds 2GB.
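
(As an aside: since the count argument of every MPI call in C is a signed int, a
common generic workaround for transfers of 2GB and more is to split them into
smaller pieces. The following sketch is illustrative only; it is not from the
original test program, the 1GB chunk size is arbitrary, and whether it also avoids
the hang reported above is a separate question.)

#include <stdlib.h>
#include <mpi.h>

/* Illustrative sketch only: move a large buffer in sub-2GB pieces so that
   every MPI count argument stays within a signed 32-bit int. */
static void send_large(char *buf, size_t total, int dest, int tag, MPI_Comm comm)
{
    const size_t CHUNK = (size_t)1 << 30;              /* 1 GB per message */
    for (size_t off = 0; off < total; ) {
        size_t n = (total - off < CHUNK) ? (total - off) : CHUNK;
        MPI_Send(buf + off, (int)n, MPI_BYTE, dest, tag, comm);
        off += n;
    }
}

static void recv_large(char *buf, size_t total, int src, int tag, MPI_Comm comm)
{
    const size_t CHUNK = (size_t)1 << 30;
    for (size_t off = 0; off < total; ) {
        size_t n = (total - off < CHUNK) ? (total - off) : CHUNK;
        MPI_Recv(buf + off, (int)n, MPI_BYTE, src, tag, comm, MPI_STATUS_IGNORE);
        off += n;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t Gsize = (size_t)2 * 1024 * 1024 * 1024;     /* 2 GB */
    char *g = malloc(Gsize);

    if (rank == 0)
        send_large(g, Gsize, 1, 1, MPI_COMM_WORLD);
    else if (rank == 1)
        recv_large(g, Gsize, 0, 1, MPI_COMM_WORLD);

    free(g);
    MPI_Finalize();
    return 0;
}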

Thanks
Xianjun

2010/12/6 Tim Prince 

> On 12/5/2010 7:13 PM, 孟宪军 wrote:
>
>> Hi,
>>
>> I ran into a problem recently when I tested the MPI_Send and MPI_Recv
>> functions. When I run the following code, the processes hang and I
>> found no data being transmitted over my network at all.
>>
>> BTW: I ran this test on two x86-64 computers with 16GB of memory,
>> running Linux.
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> int main(int argc, char** argv)
>> {
>>     int localID;
>>     int numOfPros;
>>     size_t Gsize = (size_t)2 * 1024 * 1024 * 1024;
>>
>>     char* g = (char*)malloc(Gsize);
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_size(MPI_COMM_WORLD, &numOfPros);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &localID);
>>
>>     /* 2048-byte element type: a count of 1024*1024 elements is 2 GB in total */
>>     MPI_Datatype MPI_Type_lkchar;
>>     MPI_Type_contiguous(2048, MPI_BYTE, &MPI_Type_lkchar);
>>     MPI_Type_commit(&MPI_Type_lkchar);
>>
>>     if (localID == 0)
>>     {
>>         MPI_Send(g, 1024*1024, MPI_Type_lkchar, 1, 1, MPI_COMM_WORLD);
>>     }
>>
>>     if (localID != 0)
>>     {
>>         MPI_Status status;
>>         MPI_Recv(g, 1024*1024, MPI_Type_lkchar, 0, 1,
>>                  MPI_COMM_WORLD, &status);
>>     }
>>
>>     MPI_Finalize();
>>
>>     return 0;
>> }
>>
> You supplied all your constants as 32-bit signed data, so even if the
> count for MPI_Send() and MPI_Recv() were a larger data type, you would see
> this limit. Did you look at your <mpi.h>?
>
> --
> Tim Prince
>


Re: [OMPI users] MPI_Send doesn't work if the data >= 2GB

2010-12-05 Thread Tim Prince

On 12/5/2010 7:13 PM, 孟宪军 wrote:

Hi,

I ran into a problem recently when I tested the MPI_Send and MPI_Recv
functions. When I run the following code, the processes hang and I
found no data being transmitted over my network at all.

BTW: I ran this test on two x86-64 computers with 16GB of memory,
running Linux.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int localID;
    int numOfPros;
    size_t Gsize = (size_t)2 * 1024 * 1024 * 1024;

    char* g = (char*)malloc(Gsize);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numOfPros);
    MPI_Comm_rank(MPI_COMM_WORLD, &localID);

    /* 2048-byte element type: a count of 1024*1024 elements is 2 GB in total */
    MPI_Datatype MPI_Type_lkchar;
    MPI_Type_contiguous(2048, MPI_BYTE, &MPI_Type_lkchar);
    MPI_Type_commit(&MPI_Type_lkchar);

    if (localID == 0)
    {
        MPI_Send(g, 1024*1024, MPI_Type_lkchar, 1, 1, MPI_COMM_WORLD);
    }

    if (localID != 0)
    {
        MPI_Status status;
        MPI_Recv(g, 1024*1024, MPI_Type_lkchar, 0, 1,
                 MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();

    return 0;
}

You supplied all your constants as 32-bit signed data, so even if the
count for MPI_Send() and MPI_Recv() were a larger data type, you would
see this limit. Did you look at your <mpi.h>?
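
To make the 32-bit point concrete, here is a small standalone snippet (illustrative
only, not from this exchange). With int operands the product 2*1024*1024*1024
overflows before any conversion takes place, whereas promoting the first operand
keeps the arithmetic in 64 bits; and because the count arguments of the C bindings
are plain int, INT_MAX is the ceiling for any single send or receive:

#include <stdio.h>
#include <stddef.h>
#include <limits.h>

int main(void)
{
    /* All operands below are int, so the multiplication would overflow a
       32-bit signed int before the conversion to size_t (undefined behaviour):
           size_t bad = 2 * 1024 * 1024 * 1024;                               */

    /* Casting the first operand makes the whole product 64-bit arithmetic. */
    size_t ok = (size_t)2 * 1024 * 1024 * 1024;

    printf("INT_MAX        = %d\n", INT_MAX);   /* ceiling for an MPI count in C */
    printf("2 GB as size_t = %zu\n", ok);
    return 0;
}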


--
Tim Prince



[OMPI users] MPI_Send doesn't work if the data >= 2GB

2010-12-05 Thread 孟宪军
Hi,

I ran into a problem recently when I tested the MPI_Send and MPI_Recv
functions. When I run the following code, the processes hang and I
found no data being transmitted over my network at all.

BTW: I ran this test on two x86-64 computers with 16GB of memory,
running Linux.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int localID;
    int numOfPros;
    size_t Gsize = (size_t)2 * 1024 * 1024 * 1024;

    char* g = (char*)malloc(Gsize);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numOfPros);
    MPI_Comm_rank(MPI_COMM_WORLD, &localID);

    /* 2048-byte element type: a count of 1024*1024 elements is 2 GB in total */
    MPI_Datatype MPI_Type_lkchar;
    MPI_Type_contiguous(2048, MPI_BYTE, &MPI_Type_lkchar);
    MPI_Type_commit(&MPI_Type_lkchar);

    if (localID == 0)
    {
        MPI_Send(g, 1024*1024, MPI_Type_lkchar, 1, 1, MPI_COMM_WORLD);
    }

    if (localID != 0)
    {
        MPI_Status status;
        MPI_Recv(g, 1024*1024, MPI_Type_lkchar, 0, 1,
                 MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();

    return 0;
}

Thanks
Xianjun


Re: [OMPI users] Scalability issue

2010-12-05 Thread Benjamin Toueg
Unfortunately, DRAGON is old FORTRAN77. Integers have been used in place of
pointers, so if I compile it as 64-bit without -fdefault-integer-8, those
so-called pointers remain 32-bit. Problems could also arise from its
data-structure handlers.

Therefore -fdefault-integer-8 is absolutely necessary.

Furthermore, MPI_SEND and MPI_RECV are called only a dozen times, in a single
source file (used for passing a data structure from one node to another), and
that code has proved to work in every situation.

Not knowing which line is causing my segfault is annoying.

Regards,
Benjamin

2010/12/6 Gustavo Correa 

> Hi Benjamin
>
> I would just rebuild OpenMPI withOUT the compiler flags that change the
> standard sizes of "int" and "float" (do a "make distclean" first!), then
> recompile your program, and see how it goes.
> I don't think you are gaining anything by trying to change the standard
> "int/integer" and "real/float" sizes, and most likely these flags are
> inviting trouble, making things more confusing.
> Worst case, you will at least be sure that the bug is somewhere else,
> not in a mismatch of basic type sizes.
>
> If you need to pass 8-byte real buffers, use MPI_DOUBLE_PRECISION, or
> MPI_REAL8
> in your (Fortran) MPI calls, and declare them in the Fortran code
> accordingly
> (double precision or real(kind=8)).
>
> If I remember right, there is no 8-byte integer support in the Fortran MPI
> bindings, only in the C bindings, but some OpenMPI expert could clarify this.
> Hence, if you are passing 8-byte integers in your MPI calls, this may also be
> problematic.
>
> My two cents,
> Gus Correa
>
> On Dec 5, 2010, at 3:04 PM, Benjamin Toueg wrote:
>
> > Hi,
> >
> > First of all thanks for your insight !
> >
> > Do you get a corefile?
> > I don't get a core file, but I get a file called _FIL001. It doesn't
> contain any debugging symbols. It's most likely a digested version of the
> input file given to the executable : ./myexec < inputfile.
> >
> > there's no line numbers printed in the stack trace
> > I would love to see those, but even if I compile openmpi with -debug
> -mem-debug -mem-profile, they don't show up. I recompiled my sources to be
> sure to properly link them to the newly debugged version of openmpi. I
> assumed I didn't need to compile my own sources with the -g option since it
> crashes in openmpi itself? I didn't try to run mpiexec via gdb either; I
> guess it won't help since I already get the trace.
> >
> > the -fdefault-integer-8 option ought to be highly dangerous
> > Thanks for noting. Indeed I had some issues with this option. For
> instance I have to declare some arguments as INTEGER*4 like RANK,SIZE,IERR
> in :
> > CALL MPI_COMM_RANK(MPI_COMM_WORLD,RANK,IERR)
> > CALL MPI_COMM_SIZE(MPI_COMM_WORLD,SIZE,IERR)
> > In your example "call MPI_Send(buf, count, MPI_INTEGER, dest, tag,
> MPI_COMM_WORLD, mpierr)" I checked that count is never bigger than 2000 (as
> you mentioned it could flip to the negative). However I haven't declared it
> as INTEGER*4 and I think I should.
> > When I said "I had to raise the number of data structures to be sent", I
> meant that I had to call MPI_SEND many more times, not that buffers were
> bigger than before.
> >
> > I'll get back to you with more info when I'm able to fix my connection
> problem with the cluster...
> >
> > Thanks,
> > Benjamin
> >
> > 2010/12/3 Martin Siegert 
> > Hi All,
> >
> > just to expand on this guess ...
> >
> > On Thu, Dec 02, 2010 at 05:40:53PM -0500, Gus Correa wrote:
> > > Hi All
> > >
> > > I wonder if configuring OpenMPI while
> > > forcing the default types to non-default values
> > > (-fdefault-integer-8 -fdefault-real-8) might have
> > > something to do with the segmentation fault.
> > > Would this be effective, i.e., actually make the
> > sizes of MPI_INTEGER/MPI_INT and MPI_REAL/MPI_FLOAT bigger,
> > > or just elusive?
> >
> > I believe what happens is that this mostly affects the fortran
> > wrapper routines and the way Fortran variables are mapped to C:
> >
> > MPI_INTEGER -> MPI_LONG
> > MPI_FLOAT   -> MPI_DOUBLE
> > MPI_DOUBLE_PRECISION -> MPI_DOUBLE
> >
> > In that respect I believe that the -fdefault-real-8 option is harmless,
> > i.e., it does the expected thing.
> > But the -fdefault-integer-8 option ought to be highly dangerous:
> > It works for integer variables that are used as "buffer" arguments
> > in MPI statements, but I would assume that this does not work for
> > "count" and similar arguments.
> > Example:
> >
> > integer, allocatable :: buf(:,:)
> > integer i, count, dest, tag, mpierr
> >
> > i = 32768
> > i2 = 2*i
> > allocate(buf(i,i2))
> > count = i*i2
> > buf = 1
> > call MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, mpierr)
> >
> > Now count is 2^31 which overflows a 32bit integer.
> > The MPI standard requires that count is a 32bit integer, correct?
> > Thus while buf gets the type MPI_LONG, count remains an int.
> > Is this interpretation correct? If it is, 

Re: [OMPI users] Scalability issue

2010-12-05 Thread Tim Prince

On 12/5/2010 3:22 PM, Gustavo Correa wrote:

I would just rebuild OpenMPI withOUT the compiler flags that change the
standard sizes of "int" and "float" (do a "make distclean" first!), then
recompile your program, and see how it goes.
I don't think you are gaining anything by trying to change the standard
"int/integer" and "real/float" sizes, and most likely these flags are
inviting trouble, making things more confusing.
Worst case, you will at least be sure that the bug is somewhere else,
not in a mismatch of basic type sizes.

If you need to pass 8-byte real buffers, use MPI_DOUBLE_PRECISION, or MPI_REAL8
in your (Fortran) MPI calls, and declare them in the Fortran code accordingly
(double precision or real(kind=8)).

If I remember right, there is no 8-byte integer support in the Fortran MPI
bindings, only in the C bindings, but some OpenMPI expert could clarify this.
Hence, if you are passing 8-byte integers in your MPI calls, this may also be
problematic.
My colleagues routinely use 8-byte integers with Fortran, but I agree 
it's not done by changing openmpi build parameters.   They do use 
Fortran compile line options for the application to change the default 
integer and real to 64-bit.  I wasn't aware of any reluctance to use 
MPI_INTEGER8.


--
Tim Prince



Re: [OMPI users] error mesages appeared but program runs successfully?

2010-12-05 Thread Gustavo Correa
Hi Daofeng

It is hard to tell what is happening on the Infiniband side of the problem.
Did somebody perhaps remove the Infiniband card from this machine?
Was it ever there?
Did somebody perhaps change the Linux kernel modules that are loaded
(perhaps changing /etc/module.config or similar)?
Maybe other people in your organization know.

If this is a single computer, not a cluster, you don't lose anything by not
having Infiniband.
In this case, you can reinstall OpenMPI without Infiniband support, by just
doing "make distclean" in the OpenMPI build directory (to clean up what is
there),
then "./configure --prefix=/wherever/you/want/to/install --without-openib",
then "make", and "make install".

Alternatively, you can continue to use what you already have with the "-mca btl 
^openib" flag.

If this is a cluster, of course you would benefit from Infiniband, which is a
faster network than Ethernet or Gigabit Ethernet.
In this case you need to ask for help from somebody who knows more about your
cluster hardware, to restore the Infiniband to a sane and healthy state.
Or, if there is no Infiniband hardware, or if it is broken, just reinstall
OpenMPI following the little recipe above.  You will be able to run your
programs using Ethernet (I assume the cluster has Ethernet).  Not very fast,
but it will work.

My two cents,
Gus Correa


On Dec 4, 2010, at 4:47 AM, Daofeng Li wrote:

> Hi Gus,
>  
> Thank you for your response.
> I think this is mostly about hardware, which I know little about. :)
> It might be that the machine I use doesn't have the card you mentioned, as when I run:
>  /usr/sbin/ibstat
> ibwarn: [4260] umad_init: can't read ABI version from 
> /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad 
> module loaded?
> ibpanic: [4260] main: can't init UMAD library: (No such file or directory)
> 
> But you really helped me:
>  
> $ mpirun -mca btl ^openib -n 8 hello_cxx
> Hello, world!  I am 6 of 8
> Hello, world!  I am 0 of 8
> Hello, world!  I am 4 of 8
> Hello, world!  I am 7 of 8
> Hello, world!  I am 5 of 8
> Hello, world!  I am 2 of 8
> Hello, world!  I am 1 of 8
> Hello, world!  I am 3 of 8
>  
> that's really cool~
>  
> thank you all:)
>  
> Best Wishes.
> On Sat, Dec 4, 2010 at 11:12 AM, Gus Correa  wrote:
> Hi Daofeng
> 
> Do you have an Infiniband card in the machine where you are
> running the program?
> (Open Fabrics / OFED is the software support for Infiniband.
> I guess you need the same version installed in all machines.)
> 
> Does the directory referred in the error message actually
> exist in your machine (i.e,  /dev/infiniband) ?
> 
> Are you running it in the same machine where you installed OpenMPI?
> 
> What output do you get from:
> /usr/sbin/ibstat
> ?
> 
> Did you compile the programs with the mpicc, mpiCC, mpif77
> from the same OpenMPI that you built?
> (Some Linux distributions and compilers come with
> their own flavors of MPI, or you may also
> have installed MPICH or MVAPICH, so it is not uncommon to mix up.)
> 
> Have you tried to suppress the use of Infiniband, i.e.:
> 
> mpirun -mca btl ^openib -n 8 hello_cxx
> 
> (Well, "openib" is the OpenMPI support for Infiniband.
> The "^" means "don't use it")
> 
> I hope this helps,
> Gus Correa
> 
> Daofeng Li wrote:
> Dear Jeff,
>  Actually, I did not understand this... can you or anyone tell me what to do?
>  Thx.
>  Best.
> 
> On Fri, Dec 3, 2010 at 9:41 PM, Jeff Squyres (jsquyres) wrote:
> 
>    It means that you probably have a version mismatch with your
>    OpenFabrics drivers, and/or you have no OpenFabrics hardware and you
>    should probably disable those drivers.
>Sent from my PDA. No type good. 
>    On Dec 3, 2010, at 4:56 AM, "Daofeng Li" wrote:
> 
>    Dear list,
>    I am currently trying to use the OpenMPI package.
>    I installed it in my home directory:
>    ./configure --prefix=$HOME --enable-mpi-threads
>    make
>    make install
>    Then I added ~/bin to my PATH and ~/lib to my
>    LD_LIBRARY_PATH in my .bashrc file.
>    Everything seems normal, as I can run the example programs:
>    mpirun -n 8 hello_cxx
>    mpirun -n 8 hello_f77
>    mpirun -n 8 hello_c
>    etc...
>    But error messages appear:
> $ mpirun -n 8 hello_cxx
>librdmacm: couldn't read ABI version.
>librdmacm: assuming: 4
>libibverbs: Fatal: couldn't read uverbs ABI version.
>CMA: unable to open /dev/infiniband/rdma_cm
>libibverbs: Fatal: couldn't read uverbs ABI version.
>--
>[[32727,1],1]: A high-performance Open MPI point-to-point
>messaging module
>was unable to find any relevant network interfaces:
>Module: OpenFabrics (openib)
>  Host: localhost.localdomain
>Another transport will be used instead, although this may result in
>lower performance.
>

Re: [OMPI users] Scalability issue

2010-12-05 Thread Gustavo Correa
Hi Benjamin

I would just rebuild OpenMPI withOUT the compiler flags that change the
standard sizes of "int" and "float" (do a "make distclean" first!), then
recompile your program, and see how it goes.
I don't think you are gaining anything by trying to change the standard
"int/integer" and "real/float" sizes, and most likely these flags are
inviting trouble, making things more confusing.
Worst case, you will at least be sure that the bug is somewhere else,
not in a mismatch of basic type sizes.

If you need to pass 8-byte real buffers, use MPI_DOUBLE_PRECISION, or MPI_REAL8 
in your (Fortran) MPI calls, and declare them in the Fortran code accordingly 
(double precision or real(kind=8)).

If I remember right, there is no 8-byte integer support in the Fortran MPI
bindings, only in the C bindings, but some OpenMPI expert could clarify this.
Hence, if you are passing 8-byte integers in your MPI calls, this may also be
problematic.

My two cents,
Gus Correa

On Dec 5, 2010, at 3:04 PM, Benjamin Toueg wrote:

> Hi,
> 
> First of all thanks for your insight !
> 
> Do you get a corefile?
> I don't get a core file, but I get a file called _FIL001. It doesn't contain 
> any debugging symbols. It's most likely a digested version of the input file 
> given to the executable : ./myexec < inputfile.
> 
> there's no line numbers printed in the stack trace
> I would love to see those, but even if I compile openmpi with -debug 
> -mem-debug -mem-profile, they don't show up. I recompiled my sources to be 
> sure to properly link them to the newly debugged version of openmpi. I 
> assumed I didn't need to compile my own sources with the -g option since it
> crashes in openmpi itself? I didn't try to run mpiexec via gdb either; I
> guess it won't help since I already get the trace.
> 
> the -fdefault-integer-8 option ought to be highly dangerous
> Thanks for noting. Indeed I had some issues with this option. For instance I 
> have to declare some arguments as INTEGER*4 like RANK,SIZE,IERR in :
> CALL MPI_COMM_RANK(MPI_COMM_WORLD,RANK,IERR)
> CALL MPI_COMM_SIZE(MPI_COMM_WORLD,SIZE,IERR)
> In your example "call MPI_Send(buf, count, MPI_INTEGER, dest, tag, 
> MPI_COMM_WORLD, mpierr)" I checked that count is never bigger than 2000 (as 
> you mentioned it could flip to the negative). However I haven't declared it 
> as INTEGER*4 and I think I should.
> When I said "I had to raise the number of data structures to be sent", I
> meant that I had to call MPI_SEND many more times, not that buffers were 
> bigger than before.
> 
> I'll get back to you with more info when I'm able to fix my connection
> problem with the cluster...
> 
> Thanks,
> Benjamin
> 
> 2010/12/3 Martin Siegert 
> Hi All,
> 
> just to expand on this guess ...
> 
> On Thu, Dec 02, 2010 at 05:40:53PM -0500, Gus Correa wrote:
> > Hi All
> >
> > I wonder if configuring OpenMPI while
> > forcing the default types to non-default values
> > (-fdefault-integer-8 -fdefault-real-8) might have
> > something to do with the segmentation fault.
> > Would this be effective, i.e., actually make the
> > sizes of MPI_INTEGER/MPI_INT and MPI_REAL/MPI_FLOAT bigger,
> > or just elusive?
> 
> I believe what happens is that this mostly affects the fortran
> wrapper routines and the way Fortran variables are mapped to C:
> 
> MPI_INTEGER -> MPI_LONG
> MPI_FLOAT   -> MPI_DOUBLE
> MPI_DOUBLE_PRECISION -> MPI_DOUBLE
> 
> In that respect I believe that the -fdefault-real-8 option is harmless,
> i.e., it does the expected thing.
> But the -fdefault-integer-8 option ought to be highly dangerous:
> It works for integer variables that are used as "buffer" arguments
> in MPI statements, but I would assume that this does not work for
> "count" and similar arguments.
> Example:
> 
> integer, allocatable :: buf(:,:)
> integer i, count, dest, tag, mpierr
> 
> i = 32768
> i2 = 2*i
> allocate(buf(i,i2))
> count = i*i2
> buf = 1
> call MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, mpierr)
> 
> Now count is 2^31 which overflows a 32bit integer.
> The MPI standard requires that count is a 32bit integer, correct?
> Thus while buf gets the type MPI_LONG, count remains an int.
> Is this interpretation correct? If it is, then you are calling
> MPI_Send with a count argument of -2147483648.
> Which could result in a segmentation fault.
> 
> Cheers,
> Martin
> 
> --
> Martin Siegert
> Head, Research Computing
> WestGrid/ComputeCanada Site Lead
> IT Services                 phone: 778 782-4691
> Simon Fraser University     fax:   778 782-4242
> Burnaby, British Columbia   email: sieg...@sfu.ca
> Canada  V5A 1S6
> 
> > There were some recent discussions here about MPI
> > limiting counts to MPI_INTEGER.
> > Since Benjamin said he "had to raise the number of data structures",
> > which eventually led to the error,
> > I wonder if he is inadvertently flipping to the negative integer
> > side of the 32-bit universe (i.e. >= 2**31), as was 

Re: [OMPI users] Scalability issue

2010-12-05 Thread Benjamin Toueg
Hi,

First of all thanks for your insight !

*Do you get a corefile?*
I don't get a core file, but I get a file called _FIL001. It doesn't contain
any debugging symbols. It's most likely a digested version of the input file
given to the executable : ./myexec < inputfile.

*there's no line numbers printed in the stack trace*
I would love to see those, but even if I compile openmpi with -debug
-mem-debug -mem-profile, they don't show up. I recompiled my sources to be
sure to properly link them to the newly debugged version of openmpi. I
assumed I didn't need to compile my own sources with the -g option since it
crashes in openmpi itself? I didn't try to run mpiexec via gdb either; I
guess it won't help since I already get the trace.

*the -fdefault-integer-8 option ought to be highly dangerous*
Thanks for noting. Indeed I had some issues with this option. For instance I
have to declare some arguments as INTEGER*4 like RANK,SIZE,IERR in :
CALL MPI_COMM_RANK(MPI_COMM_WORLD,RANK,IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,SIZE,IERR)
In your example "call MPI_Send(buf, count, MPI_INTEGER, dest, tag,
MPI_COMM_WORLD, mpierr)" I checked that count is never bigger than 2000 (as
you mentioned it could flip to the negative). However I haven't declared it
as INTEGER*4 and I think I should.
When I said "I had to raise the number of data structures to be sent", I
meant that I had to call MPI_SEND many more times, not that buffers were
bigger than before.

I'll get back to you with more info when I'm able to fix my connection
problem with the cluster...

Thanks,
Benjamin

2010/12/3 Martin Siegert 

> Hi All,
>
> just to expand on this guess ...
>
> On Thu, Dec 02, 2010 at 05:40:53PM -0500, Gus Correa wrote:
> > Hi All
> >
> > I wonder if configuring OpenMPI while
> > forcing the default types to non-default values
> > (-fdefault-integer-8 -fdefault-real-8) might have
> > something to do with the segmentation fault.
> > Would this be effective, i.e., actually make the
> > sizes of MPI_INTEGER/MPI_INT and MPI_REAL/MPI_FLOAT bigger,
> > or just elusive?
>
> I believe what happens is that this mostly affects the fortran
> wrapper routines and the way Fortran variables are mapped to C:
>
> MPI_INTEGER -> MPI_LONG
> MPI_FLOAT   -> MPI_DOUBLE
> MPI_DOUBLE_PRECISION -> MPI_DOUBLE
>
> In that respect I believe that the -fdefault-real-8 option is harmless,
> i.e., it does the expected thing.
> But the -fdefault-integer-8 option ought to be highly dangerous:
> It works for integer variables that are used as "buffer" arguments
> in MPI statements, but I would assume that this does not work for
> "count" and similar arguments.
> Example:
>
> integer, allocatable :: buf(:,:)
> integer i, count, dest, tag, mpierr
>
> i = 32768
> i2 = 2*i
> allocate(buf(i,i2))
> count = i*i2
> buf = 1
> call MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, mpierr)
>
> Now count is 2^31 which overflows a 32bit integer.
> The MPI standard requires that count is a 32bit integer, correct?
> Thus while buf gets the type MPI_LONG, count remains an int.
> Is this interpretation correct? If it is, then you are calling
> MPI_Send with a count argument of -2147483648.
> Which could result in a segmentation fault.
>
> Cheers,
> Martin
>
> --
> Martin Siegert
> Head, Research Computing
> WestGrid/ComputeCanada Site Lead
> IT Services                 phone: 778 782-4691
> Simon Fraser University     fax:   778 782-4242
> Burnaby, British Columbia   email: sieg...@sfu.ca
> Canada  V5A 1S6
>
> > There were some recent discussions here about MPI
> > limiting counts to MPI_INTEGER.
> > Since Benjamin said he "had to raise the number of data structures",
> > which eventually led to the error,
> > I wonder if he is inadvertently flipping to the negative integer
> > side of the 32-bit universe (i.e. >= 2**31), as was reported here by
> > other list subscribers a few times.
> >
> > Anyway, a segmentation fault can come from many different places;
> > this is just a guess.
> >
> > Gus Correa
> >
> > Jeff Squyres wrote:
> > >Do you get a corefile?
> > >
> > >It looks like you're calling MPI_RECV in Fortran and then it segv's.
>  This is *likely* because you're either passing a bad parameter or your
> buffer isn't big enough.  Can you double check all your parameters?
> > >
> > >Unfortunately, there's no line numbers printed in the stack trace, so
> it's not possible to tell exactly where in the ob1 PML it's dying (i.e., so
> we can't see exactly what it's doing to cause the segv).
> > >
> > >
> > >
> > >On Dec 2, 2010, at 9:36 AM, Benjamin Toueg wrote:
> > >
> > >>Hi,
> > >>
> > >>I am using DRAGON, a neutronic simulation code in FORTRAN77 that has
> its own data structures. I added a module to send these data structures
> via MPI_SEND / MPI_RECV, and everything worked perfectly for a
> while.
> > >>
> > >>Then I had to raise the number of data structures to be sent up to a
> point where my cluster has this bug :
>

Re: [OMPI users] difference between single and double precision

2010-12-05 Thread Eugene Loh

Mathieu Gontier wrote:


  Dear OpenMPI users

I am dealing with an arithmetic problem. In fact, I have two variants
of my code: one in single precision, one in double precision. When I
compare the two executables built with MPICH, one can observe an
expected performance difference: 115.7 sec in single precision
against 178.68 sec in double precision (+54%).


The thing is, when I use OpenMPI, the difference is much bigger:
238.5 sec in single precision against 403.19 sec in double precision (+69%).


Our experience has already shown that OpenMPI is less efficient than
MPICH on Ethernet with a small number of processes. This explains the
difference between the first set of results with MPICH and the second
set with OpenMPI. (But if someone has more information about that, or
even a solution, I am of course interested.)
But using OpenMPI also widens the gap between the two precisions.
Is it an accentuation of the OpenMPI-over-Ethernet loss of performance,
is it another issue in OpenMPI, or is there an option I can use?


It is also unusual that the performance difference between MPICH and 
OMPI is so large.  You say that OMPI is slower than MPICH even at small 
process counts.  Can you confirm that this is because MPI calls are 
slower?  Some of the biggest performance differences I've seen between 
MPI implementations had nothing to do with the performance of MPI calls 
at all.  It had to do with process binding or other factors that 
impacted the computational (non-MPI) performance of the code.  The 
performance of MPI calls was basically irrelevant.


In this particular case, I'm not convinced since neither OMPI nor MPICH 
binds processes by default.


Still, can you do some basic performance profiling to confirm what 
aspect of your application is consuming so much time?  Is it a 
particular MPI call?  If your application is spending almost all of its 
time in MPI calls, do you have some way of judging whether the faster 
performance is acceptable?  That is, is 238 secs acceptable and 403 secs 
slow?  Or, are both timings unacceptable -- e.g., the code "should" be 
running in about 30 secs.
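
As a rough sketch of the kind of basic measurement meant here (illustrative only;
compute_step() and exchange_halo() are hypothetical stand-ins for the application's
compute and communication phases), MPI_Wtime() can be used to split each rank's
wall time into compute time and MPI time:

#include <stdio.h>
#include <mpi.h>

static void compute_step(void)  { /* placeholder for the application's work      */ }
static void exchange_halo(void) { /* placeholder for the application's MPI calls */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_comp = 0.0, t_comm = 0.0;
    for (int step = 0; step < 100; step++) {
        double t0 = MPI_Wtime();
        compute_step();
        double t1 = MPI_Wtime();
        exchange_halo();
        double t2 = MPI_Wtime();
        t_comp += t1 - t0;
        t_comm += t2 - t1;
    }

    /* Report the slowest rank; load imbalance usually hides in the maximum. */
    double max_comp, max_comm;
    MPI_Reduce(&t_comp, &max_comp, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t_comm, &max_comm, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("max compute time %.2f s, max MPI time %.2f s\n", max_comp, max_comm);

    MPI_Finalize();
    return 0;
}

Comparing the two reported numbers between the MPICH and OpenMPI builds would show
directly whether the extra time is really spent inside MPI calls or in the
computation itself.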