Re: [OMPI users] Scalability issue
Hi,

I did some testing and felt like giving some feedback.

When I started this discussion, I compiled openmpi like this:

./configure --prefix=/home/toueg/openmpi CXX=g++ CC=gcc F77=gfortran FC=gfortran \
  FLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8" \
  FCFLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8" \
  --disable-mpi-f90

Now I compile openmpi like this:

./configure --prefix=/home/toueg/openmpi CXX=g++ CC=gcc F77=gfortran FC=gfortran \
  --disable-mpi-f90

I still get the same segmentation fault:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x2c2579fc0
[ 0] /lib/libpthread.so.0 [0x7f52d2930410]
[ 1] /home/toueg/openmpi/lib/openmpi/mca_pml_ob1.so [0x7f52d153fe03]
[ 2] /home/toueg/openmpi/lib/libmpi.so.0(PMPI_Recv+0x2d2) [0x7f52d3504a1e]
[ 3] /home/toueg/openmpi/lib/libmpi_f77.so.0(pmpi_recv_+0x10e) [0x7f52d36cf9c6]

Compiling openmpi with or without the options
FLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8"
FCFLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8"
does not seem to change anything. I'd like to stress that in both cases
the size of MPI_INTEGER is 4 bytes.

I'll follow my own intuition and Jeff's advice, which is to use the same
flags for compiling openmpi as for compiling DRAGON.

Thanks,
Benjamin

> I always recommend using the same flags for compiling OMPI as compiling your
> application. Of course, you can vary some flags that don't matter (e.g.,
> compiling your app with -g and compiling OMPI with -Ox). But for
> "significant" behavior changes (like changing the size of INTEGER), they
> should definitely match between your app and OMPI.
>
> > As per several previous discussions here in the list,
> > I was persuaded to believe that MPI_INT / MPI_INTEGER is written
> > in stone to be 4-bytes (perhaps by MPI standard,
> > perhaps the configure script, maybe by both),
>
> Neither, actually. :-)
>
> The MPI spec is very, very careful not to mandate the size of int or
> INTEGER at all.
>
> > and that "counts" in [Open]MPI would also be restricted to that size
> > i.e., effectively up to 2147483647, if I counted right.
>
> *Most* commodity systems (excluding the embedded world) have 4 byte int's
> these days, in part because most systems are this way (i.e., momentum).
> Hence, when we talk about the 2B count limit, we're referring to the fact
> that most systems where MPI is used default to 4 byte int's.
>
> > I may have inadvertently misled Benjamin, if this perception is wrong.
> > I will gladly stand corrected, if this is so.
> >
> > You are the OpenMPI user's oracle (oops, sorry Cisco),
> > so please speak out.
>
> Please buy Cisco stuff! :-p
>
> --
> Jeff Squyres
> jsquy...@cisco.com
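To make the size mismatch in the message above concrete, here is a minimal sketch in plain Python (no MPI involved). It assumes a little-endian machine, an application whose INTEGERs are 8 bytes (as with -fdefault-integer-8), and an MPI_INTEGER of 4 bytes as reported above. The names `bytes_described`, `APP_INTEGER_SIZE` and `MPI_INTEGER_SIZE` are hypothetical; the sketch only models byte counts and is not a description of what Open MPI does internally.

```python
import struct

APP_INTEGER_SIZE = 8   # assumption: application built with -fdefault-integer-8
MPI_INTEGER_SIZE = 4   # assumption: MPI_INTEGER is 4 bytes, as reported above


def bytes_described(count, type_size):
    """Number of bytes that a (count, datatype) pair describes."""
    return count * type_size


values = [10, 20, 30, 40]
# The application's buffer: four 8-byte little-endian integers.
buf = struct.pack("<4q", *values)

# Bytes covered when the same count is paired with a 4-byte datatype.
covered = bytes_described(len(values), MPI_INTEGER_SIZE)

# Reinterpret the covered bytes as 8-byte integers, as the receiver would.
received = struct.unpack(f"<{covered // APP_INTEGER_SIZE}q", buf[:covered])

print("buffer bytes :", len(buf))
print("covered bytes:", covered)
print("received     :", received)
```

Under these assumptions only the first half of the buffer is covered, which is one way a mismatch between the application's INTEGER and MPI_INTEGER can corrupt a transfer without any count being large.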
Re: [OMPI users] Scalability issue
On Dec 7, 2010, at 8:33 AM, Gus Correa wrote:

> Did I understand you right?
>
> Are you saying that one can effectively double the counting
> capability (i.e. the "count" parameters in MPI calls) of OpenMPI
> by compiling it with 8-byte integer flags?

Yes and no.

If you increase the size of INTEGER *and* int, then hypothetically yes --
although I literally just got a report from someone today that tried a
compiler flag to increase the size of C int to 8 bytes and something didn't
work right (I don't think we've ever tried this before, so it's not
surprising that there are likely some bugs in there).

We have previously tested the increase-the-sizeof-INTEGER-to-8-bytes
compiler flags and AFAIK, that's still working fine. When you call MPI_SEND
with an INTEGER count, OMPI will truncate it down to the size of a C int (if
we had 8 byte C ints working, this might be a different story).

But keep in mind that increasing the size of C ints will likely cause
problems in other areas -- are OS system calls that take int parameters
firmly sized (i.e., int32 and the like)? I'm not so sure -- indeed, that
might even be (one of the) problem(s) with the report that I got earlier
today...

> And long as one consistently uses the same flags to compile
> the application, everything would work smoothly?

I always recommend using the same flags for compiling OMPI as compiling your
application. Of course, you can vary some flags that don't matter (e.g.,
compiling your app with -g and compiling OMPI with -Ox). But for
"significant" behavior changes (like changing the size of INTEGER), they
should definitely match between your app and OMPI.

> As per several previous discussions here in the list,
> I was persuaded to believe that MPI_INT / MPI_INTEGER is written
> in stone to be 4-bytes (perhaps by MPI standard,
> perhaps the configure script, maybe by both),

Neither, actually. :-)

The MPI spec is very, very careful not to mandate the size of int or
INTEGER at all.

> and that "counts" in [Open]MPI would also be restricted to that size
> i.e., effectively up to 2147483647, if I counted right.

*Most* commodity systems (excluding the embedded world) have 4 byte int's
these days, in part because most systems are this way (i.e., momentum).
Hence, when we talk about the 2B count limit, we're referring to the fact
that most systems where MPI is used default to 4 byte int's.

> I may have inadvertently misled Benjamin, if this perception is wrong.
> I will gladly stand corrected, if this is so.
>
> You are the OpenMPI user's oracle (oops, sorry Cisco),
> so please speak out.

Please buy Cisco stuff! :-p

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
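The "2B count limit" mentioned above follows directly from the width of a signed integer. The sketch below is plain Python; `max_count` is a hypothetical helper, not an MPI or Open MPI function. It computes the largest count for a given integer width and prints the size of a C int on the machine running the script.

```python
import ctypes


def max_count(int_bytes):
    """Largest value representable in a signed integer of `int_bytes` bytes."""
    return 2 ** (8 * int_bytes - 1) - 1


# Size of a C int on this machine (4 bytes on most commodity systems).
print("sizeof(int) here:", ctypes.sizeof(ctypes.c_int))

for size in (4, 8):
    print(f"{size}-byte signed limit: {max_count(size)}")
```

With 4-byte integers the limit is 2147483647, the figure quoted in the thread. An 8-byte count would raise the limit far beyond double, but, as noted above, only if both the Fortran INTEGER and the C int are widened and the build actually works that way.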
Re: [OMPI users] Scalability issue
Hi Jeff

Did I understand you right?

Are you saying that one can effectively double the counting
capability (i.e. the "count" parameters in MPI calls) of OpenMPI
by compiling it with 8-byte integer flags?

And as long as one consistently uses the same flags to compile
the application, everything would work smoothly?

As per several previous discussions here in the list,
I was persuaded to believe that MPI_INT / MPI_INTEGER is written
in stone to be 4-bytes (perhaps by MPI standard,
perhaps the configure script, maybe by both),
and that "counts" in [Open]MPI would also be restricted to that size
i.e., effectively up to 2147483647, if I counted right.

I may have inadvertently misled Benjamin, if this perception is wrong.
I will gladly stand corrected, if this is so.

You are the OpenMPI user's oracle (oops, sorry Cisco),
so please speak out.

Cheers,
Gus Correa

Jeff Squyres wrote:
> It is always a good idea to have your application's sizeof(INTEGER)
> match the MPI's sizeof(INTEGER). Having them mismatch is a recipe for
> trouble.
>
> Meaning: if you're compiling your app with -make-integer-be-8-bytes,
> then you should configure/build Open MPI with that same flag.
>
> I'm thinking that this should *only* affect the back-end behavior of
> MPI_INTEGER; the size of address pointers and whatnot should not be
> affected (unless -make-integer-be-8-bytes also changes the sizes of
> some other types).
>
> On Dec 5, 2010, at 9:01 PM, Gustavo Correa wrote:
>
>> Hi Benjamin
>>
>> I guess you could compile OpenMPI with standard integer and real sizes.
>> Then compile your application (DRAGON) with the flags to change to 8-byte
>> integers and 8-byte reals.
>> We have some programs here that use real8 and are compiled this way,
>> and run without a problem.
>> I guess this is what Tim Prince was also telling you in his comments.
>>
>> You can pass those flags to the MPI compiler wrappers (mpif77 etc),
>> which will relay them to gfortran when you compile DRAGON.
Re: [OMPI users] Scalability issue
It is always a good idea to have your application's sizeof(INTEGER) match
the MPI's sizeof(INTEGER). Having them mismatch is a recipe for trouble.

Meaning: if you're compiling your app with -make-integer-be-8-bytes, then
you should configure/build Open MPI with that same flag.

I'm thinking that this should *only* affect the back-end behavior of
MPI_INTEGER; the size of address pointers and whatnot should not be affected
(unless -make-integer-be-8-bytes also changes the sizes of some other types).

On Dec 5, 2010, at 9:01 PM, Gustavo Correa wrote:

> Hi Benjamin
>
> I guess you could compile OpenMPI with standard integer and real sizes.
> Then compile your application (DRAGON) with the flags to change to 8-byte
> integers and 8-byte reals.
> We have some programs here that use real8 and are compiled this way,
> and run without a problem.
> I guess this is what Tim Prince was also telling you in his comments.
>
> You can pass those flags to the MPI compiler wrappers (mpif77 etc),
> which will relay them to gfortran when you compile DRAGON.
>
> I am not even sure if those flags would be accepted or ignored by OpenMPI
> when you build it.
> I guess they will be ignored.
> You could check this out by looking at the MPI type sizes in your header
> files in the include directory and subdirectories.
>
> Maybe an OpenMPI developer could shed some light here.
>
> Moreover, if I remember right,
> the MPI address type complies with the machine architecture,
> i.e., 32 bits if your machine is 32-bit, 64-bits if the machine is 64-bit,
> and you don't need to force it to be 8-bytes with compilation flags.
>
> Unfortunately mixing pointers ("Cray pointers", I suppose)
> with integers is a common source of headaches, if DRAGON does this.
> It is yet another possible situation where negative integers could creep in
> and lead to segmentation fault.
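The warning above about mismatched sizeof(INTEGER) can be illustrated with a small model in plain Python. It assumes a little-endian machine, a library that writes its integer results as 4 bytes, and a caller whose variable is 8 bytes wide and still holds stale data in its upper half. `callee_writes_int32` is a hypothetical stand-in for such a library routine, not an Open MPI function.

```python
import struct


def callee_writes_int32(storage, value):
    """Model a library that stores a 4-byte integer at the start of `storage`."""
    struct.pack_into("<i", storage, 0, value)


# Caller's 8-byte INTEGER; 0xFF bytes stand for whatever was there before.
rank8 = bytearray(b"\xff" * 8)
callee_writes_int32(rank8, 3)
seen_by_8_byte_caller = struct.unpack("<q", rank8)[0]

# Caller's 4-byte INTEGER (e.g. declared INTEGER*4): sizes agree.
rank4 = bytearray(b"\xff" * 4)
callee_writes_int32(rank4, 3)
seen_by_4_byte_caller = struct.unpack("<i", rank4)[0]

print("8-byte caller sees:", seen_by_8_byte_caller)
print("4-byte caller sees:", seen_by_4_byte_caller)
```

In this model the 8-byte caller reads a value polluted by the untouched upper bytes, while the 4-byte caller reads 3. That is consistent with the observation elsewhere in the thread that arguments such as RANK, SIZE and IERR had to be declared INTEGER*4 when the application was built with 8-byte default integers.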
Re: [OMPI users] Scalability issue
Hi Benjamin

I guess you could compile OpenMPI with standard integer and real sizes.
Then compile your application (DRAGON) with the flags to change to 8-byte
integers and 8-byte reals.
We have some programs here that use real8 and are compiled this way,
and run without a problem.
I guess this is what Tim Prince was also telling you in his comments.

You can pass those flags to the MPI compiler wrappers (mpif77 etc),
which will relay them to gfortran when you compile DRAGON.

I am not even sure if those flags would be accepted or ignored by OpenMPI
when you build it.
I guess they will be ignored.
You could check this out by looking at the MPI type sizes in your header
files in the include directory and subdirectories.

Maybe an OpenMPI developer could shed some light here.

Moreover, if I remember right,
the MPI address type complies with the machine architecture,
i.e., 32 bits if your machine is 32-bit, 64 bits if the machine is 64-bit,
and you don't need to force it to be 8 bytes with compilation flags.

Unfortunately mixing pointers ("Cray pointers", I suppose)
with integers is a common source of headaches, if DRAGON does this.
It is yet another possible situation where negative integers could creep in
and lead to a segmentation fault.
At least one ocean circulation model we run here had
many problems because of this mix of integers and (Cray) pointers
spread all across the code.

Gus Correa

On Dec 5, 2010, at 7:17 PM, Benjamin Toueg wrote:

> Unfortunately DRAGON is old FORTRAN77. Integers have been used instead of
> pointers. If I compile it in 64bits without -fdefault-integer-8, the
> so-called pointers will remain in 32bits. Problems could also arise from its
> data structure handlers.
>
> Therefore -fdefault-integer-8 is absolutely necessary.
>
> Furthermore MPI_SEND and MPI_RECV are called a dozen times in only one
> source file (used for passing a data structure from one node to another) and
> it has proved to be working in every situation.
Re: [OMPI users] Scalability issue
Unfortunately DRAGON is old FORTRAN77. Integers have been used instead of
pointers. If I compile it in 64 bits without -fdefault-integer-8, the
so-called pointers will remain 32 bits wide. Problems could also arise from
its data structure handlers.

Therefore -fdefault-integer-8 is absolutely necessary.

Furthermore, MPI_SEND and MPI_RECV are called a dozen times in only one
source file (used for passing a data structure from one node to another), and
this has proved to work in every situation.

Not knowing which line is causing my segfault is annoying.

Regards,
Benjamin

2010/12/6 Gustavo Correa:

> Hi Benjamin
>
> I would just rebuild OpenMPI withOUT the compiler flags that change the
> standard sizes of "int" and "float" (do a "make distclean" first!), then
> recompile your program, and see how it goes.
> I don't think you are gaining anything by trying to change the standard
> "int/integer" and "real/float" sizes, and most likely they are inviting
> trouble, making things more confusing.
> Worst scenario, you will at least be sure that the bug is somewhere else,
> not on the mismatch of basic type sizes.
>
> If you need to pass 8-byte real buffers, use MPI_DOUBLE_PRECISION, or
> MPI_REAL8 in your (Fortran) MPI calls, and declare them in the Fortran code
> accordingly (double precision or real(kind=8)).
>
> If I remember right, there is no 8-byte integer support in the Fortran MPI
> bindings, only in the C bindings, but some OpenMPI expert could clarify
> this.
> Hence, if you are passing 8-byte integers in your MPI calls this may be
> also problematic.
>
> My two cents,
> Gus Correa
>
> On Dec 5, 2010, at 3:04 PM, Benjamin Toueg wrote:
>
> > Hi,
> >
> > First of all thanks for your insight !
> >
> > Do you get a corefile?
> > I don't get a core file, but I get a file called _FIL001. It doesn't
> > contain any debugging symbols. It's most likely a digested version of the
> > input file given to the executable : ./myexec < inputfile.
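The concern about addresses kept in INTEGER variables can be sketched in plain Python. `store_in_integer` is a hypothetical helper that keeps only the low bytes of a value, the way a fixed-width integer would, and the sample address is simply of the kind that appears in the thread's stack trace. This is an illustration of the arithmetic, not a description of DRAGON's actual data structure handlers.

```python
def store_in_integer(address, nbytes):
    """Keep only the low `nbytes` bytes of `address` (unsigned view)."""
    return address & ((1 << (8 * nbytes)) - 1)


# Assumption: a typical 64-bit user-space address, like those in the trace.
address = 0x7F52D2930410

in_4_bytes = store_in_integer(address, 4)
in_8_bytes = store_in_integer(address, 8)

print("original :", hex(address))
print("4-byte   :", hex(in_4_bytes))
print("8-byte   :", hex(in_8_bytes))
print("4-byte round trip ok:", in_4_bytes == address)
print("8-byte round trip ok:", in_8_bytes == address)
```

A 4-byte integer loses the upper half of the address, so dereferencing the stored value would point somewhere else entirely; an 8-byte integer preserves it. That is the reason given above for treating -fdefault-integer-8 as necessary for this code on a 64-bit build.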
Re: [OMPI users] Scalability issue
Hi Benjamin

I would just rebuild OpenMPI withOUT the compiler flags that change the
standard sizes of "int" and "float" (do a "make distclean" first!), then
recompile your program, and see how it goes.
I don't think you are gaining anything by trying to change the standard
"int/integer" and "real/float" sizes, and most likely they are inviting
trouble, making things more confusing.
Worst scenario, you will at least be sure that the bug is somewhere else,
not in the mismatch of basic type sizes.

If you need to pass 8-byte real buffers, use MPI_DOUBLE_PRECISION, or
MPI_REAL8 in your (Fortran) MPI calls, and declare them in the Fortran code
accordingly (double precision or real(kind=8)).

If I remember right, there is no 8-byte integer support in the Fortran MPI
bindings, only in the C bindings, but some OpenMPI expert could clarify this.
Hence, if you are passing 8-byte integers in your MPI calls this may also be
problematic.

My two cents,
Gus Correa

On Dec 5, 2010, at 3:04 PM, Benjamin Toueg wrote:

> Hi,
>
> First of all thanks for your insight !
>
> Do you get a corefile?
> I don't get a core file, but I get a file called _FIL001. It doesn't contain
> any debugging symbols. It's most likely a digested version of the input file
> given to the executable : ./myexec < inputfile.
>
> there's no line numbers printed in the stack trace
> I would love to see those, but even if I compile openmpi with -debug
> -mem-debug -mem-profile, they don't show up. I recompiled my sources to be
> sure to properly link them to the newly debugged version of openmpi. I
> assumed I didn't need to compile my own sources with -g option since it
> crashes in openmpi itself ? I didn't try to run mpiexec via gdb either, I
> guess it won't help since I already get the trace.
>
> the -fdefault-integer-8 options ought to be highly dangerous
> Thanks for noting. Indeed I had some issues with this option.
Re: [OMPI users] Scalability issue
Hi,

First of all, thanks for your insight!

*Do you get a corefile?*
I don't get a core file, but I get a file called _FIL001. It doesn't contain any debugging symbols. It's most likely a digested version of the input file given to the executable: ./myexec < inputfile.

*there's no line numbers printed in the stack trace*
I would love to see those, but even when I compile openmpi with -debug -mem-debug -mem-profile, they don't show up. I recompiled my sources to be sure they are properly linked against the new debug build of openmpi. I assumed I didn't need to compile my own sources with the -g option, since it crashes in openmpi itself? I didn't try to run mpiexec under gdb either; I guess it won't help since I already get the trace.

*the -fdefault-integer-8 options ought to be highly dangerous*
Thanks for noting. Indeed, I had some issues with this option. For instance, I have to declare some arguments as INTEGER*4, like RANK, SIZE and IERR in:

    CALL MPI_COMM_RANK(MPI_COMM_WORLD,RANK,IERR)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD,SIZE,IERR)

In your example "call MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, mpierr)" I checked that count is never bigger than 2000 (as you mentioned, it could flip to the negative). However, I haven't declared it as INTEGER*4 and I think I should.

When I said "I had to raise the number of data structures to be sent", I meant that I had to call MPI_SEND many more times, not that the buffers were bigger than before.

I'll get back to you with more info once I am able to fix my connection problem to the cluster...

Thanks,
Benjamin

2010/12/3 Martin Siegert

> Hi All,
>
> just to expand on this guess ...
>
> On Thu, Dec 02, 2010 at 05:40:53PM -0500, Gus Correa wrote:
> > Hi All
> >
> > I wonder if configuring OpenMPI while
> > forcing the default types to non-default values
> > (-fdefault-integer-8 -fdefault-real-8) might have
> > something to do with the segmentation fault.
> > Would this be effective, i.e., actually make the
> > sizes of MPI_INTEGER/MPI_INT and MPI_REAL/MPI_FLOAT bigger,
> > or just elusive?
>
> I believe what happens is that this mostly affects the fortran
> wrapper routines and the way Fortran variables are mapped to C:
>
>     MPI_INTEGER -> MPI_LONG
>     MPI_FLOAT -> MPI_DOUBLE
>     MPI_DOUBLE_PRECISION -> MPI_DOUBLE
>
> In that respect I believe that the -fdefault-real-8 option is harmless,
> i.e., it does the expected thing.
> But the -fdefault-integer-8 options ought to be highly dangerous:
> It works for integer variables that are used as "buffer" arguments
> in MPI statements, but I would assume that this does not work for
> "count" and similar arguments.
> Example:
>
>     integer, allocatable :: buf(:,:)
>     integer i, i2, count, dest, tag, mpierr
>
>     i = 32768
>     i2 = 2*i
>     allocate(buf(i,i2))
>     count = i*i2
>     buf = 1
>     call MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, mpierr)
>
> Now count is 2^31 which overflows a 32bit integer.
> The MPI standard requires that count is a 32bit integer, correct?
> Thus while buf gets the type MPI_LONG, count remains an int.
> Is this interpretation correct? If it is, then you are calling
> MPI_Send with a count argument of -2147483648.
> Which could result in a segmentation fault.
>
> Cheers,
> Martin
>
> --
> Martin Siegert
> Head, Research Computing
> WestGrid/ComputeCanada Site Lead
> IT Services                        phone: 778 782-4691
> Simon Fraser University            fax:   778 782-4242
> Burnaby, British Columbia          email: sieg...@sfu.ca
> Canada  V5A 1S6
>
> > There were some recent discussions here about MPI
> > limiting counts to MPI_INTEGER.
> > Since Benjamin said he "had to raise the number of data structures",
> > which eventually led to the error,
> > I wonder if he is inadvertently flipping to negative integer
> > side of the 32-bit universe (i.e. >= 2**31), as was reported here by
> > other list subscribers a few times.
> >
> > Anyway, segmentation fault can come from many different places,
> > this is just a guess.
> >
> > Gus Correa
> >
> > Jeff Squyres wrote:
> > >Do you get a corefile?
> > >
> > >It looks like you're calling MPI_RECV in Fortran and then it segv's. This is *likely* because you're either passing a bad parameter or your buffer isn't big enough. Can you double check all your parameters?
> > >
> > >Unfortunately, there's no line numbers printed in the stack trace, so it's not possible to tell exactly where in the ob1 PML it's dying (i.e., so we can't see exactly what it's doing to cause the segv).
> > >
> > >On Dec 2, 2010, at 9:36 AM, Benjamin Toueg wrote:
> > >
> > >>Hi,
> > >>
> > >>I am using DRAGON, a neutronic simulation code in FORTRAN77 that has its own data structures. I added a module to send these data structures using MPI_SEND / MPI_RECV, and everything worked perfectly for a while.
> > >>
> > >>Then I had to raise the number of data structures to be sent up to a point where my cluster
Re: [OMPI users] Scalability issue
Hi All,

just to expand on this guess ...

On Thu, Dec 02, 2010 at 05:40:53PM -0500, Gus Correa wrote:
> Hi All
>
> I wonder if configuring OpenMPI while
> forcing the default types to non-default values
> (-fdefault-integer-8 -fdefault-real-8) might have
> something to do with the segmentation fault.
> Would this be effective, i.e., actually make the
> sizes of MPI_INTEGER/MPI_INT and MPI_REAL/MPI_FLOAT bigger,
> or just elusive?

I believe what happens is that this mostly affects the fortran
wrapper routines and the way Fortran variables are mapped to C:

    MPI_INTEGER -> MPI_LONG
    MPI_FLOAT -> MPI_DOUBLE
    MPI_DOUBLE_PRECISION -> MPI_DOUBLE

In that respect I believe that the -fdefault-real-8 option is harmless,
i.e., it does the expected thing.
But the -fdefault-integer-8 options ought to be highly dangerous:
It works for integer variables that are used as "buffer" arguments
in MPI statements, but I would assume that this does not work for
"count" and similar arguments.
Example:

    integer, allocatable :: buf(:,:)
    integer i, i2, count, dest, tag, mpierr

    i = 32768
    i2 = 2*i
    allocate(buf(i,i2))
    count = i*i2
    buf = 1
    call MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, mpierr)

Now count is 2^31 which overflows a 32bit integer.
The MPI standard requires that count is a 32bit integer, correct?
Thus while buf gets the type MPI_LONG, count remains an int.
Is this interpretation correct? If it is, then you are calling
MPI_Send with a count argument of -2147483648.
Which could result in a segmentation fault.

Cheers,
Martin

--
Martin Siegert
Head, Research Computing
WestGrid/ComputeCanada Site Lead
IT Services                        phone: 778 782-4691
Simon Fraser University            fax:   778 782-4242
Burnaby, British Columbia          email: sieg...@sfu.ca
Canada  V5A 1S6

> There were some recent discussions here about MPI
> limiting counts to MPI_INTEGER.
> Since Benjamin said he "had to raise the number of data structures",
> which eventually led to the error,
> I wonder if he is inadvertently flipping to negative integer
> side of the 32-bit universe (i.e. >= 2**31), as was reported here by
> other list subscribers a few times.
>
> Anyway, segmentation fault can come from many different places,
> this is just a guess.
>
> Gus Correa
>
> Jeff Squyres wrote:
> >Do you get a corefile?
> >
> >It looks like you're calling MPI_RECV in Fortran and then it segv's. This
> >is *likely* because you're either passing a bad parameter or your buffer
> >isn't big enough. Can you double check all your parameters?
> >
> >Unfortunately, there's no line numbers printed in the stack trace, so it's
> >not possible to tell exactly where in the ob1 PML it's dying (i.e., so we
> >can't see exactly what it's doing to cause the segv).
> >
> >On Dec 2, 2010, at 9:36 AM, Benjamin Toueg wrote:
> >
> >>Hi,
> >>
> >>I am using DRAGON, a neutronic simulation code in FORTRAN77 that has its
> >>own data structures. I added a module to send these data structures using
> >>MPI_SEND / MPI_RECV, and everything worked perfectly for a while.
> >>
> >>Then I had to raise the number of data structures to be sent up to a point
> >>where my cluster has this bug:
> >>*** Process received signal ***
> >>Signal: Segmentation fault (11)
> >>Signal code: Address not mapped (1)
> >>Failing at address: 0x2c2579fc0
> >>[ 0] /lib/libpthread.so.0 [0x7f52d2930410]
> >>[ 1] /home/toueg/openmpi/lib/openmpi/mca_pml_ob1.so [0x7f52d153fe03]
> >>[ 2] /home/toueg/openmpi/lib/libmpi.so.0(PMPI_Recv+0x2d2) [0x7f52d3504a1e]
> >>[ 3] /home/toueg/openmpi/lib/libmpi_f77.so.0(pmpi_recv_+0x10e) [0x7f52d36cf9c6]
> >>
> >>How can I make this error more explicit?
> >>
> >>I use the following configuration of openmpi-1.4.3:
> >>./configure --enable-debug --prefix=/home/toueg/openmpi CXX=g++ CC=gcc F77=gfortran FC=gfortran FLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8" FCFLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8" --disable-mpi-f90
> >>
> >>Here is the output of mpif77 -v:
> >>mpif77 for 1.2.7 (release) of : 2005/11/04 11:54:51
> >>Driving: f77 -L/usr/lib/mpich-mpd/lib -v -lmpich-p4mpd -lpthread -lrt -lfrtbegin -lg2c -lm -shared-libgcc
> >>Reading specs from /usr/lib/gcc/x86_64-linux-gnu/3.4.6/specs
> >>Configured with: ../src/configure -v --enable-languages=c,c++,f77,pascal --prefix=/usr --libexecdir=/usr/lib --with-gxx-include-dir=/usr/include/c++/3.4 --enable-shared --with-system-zlib --enable-nls --without-included-gettext --program-suffix=-3.4 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug x86_64-linux-gnu
> >>Thread model: posix
> >>gcc version 3.4.6 (Debian 3.4.6-5)
> >> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/collect2 --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crt1.o
Re: [OMPI users] Scalability issue
Hi All

I wonder if configuring OpenMPI while
forcing the default types to non-default values
(-fdefault-integer-8 -fdefault-real-8) might have
something to do with the segmentation fault.
Would this be effective, i.e., actually make the
sizes of MPI_INTEGER/MPI_INT and MPI_REAL/MPI_FLOAT bigger,
or just elusive?

There were some recent discussions here about MPI
limiting counts to MPI_INTEGER.
Since Benjamin said he "had to raise the number of data structures",
which eventually led to the error,
I wonder if he is inadvertently flipping to negative integer
side of the 32-bit universe (i.e. >= 2**31), as was reported here by
other list subscribers a few times.

Anyway, segmentation fault can come from many different places,
this is just a guess.

Gus Correa

Jeff Squyres wrote:
>Do you get a corefile?
>
>It looks like you're calling MPI_RECV in Fortran and then it segv's. This is *likely* because you're either passing a bad parameter or your buffer isn't big enough. Can you double check all your parameters?
>
>Unfortunately, there's no line numbers printed in the stack trace, so it's not possible to tell exactly where in the ob1 PML it's dying (i.e., so we can't see exactly what it's doing to cause the segv).
>
>On Dec 2, 2010, at 9:36 AM, Benjamin Toueg wrote:
>
>>Hi,
>>
>>I am using DRAGON, a neutronic simulation code in FORTRAN77 that has its own data structures. I added a module to send these data structures using MPI_SEND / MPI_RECV, and everything worked perfectly for a while.
>>
>>Then I had to raise the number of data structures to be sent up to a point where my cluster has this bug:
>>*** Process received signal ***
>>Signal: Segmentation fault (11)
>>Signal code: Address not mapped (1)
>>Failing at address: 0x2c2579fc0
>>[ 0] /lib/libpthread.so.0 [0x7f52d2930410]
>>[ 1] /home/toueg/openmpi/lib/openmpi/mca_pml_ob1.so [0x7f52d153fe03]
>>[ 2] /home/toueg/openmpi/lib/libmpi.so.0(PMPI_Recv+0x2d2) [0x7f52d3504a1e]
>>[ 3] /home/toueg/openmpi/lib/libmpi_f77.so.0(pmpi_recv_+0x10e) [0x7f52d36cf9c6]
>>
>>How can I make this error more explicit?
>>
>>I use the following configuration of openmpi-1.4.3:
>>./configure --enable-debug --prefix=/home/toueg/openmpi CXX=g++ CC=gcc F77=gfortran FC=gfortran FLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8" FCFLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8" --disable-mpi-f90
>>
>>Here is the output of mpif77 -v:
>>mpif77 for 1.2.7 (release) of : 2005/11/04 11:54:51
>>Driving: f77 -L/usr/lib/mpich-mpd/lib -v -lmpich-p4mpd -lpthread -lrt -lfrtbegin -lg2c -lm -shared-libgcc
>>Reading specs from /usr/lib/gcc/x86_64-linux-gnu/3.4.6/specs
>>Configured with: ../src/configure -v --enable-languages=c,c++,f77,pascal --prefix=/usr --libexecdir=/usr/lib --with-gxx-include-dir=/usr/include/c++/3.4 --enable-shared --with-system-zlib --enable-nls --without-included-gettext --program-suffix=-3.4 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug x86_64-linux-gnu
>>Thread model: posix
>>gcc version 3.4.6 (Debian 3.4.6-5)
>> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/collect2 --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crt1.o /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crti.o /usr/lib/gcc/x86_64-linux-gnu/3.4.6/crtbegin.o -L/usr/lib/mpich-mpd/lib -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../.. -L/lib/../lib -L/usr/lib/../lib -lmpich-p4mpd -lpthread -lrt -lfrtbegin -lg2c -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc/x86_64-linux-gnu/3.4.6/crtend.o /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crtn.o
>>/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/libfrtbegin.a(frtbegin.o): In function `main':
>>(.text+0x1e): undefined reference to `MAIN__'
>>collect2: ld returned 1 exit status
>>
>>Thanks,
>>Benjamin
>>
>>___
>>users mailing list
>>us...@open-mpi.org
>>http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Scalability issue
Do you get a corefile?

It looks like you're calling MPI_RECV in Fortran and then it segv's. This is *likely* because you're either passing a bad parameter or your buffer isn't big enough. Can you double check all your parameters?

Unfortunately, there's no line numbers printed in the stack trace, so it's not possible to tell exactly where in the ob1 PML it's dying (i.e., so we can't see exactly what it's doing to cause the segv).

On Dec 2, 2010, at 9:36 AM, Benjamin Toueg wrote:

> Hi,
>
> I am using DRAGON, a neutronic simulation code in FORTRAN77 that has its own
> data structures. I added a module to send these data structures using
> MPI_SEND / MPI_RECV, and everything worked perfectly for a while.
>
> Then I had to raise the number of data structures to be sent up to a point
> where my cluster has this bug:
> *** Process received signal ***
> Signal: Segmentation fault (11)
> Signal code: Address not mapped (1)
> Failing at address: 0x2c2579fc0
> [ 0] /lib/libpthread.so.0 [0x7f52d2930410]
> [ 1] /home/toueg/openmpi/lib/openmpi/mca_pml_ob1.so [0x7f52d153fe03]
> [ 2] /home/toueg/openmpi/lib/libmpi.so.0(PMPI_Recv+0x2d2) [0x7f52d3504a1e]
> [ 3] /home/toueg/openmpi/lib/libmpi_f77.so.0(pmpi_recv_+0x10e) [0x7f52d36cf9c6]
>
> How can I make this error more explicit?
>
> I use the following configuration of openmpi-1.4.3:
> ./configure --enable-debug --prefix=/home/toueg/openmpi CXX=g++ CC=gcc F77=gfortran FC=gfortran FLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8" FCFLAGS="-m64 -fdefault-integer-8 -fdefault-real-8 -fdefault-double-8" --disable-mpi-f90
>
> Here is the output of mpif77 -v:
> mpif77 for 1.2.7 (release) of : 2005/11/04 11:54:51
> Driving: f77 -L/usr/lib/mpich-mpd/lib -v -lmpich-p4mpd -lpthread -lrt -lfrtbegin -lg2c -lm -shared-libgcc
> Reading specs from /usr/lib/gcc/x86_64-linux-gnu/3.4.6/specs
> Configured with: ../src/configure -v --enable-languages=c,c++,f77,pascal --prefix=/usr --libexecdir=/usr/lib --with-gxx-include-dir=/usr/include/c++/3.4 --enable-shared --with-system-zlib --enable-nls --without-included-gettext --program-suffix=-3.4 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug x86_64-linux-gnu
> Thread model: posix
> gcc version 3.4.6 (Debian 3.4.6-5)
> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/collect2 --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crt1.o /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crti.o /usr/lib/gcc/x86_64-linux-gnu/3.4.6/crtbegin.o -L/usr/lib/mpich-mpd/lib -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../.. -L/lib/../lib -L/usr/lib/../lib -lmpich-p4mpd -lpthread -lrt -lfrtbegin -lg2c -lm -lgcc_s -lgcc -lc -lgcc_s -lgcc /usr/lib/gcc/x86_64-linux-gnu/3.4.6/crtend.o /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crtn.o
> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/libfrtbegin.a(frtbegin.o): In function `main':
> (.text+0x1e): undefined reference to `MAIN__'
> collect2: ld returned 1 exit status
>
> Thanks,
> Benjamin

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/