Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet
On Wed, 4 Jan 2006, Jeff Squyres wrote:

> On Jan 4, 2006, at 2:08 PM, Anthony Chan wrote:
>
> >> Either my program quits without writing the logfile (and without
> >> complaining) or it crashes in MPI_Finalize. I get the message
> >> "33 additional processes aborted (not shown)".
> >
> > This is not an MPE error message. If the logging crashes in MPI_Finalize,
> > it usually means the merging of logging data from the child nodes fails.
> > Since you didn't get any MPE error messages, the cause of the crash isn't
> > one that MPE expects. Does anyone know if "33 additional processes
> > aborted (not shown)" is from Open MPI?
>
> Yes, it is. It is from mpirun telling you that 33 processes -- in
> addition to the one whose error message must have been shown above
> that -- aborted. So I'm guessing that 34 total processes aborted.
>
> Are you getting corefiles for these processes? (might need to check
> the limit of your coredumpsize)

Anthony, thanks for your suggestions. I tried the cpilog.c program with
logging and it also crashes when using more than 33 (!) processes. This
also happens when I let it run on a single node -- so it is not due to
some network settings. Actually it seems to depend on the Open MPI
version I use. With version 1.0.1 it works, and I have a logfile for 128
CPUs now. With the nightly tarball version 1.1a1r8626 (tuned collectives)
it does not work, and I get no corefile.

For 33 processes I get:
---
ckutzne@wes:~/mpe2test> mpirun -np 33 ./cpilog.x
Process 0 running on wes
Process 31 running on wes
...
Process 30 running on wes
Process 21 running on wes
pi is approximately 3.1415926535898770, Error is 0.0839
wall clock time = 0.449936
Writing logfile
Enabling the synchronization of the clocks...
Finished writing logfile ./cpilog.x.clog2.
---

For 34 processes I get something like (slightly shortened):
---
ckutzne@wes:~/mpe2test> mpirun -np 34 ./cpilog.x
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0 [0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa0]
[3] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x6d) [0x403f376d]
[4] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x2b) [0x403f442b]
[5] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x30) [0x403f34c0]
[6] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libmpi.so.0(PMPI_Allreduce+0x1bb) [0x40069d9b]
[7] func:./cpilog.x(CLOG_Sync_init+0x125) [0x805e84b]
[8] func:./cpilog.x(CLOG_Local_init+0x82) [0x805c4b6]
[9] func:./cpilog.x(MPE_Init_log+0x37) [0x8059fd3]
[10] func:./cpilog.x(MPI_Init+0x20) [0x805206d]
[11] func:./cpilog.x(main+0x43) [0x804f325]
[12] func:/lib/i686/libc.so.6(__libc_start_main+0xc7) [0x401eed17]
[13] func:./cpilog.x(free+0x49) [0x804f221]
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0 [0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa0]
[3] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x6d) [0x403f376d]
[4] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x2b) [0x403f442b]
[5] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x30) [0x403f34c0]
[6] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libmpi.so.0(PMPI_Allreduce+0x1bb) [0x40069d9b]
[7] func:./cpilog.x(CLOG_Sync_init+0x125) [0x805e84b]
[8] func:./cpilog.x(CLOG_Local_init+0x82) [0x805c4b6]
[9] func:./cpilog.x(MPE_Init_log+0x37) [0x8059fd3]
[10] func:./cpilog.x(MPI_Init+0x20) [0x805206d]
[11] func:./cpilog.x(main+0x43) [0x804f325]
[12] func:/lib/i686/libc.so.6(__libc_start_main+0xc7) [0x401eed17]
[13] func:./cpilog.x(free+0x49) [0x804f221]
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
...
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
mpirun noticed that job rank 1 with PID 9014 on node "localhost" exited
on signal 11.
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
...
2[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0 [0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa
---
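The backtrace points at the MPI_Allreduce that MPE's clock synchronization
(CLOG_Sync_init) issues during MPI_Init. In case it helps to narrow things
down, here is a bare-bones allreduce test along the same lines -- just a
sketch of mine, not MPE code, and the file name is only for illustration --
compiled with the mpicc from the failing Open MPI build. It exercises the
same collective path without MPE in the picture:

/* allreduce_min.c -- minimal sketch (not MPE code) that mimics the
 * MPI_Allreduce which CLOG_Sync_init performs right after MPI_Init.
 * Build with:  mpicc allreduce_min.c -o allreduce_min
 * Run with:    mpirun -np 34 ./allreduce_min                        */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int    rank, size;
    double in, out;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The crash above is inside the tuned reduce/allreduce decision code,
     * so a single allreduce over MPI_COMM_WORLD should be enough to hit it. */
    in = (double) rank;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    if (rank == 0)
        printf("allreduce max = %f on %d processes\n", out, size);

    MPI_Finalize();
    return 0;
}

If this also dies with 34 processes, the problem would be independent of MPE.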
Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet
> Looks like the problem is somewhere in the tuned collectives?
> Unfortunately I need a logfile with exactly those :(
>
> Carsten

I hope not. Carsten, can you send me your configure line (not the whole
log) and any other things you set in your .mca conf file? Is this with
the changed (custom) decision function or the standard one?

G.

> ---
> Dr. Carsten Kutzner
> Max Planck Institute for Biophysical Chemistry
> Theoretical and Computational Biophysics Department
> Am Fassberg 11, 37077 Goettingen, Germany
> Tel. +49-551-2012313, Fax: +49-551-2012302
> eMail ckut...@gwdg.de   http://www.gwdg.de/~ckutzne

Thanks, Graham.

--
Dr Graham E. Fagg        | Distributed, Parallel and Meta-Computing
Innovative Computing Lab.  PVM3.4, HARNESS, FT-MPI, SNIPE & Open MPI
Computer Science Dept    | Suite 203, 1122 Volunteer Blvd,
University of Tennessee  | Knoxville, Tennessee, USA. TN 37996-3450
Email: f...@cs.utk.edu   | Phone:+1(865)974-5790 | Fax:+1(865)974-8296
Broken complex systems are always derived from working simple systems
Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet
On Jan 6, 2006, at 8:13 AM, Carsten Kutzner wrote:

> Looks like the problem is somewhere in the tuned collectives?
> Unfortunately I need a logfile with exactly those :(

FWIW, we just activated these tuned collectives on the trunk (which will
eventually become the 1.1.x series; the tuned collectives don't exist in
the 1.0.x series). Until right before the holidays, the tuned collectives
were developed/tested only by a small subset of the Open MPI developers.
Whenever we turn on any new functionality in the code base, it's
inevitable that some bugs will be exposed by testing from other
developers/users -- so thanks for your patience!

We also just [re-]activated the stack-tracing facility, so that one can
get at least somewhat helpful information upon SIGFPE, SIGSEGV, and
SIGBUS -- that's where those stack traces are coming from. This also
does not exist in the 1.0.x series.

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet
On Fri, 6 Jan 2006, Graham E Fagg wrote:

> > Looks like the problem is somewhere in the tuned collectives?
> > Unfortunately I need a logfile with exactly those :(
> >
> > Carsten
>
> I hope not. Carsten, can you send me your configure line (not the whole
> log) and any other things you set in your .mca conf file? Is this with
> the changed (custom) decision function or the standard one?

I get the problems with the custom decision function as well as without.
Today I downloaded a clean tarball of 1.1a1r8626 and changed nothing. I
simply configure with

$ ./configure --prefix=/home/ckutzne/ompi1.1a1r8626-gcc331

then make all install, and that's it. I tried both gcc 3.3.1 and gcc
4.0.2. Then I install MPE from mpe2.tar.gz with

./configure MPI_CC=/home/ckutzne/ompi1.1a1r8626-gcc331/bin/mpicc \
            CC=/usr/bin/gcc \
            MPI_F77=/home/ckutzne/ompi1.1a1r8626-gcc331/bin/mpif77 \
            F77=/usr/bin/gcc \
            --prefix=/home/ckutzne/mpe2-ompi1.1a1r8626-gcc331
make
make install
make installcheck  --> ok!

I did not set anything in an .mca conf file (do I have to?)

Carsten
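P.S. In case it helps to isolate the component: as far as I understand,
per-user MCA settings go into ~/.openmpi/mca-params.conf as "key = value"
lines. If I have the component-selection syntax right, excluding the tuned
collectives would look roughly like this (a hypothetical example -- I have
not actually set this):

# ~/.openmpi/mca-params.conf (hypothetical example, nothing I actually use)
# Exclude the tuned collective component so the basic algorithms are chosen.
coll = ^tuned

The command-line equivalent should be something like
mpirun --mca coll ^tuned -np 34 ./cpilog.x.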
Re: [O-MPI users] Open MPI and gfortran
It looks like the files you sent were corrupted -- I didn't see the
information that I needed to see. Were you working on a case-insensitive
filesystem, perchance? I notice that our instructions on the web page
would probably result in this kind of corruption for case-insensitive
filesystems. I've updated the web page to make the instructions work on
case-insensitive filesystems -- can you go check the instructions, do it
again, and re-send? Sorry about that. :-\

Specifically, your config.log file had a big chunk of the beginning
missing -- it was overlaid with the output of configure (which our
instructions previously had you write to config.LOG, which could create
this kind of problem on a case-insensitive filesystem).

FWIW, I just built Open MPI 1.0.1 on a RHEL4U2 machine with gfortran
4.0.2; it correctly identified that there was no real(16) support and
didn't run into the problems that you are seeing (i.e., it didn't try to
make MPI F90 bindings with real(16) parameters). So I'm curious to see
your full logs to figure out why it's failing for you.

On Jan 5, 2006, at 7:49 PM, Jyh-Shyong Ho wrote:

> Dear Jeff,
>
> Thanks for your reply. I checked and confirmed that my gfortran is
> version 4.0.2, so the test program failed since it does not support
> real(16). The log files for configure and make are attached. It is
> strange, since I am able to use the same configuration and build
> Open MPI successfully on another SuSE 10 / AMD64 computer. Something
> must be missing.
>
> Best regards
>
> Jyh-Shyong Ho, Ph.D.
> Research Scientist
> National Center for High Performance Computing
> Hsinchu, Taiwan, ROC
>
> Jeff Squyres wrote:
>
> > What concerns me, though, is that Open MPI shouldn't have tried to
> > compile support for real(16) in the first place -- our configure
> > script should have detected that the compiler didn't support real(16)
> > (which it at least partially did, because the constants seem to have
> > a value of -1), and then the generated F90 bindings should not have
> > included support for it. This is why I'd like to see the configure
> > output (etc.) and see what happened.
> >
> > On Jan 5, 2006, at 12:59 PM, rod mach wrote:
> >
> > > Hi. To my knowledge you must be using gfortran 4.1, not 4.0, to get
> > > access to large kind support like real(16). You can verify this by
> > > trying to compile the following code with gfortran. It compiles
> > > under gfortran 4.1, but I don't believe it will work under 4.0,
> > > since this support was added in 4.1.
> > >
> > > program test
> > >   real(16) :: x, y
> > >   y = 4.0_16
> > >   x = sqrt(y)
> > >   print *, x
> > > end
> > >
> > > --Rod
> > >
> > > --
> > > Rod Mach
> > > Absoft HPC Technical Director
> > > www.absoft.com
> > >
> > > > Error: Kind -1 not supported for type REAL at (1)
> > > > In file mpi_address_f90.f90:331
> > > > make[2]: Leaving directory `/work/source/openmpi-1.0.1/ompi/mpi'
> > > > make[1]: *** [all-recursive] Error 1
> > > > make[1]: Leaving directory `/work/source/openmpi-1.0.1/ompi'
> > > > make: *** [all-recursive] Error 1
> > > >
> > > > I used the following variables:
> > > > FC=gfortran CC=gcc CXX=g++ F77=gfortran
> > > >
> > > > Any hint on how to solve this problem? Thanks.
> > > >
> > > > Jyh-Shyong Ho, Ph.D.
> > > > Research Scientist
> > > > National Center for High Performance Computing
> > > > Hsinchu, Taiwan, ROC

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
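P.S. In the meantime, if you just need a working build and don't need the
Fortran 90 bindings, one possible workaround (assuming I have the option
name right; the install prefix below is only a placeholder) is to disable
the F90 bindings at configure time, which sidesteps the real(16) detection
entirely:

# Sketch of a workaround build, not a verified fix for the detection bug.
# --disable-mpi-f90 skips the Fortran 90 bindings; /opt/openmpi-1.0.1 is
# just a placeholder prefix.
./configure FC=gfortran F77=gfortran CC=gcc CXX=g++ \
            --disable-mpi-f90 \
            --prefix=/opt/openmpi-1.0.1
make all install

The C, C++, and Fortran 77 bindings should be unaffected; only the F90
"mpi" module would be missing from such a build.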
Re: [O-MPI users] Open MPI and gfortran
Sorry, here are the files again. Something went wrong when I compressed
these files.

Jyh-Shyong Ho

Jeff Squyres wrote:

> It looks like the files you sent were corrupted -- I didn't see the
> information that I needed to see. Were you working on a case-insensitive
> filesystem, perchance? [...]

Attachments: config.log.tar.bz2, make.log.tar.bz2