Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet

2006-01-06 Thread Carsten Kutzner
On Wed, 4 Jan 2006, Jeff Squyres wrote:

> On Jan 4, 2006, at 2:08 PM, Anthony Chan wrote:
>
> >> Either my program quits without writing the logfile (and without
> >> complaining) or it crashes in MPI_Finalize. I get the message
> >> "33 additional processes aborted (not shown)".
> >
> > This is not an MPE error message.  If the logging crashes in
> > MPI_Finalize, it usually means the merging of logging data from
> > child nodes fails.  Since you didn't get any MPE error messages,
> > the cause of the crash isn't one that MPE expects.  Does anyone
> > know if "33 additional processes aborted (not shown)" is from
> > Open MPI?
>
> Yes, it is.  It is from mpirun telling you that 33 processes -- in
> addition to the one whose error message it must have shown above --
> aborted.  So I'm guessing that 34 total processes aborted.
>
> Are you getting corefiles for these processes?  (might need to check
> the limit of your coredumpsize)

Anthony, thanks for your suggestions. I tried the cpilog.c program with
logging and it also crashes when using more than 33 (!) processes. This
also happens when I let it run on a single node - so it is not due to
some network settings.

Actually it seems to depend on the Open MPI version I use. With version
1.0.1 it works, and I have a logfile for 128 CPUs now. With the nightly
tarball version 1.1a1r8626 (tuned collectives) it does not work (and I
get no corefile).

For 33 processes I get:
---
ckutzne@wes:~/mpe2test> mpirun -np 33 ./cpilog.x
Process 0 running on wes
Process 31 running on wes
...
Process 30 running on wes
Process 21 running on wes
pi is approximately 3.1415926535898770, Error is 0.0839
wall clock time = 0.449936
Writing logfile
Enabling the synchronization of the clocks...
Finished writing logfile ./cpilog.x.clog2.
---

For 34 processes I get something like (slightly shortened):
---
ckutzne@wes:~/mpe2test> mpirun -np 34 ./cpilog.x
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0 [0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa0]
[3]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x6d)
[0x403f376d]
[4]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x2b)
[0x403f442b]
[5]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x30)
[0x403f34c0]
[6]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libmpi.so.0(PMPI_Allreduce+0x1bb)
[0x40069d9b]
[7] func:./cpilog.x(CLOG_Sync_init+0x125) [0x805e84b]
[8] func:./cpilog.x(CLOG_Local_init+0x82) [0x805c4b6]
[9] func:./cpilog.x(MPE_Init_log+0x37) [0x8059fd3]
[10] func:./cpilog.x(MPI_Init+0x20) [0x805206d]
[11] func:./cpilog.x(main+0x43) [0x804f325]
[12] func:/lib/i686/libc.so.6(__libc_start_main+0xc7) [0x401eed17]
[13] func:./cpilog.x(free+0x49) [0x804f221]
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0 [0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa0]
[3]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_dec_fixed+0x6d)
[0x403f376d]
[4]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_nonoverlapping+0x2b)
[0x403f442b]
[5]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x30)
[0x403f34c0]
[6]
func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libmpi.so.0(PMPI_Allreduce+0x1bb)
[0x40069d9b]
[7] func:./cpilog.x(CLOG_Sync_init+0x125) [0x805e84b]
[8] func:./cpilog.x(CLOG_Local_init+0x82) [0x805c4b6]
[9] func:./cpilog.x(MPE_Init_log+0x37) [0x8059fd3]
[10] func:./cpilog.x(MPI_Init+0x20) [0x805206d]
[11] func:./cpilog.x(main+0x43) [0x804f325]
[12] func:/lib/i686/libc.so.6(__libc_start_main+0xc7) [0x401eed17]
[13] func:./cpilog.x(free+0x49) [0x804f221]
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
...
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
mpirun noticed that job rank 1 with PID 9014 on node "localhost" exited on
signal 11.
*** End of error message ***
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x88
*** End of error message ***
...
2[0] func:/home/ckutzne/ompi1.1a1r8626-gcc331/lib/libopal.so.0
[0x40103579]
[1] func:/lib/i686/libpthread.so.0 [0x40193a05]
[2] func:/lib/i686/libc.so.6 [0x40202aa
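
A possible next step on my side (just a sketch -- the "basic"/"self"
component names and the --mca selection syntax are assumptions from the
Open MPI docs, not something I have verified against this nightly build)
would be to enable corefiles as Jeff suggested and re-run with the tuned
collectives switched off, to see whether that component is really the
trigger:

  # allow corefiles, then force the non-tuned collective components
  ulimit -c unlimited
  mpirun --mca coll basic,self -np 34 ./cpilog.x

If that run survives, it would point even more strongly at mca_coll_tuned.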

Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet

2006-01-06 Thread Graham E Fagg

> Looks like the problem is somewhere in the tuned collectives?
> Unfortunately I need a logfile with exactly those :(
>
>   Carsten


I hope not. Carsten, can you send me your configure line (not the whole
log) and anything else you set in your .mca conf file? Is this with
the changed (custom) decision function or the standard one?


G.





---
Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckut...@gwdg.de
http://www.gwdg.de/~ckutzne





Thanks,
Graham.
--
Dr Graham E. Fagg   | Distributed, Parallel and Meta-Computing
Innovative Computing Lab. PVM3.4, HARNESS, FT-MPI, SNIPE & Open MPI
Computer Science Dept   | Suite 203, 1122 Volunteer Blvd,
University of Tennessee | Knoxville, Tennessee, USA. TN 37996-3450
Email: f...@cs.utk.edu  | Phone:+1(865)974-5790 | Fax:+1(865)974-8296
Broken complex systems are always derived from working simple systems
--


Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet

2006-01-06 Thread Jeff Squyres

On Jan 6, 2006, at 8:13 AM, Carsten Kutzner wrote:


> Looks like the problem is somewhere in the tuned collectives?
> Unfortunately I need a logfile with exactly those :(


FWIW, we just activated these tuned collectives on the trunk (which  
will eventually become the 1.1.x series; the tuned collectives don't  
exist in the 1.0.x series).


Until right before the holidays, the tuned collectives were
developed/tested only by a small subset of the Open MPI developers.
Whenever we turn on any new functionality in the code base, it's
inevitable that some bugs will be exposed once other developers/users
start testing it -- so thanks for your patience!


We also just [re-]activated the stack-tracing facility so that one  
can get some at-least-somewhat helpful information upon SIGFPE,  
SIGSEGV, and SIGBUS -- that's where those stack traces are coming  
from.  This also does not exist in the 1.0.x series.


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




Re: [O-MPI users] Performance of all-to-all on Gbit Ethernet

2006-01-06 Thread Carsten Kutzner
On Fri, 6 Jan 2006, Graham E Fagg wrote:

> > Looks like the problem is somewhere in the tuned collectives?
> > Unfortunately I need a logfile with exactly those :(
> >
> >   Carsten
>
> I hope not. Carsten, can you send me your configure line (not the whole
> log) and anything else you set in your .mca conf file? Is this with
> the changed (custom) decision function or the standard one?

I get the problems with the custom decision function as well as without
it. Today I downloaded a clean tarball of 1.1a1r8626 and changed nothing.
I simply
configure with

$ ./configure --prefix=/home/ckutzne/ompi1.1a1r8626-gcc331

Then make all install and that's it. I tried both gcc 3.3.1 and gcc 4.0.2.

Then I install MPE from mpe2.tar.gz with
 ./configure MPI_CC=/home/ckutzne/ompi1.1a1r8626-gcc331/bin/mpicc \
  CC=/usr/bin/gcc \
  MPI_F77=/home/ckutzne/ompi1.1a1r8626-gcc331/bin/mpif77 \
  F77=/usr/bin/gcc \
  --prefix=/home/ckutzne/mpe2-ompi1.1a1r8626-gcc331
make
make install
make installcheck --> ok!

I did not set anything in an .mca conf file (do I have to?)
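
If it turns out that I should, my plan (an untested sketch -- the file
location and the "coll" parameter name are assumptions on my part from
the Open MPI docs, so please correct me) would be something like:

  # pin the collective components in a per-user MCA parameter file
  mkdir -p $HOME/.openmpi
  echo "coll = basic,self" >> $HOME/.openmpi/mca-params.conf

  # check which coll components/parameters are actually seen
  ompi_info --param coll all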

  Carsten



Re: [O-MPI users] Open MPI and gfortran

2006-01-06 Thread Jeff Squyres
It looks like the files you sent were corrupted -- I didn't see the
information that I needed to see.  Were you working on a case-insensitive
filesystem, perchance?  I notice that our instructions on the web page
would probably result in this kind of corruption for case-insensitive
filesystems.  I've updated the web page to make the instructions work on
case-insensitive filesystems -- can you go check the instructions, do it
again and re-send?  Sorry about that.  :-\


Specifically, your config.log file had a big chunk of the beginning  
missing -- it was overlaid with the output of configure (which our  
instructions previously had you write to config.LOG, and could create  
this kind of problem on a case-insensitive filesystem).


FWIW, I just built Open MPI 1.0.1 on a RHEL4U2 machine with gfortran  
4.0.2; it correctly identified that there was no real(16) support and  
didn't run into the problems that you are seeing (i.e., it didn't try  
to make MPI F90 bindings with real(16) parameters).  So I'm curious  
to see your full logs to figure out why it's failing for you.
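
In the meantime, an independent check you could do locally (a sketch
only -- I'm assuming you save Rod's test program, quoted below, as
kindtest.f90) is to see which gfortran gets picked up and whether it
accepts real(16) at all:

  gfortran --version
  gfortran -o kindtest kindtest.f90 && ./kindtest

On a gfortran without real(16) support, the second command should fail
with a similar "Kind ... not supported for type REAL" message.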




On Jan 5, 2006, at 7:49 PM, Jyh-Shyong Ho wrote:


Dear Jeff,

Thanks for your reply. I checked and confirmed that my gfortran is
version 4.0.2, so the test program failed since it does not support
real(16).  The log files for configure and make are attached.
It is strange since I am able to use the same configuration and build
Open MPI successfully on another SuSE10/AMD 64 computer. Something must
be missing.

Best regards

Jyh-Shyong Ho, Ph.D.
Research Scientist
National Center for  High Performance Computing
Hsinchu, Taiwan, ROC

Jeff Squyres wrote:

What concerns me, though, is that Open MPI shouldn't have tried to
compile support for real(16) in the first place -- our configure script
should have detected that the compiler didn't support real(16) (which,
it at least partially did, because the constants seem to have a value
of -1) and then the generated F90 bindings should not have included
support for it.  This is why I'd like to see the configure output
(etc.) and see what happened.

On Jan 5, 2006, at 12:59 PM, rod mach wrote:

Hi.  To my knowledge you must be using gfortran 4.1, not 4.0, to get
access to large kind support like real(16).  You can verify by trying
to compile the following code with gfortran.  This compiles under
gfortran 4.1, but I don't believe it will work under 4.0 since this
support was added in 4.1.

  program test
  real(16) :: x, y
  y = 4.0_16
  x = sqrt(y)
  print *, x
  end

--Rod
--
Rod Mach
Absoft HPC Technical Director
www.absoft.com

Error: Kind -1 not supported for type REAL at (1)
In file mpi_address_f90.f90:331
make[2]: Leaving directory `/work/source/openmpi-1.0.1/ompi/mpi'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/work/source/openmpi-1.0.1/ompi'
make: *** [all-recursive] Error 1

I used the following variables: FC=gfortran CC=gcc CXX=g++ F77=gfortran

Any hint on how to solve this problem?  Thanks.

Jyh-Shyong Ho, Ph.D.
Research Scientist
National Center for High Performance Computing
Hsinchu, Taiwan, ROC







--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




Re: [O-MPI users] Open MPI and gfortran

2006-01-06 Thread Jyh-Shyong Ho
Sorry,  here are the files again. Something went wrong when I compressed 
these files.


Jyh-Shyong Ho





config.log.tar.bz2
Description: Binary data


make.log.tar.bz2
Description: Binary data