Hi,

@Gus: I don't use any special flags for the installed OpenMPI version. In fact, for this mail I used a freshly installed OpenMPI version built with only the --enable-debug flag.

As far as I can tell from stepping through with the debugger, the problem happens in btl_openib_component_init:

#0  btl_openib_component_init (num_btl_modules=0x7fff7a8593e8, enable_progress_threads=false, enable_mpi_threads=false)
    at /home/kraused/ompi/openmpi-1.4/ompi/mca/btl/openib/btl_openib_component.c:2099
#1  0x00002b9eb6f65679 in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false)
    at /home/kraused/ompi/openmpi-1.4/ompi/mca/btl/base/btl_base_select.c:110
#2  0x00002aaad007d933 in mca_bml_r2_component_init (priority=0x7fff7a8594b4, enable_progress_threads=false, enable_mpi_threads=false)
    at /home/kraused/ompi/openmpi-1.4/ompi/mca/bml/r2/bml_r2_component.c:86
#3  0x00002b9eb6f64a80 in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false)
    at /home/kraused/ompi/openmpi-1.4/ompi/mca/bml/base/bml_base_init.c:69
#4  0x00002aaacfc5580a in mca_pml_ob1_component_init (priority=0x7fff7a8595d0, enable_progress_threads=false, enable_mpi_threads=false)
    at /home/kraused/ompi/openmpi-1.4/ompi/mca/pml/ob1/pml_ob1_component.c:168
#5  0x00002b9eb6f787a4 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false)
    at /home/kraused/ompi/openmpi-1.4/ompi/mca/pml/base/pml_base_select.c:126
#6  0x00002b9eb6ef4989 in ompi_mpi_init (argc=1, argv=0x7fff7a859af8, requested=0, provided=0x7fff7a859858)
    at /home/kraused/ompi/openmpi-1.4/ompi/runtime/ompi_mpi_init.c:534
#7  0x00002b9eb6f33bb2 in PMPI_Init (argc=0x7fff7a8598cc, argv=0x7fff7a8598c0)
    at /home/kraused/ompi/openmpi-1.4/ompi/mpi/c/profile/pinit.c:80
#8  0x00000000004007e6 in main (argc=1, argv=0x7fff7a859af8)
    at /home/kraused/blas.c:20

When I set a breakpoint in btl_openib_component_init and continue from there, I get a SIGILL, but the backtrace is meaningless to me:

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 0x40901940 (LWP 21183)]
0x00007fff23b2a7c0 in ?? ()
(gdb) bt
#0  0x00007fff23b2a7c0 in ?? ()
#1  0x0000003df9c06307 in start_thread () from /lib64/libpthread.so.0
#2  0x0000003df90d1ded in clone () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()


The bad thing is: if I single-step through btl_openib_component_init until right after the call to ompi_btl_openib_fd_init and continue from there, the program finishes.

More precisely: once I have stepped past the pthread_create call at line 537 in btl_openib_fd.c, I can continue and the program finishes. My conjecture is that gdb influences the threading here, and that this is why the problem doesn't show up.

I'm interested in digging further, but I need some advice/hints on where to go from here.

Thanks,
Dorian


On 1/19/10 1:29 PM, Jeff Squyres wrote:
Can you get a core dump, or otherwise see exactly where the seg fault is 
occurring?

On Jan 18, 2010, at 8:34 AM, Dorian Krause wrote:

Hi Eloi,
Do the segmentation faults you're facing also happen in a sequential
environment (i.e. not linked against openmpi libraries)?
No, without MPI everything works fine. Also, linking against mvapich
doesn't give any errors. I think there is a problem with GotoBLAS and
the shared library infrastructure of OpenMPI. The code doesn't even get
to the point of executing the gemm operation.

Have you already informed Kazushige Goto (developer of GotoBLAS)?
Not yet. Since the problem only happens with openmpi and the BLAS
(stand-alone) seems to work, I thought the openmpi mailing list would be
the better place to discuss this (to get a grasp of what the error could
be before going to the GotoBLAS mailing list).

Regards,
Eloi

PS: Could you post your Makefile.rule here so that we can check the
different compilation options chosen?
I didn't make any changes to Makefile.rule. This is the content of
Makefile.conf:

OSNAME=Linux
ARCH=x86_64
C_COMPILER=GCC
BINARY32=
BINARY64=1
CEXTRALIB=-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64
-L/lib/../lib64 -L/usr/lib/../lib64  -lc
F_COMPILER=GFORTRAN
FC=gfortran
BU=_
FEXTRALIB=-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2
-L/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64
-L/lib/../lib64 -L/usr/lib/../lib64  -lgfortran -lm -lgfortran -lm -lc
CORE=BARCELONA
LIBCORE=barcelona
NUM_CORES=8
HAVE_MMX=1
HAVE_SSE=1
HAVE_SSE2=1
HAVE_SSE3=1
HAVE_SSE4A=1
HAVE_3DNOWEX=1
HAVE_3DNOW=1
MAKE += -j 8
SGEMM_UNROLL_M=8
SGEMM_UNROLL_N=4
DGEMM_UNROLL_M=4
DGEMM_UNROLL_N=4
QGEMM_UNROLL_M=2
QGEMM_UNROLL_N=2
CGEMM_UNROLL_M=4
CGEMM_UNROLL_N=2
ZGEMM_UNROLL_M=2
ZGEMM_UNROLL_N=2
XGEMM_UNROLL_M=1
XGEMM_UNROLL_N=1


Thanks,
Dorian

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
