A quick update on this issue: I've attached gdb to the crashing
application and got the following:

-----
Program received signal SIGSEGV, Segmentation fault.
mca_pml_ob1_send_request_put (sendreq=0x130c480, btl=0xc49850,
hdr=0xd10e60) at pml_ob1_sendreq.c:1231
1231    pml_ob1_sendreq.c: No such file or directory.
        in pml_ob1_sendreq.c
(gdb) bt
#0  mca_pml_ob1_send_request_put (sendreq=0x130c480, btl=0xc49850,
hdr=0xd10e60) at pml_ob1_sendreq.c:1231
#1  0x00007fc55bf31693 in mca_btl_tcp_endpoint_recv_handler (sd=<value
optimized out>, flags=<value optimized out>, user=<value optimized
out>) at btl_tcp_endpoint.c:718
#2  0x00007fc55fff7de4 in event_process_active (base=0xc1daf0,
flags=2) at event.c:651
#3  opal_event_base_loop (base=0xc1daf0, flags=2) at event.c:823
#4  0x00007fc55ffe9ff1 in opal_progress () at runtime/opal_progress.c:189
#5  0x00007fc55c9d7115 in opal_condition_wait (addr=<value optimized
out>, count=<value optimized out>, datatype=<value optimized out>,
src=<value optimized out>, tag=<value optimized out>,
    comm=<value optimized out>, status=0xcc6100) at
../../../../opal/threads/condition.h:99
#6  ompi_request_wait_completion (addr=<value optimized out>,
count=<value optimized out>, datatype=<value optimized out>,
src=<value optimized out>, tag=<value optimized out>,
    comm=<value optimized out>, status=0xcc6100) at
../../../../ompi/request/request.h:375
#7  mca_pml_ob1_recv (addr=<value optimized out>, count=<value
optimized out>, datatype=<value optimized out>, src=<value optimized
out>, tag=<value optimized out>, comm=<value optimized out>,
    status=0xcc6100) at pml_ob1_irecv.c:104
#8  0x00007fc560511260 in PMPI_Recv (buf=0x0, count=12884048,
type=0xd10410, source=-1, tag=0, comm=0xd0daa0, status=0xcc6100) at
precv.c:75
#9  0x000000000049cc43 in BI_Srecv ()
#10 0x000000000049c555 in BI_IdringBR ()
#11 0x0000000000495ba1 in ilp64_Cdgebr2d ()
#12 0x000000000047ffa0 in Cdgebr2d ()
#13 0x00007fc5621da8e1 in PB_CInV2 () from
/home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
#14 0x00007fc56220289c in PB_CpgemmAB () from
/home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
#15 0x00007fc5622b28fd in pdgemm_ () from
/home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so
-----

So it looks like the line responsible for the segmentation fault is:
mca_bml_base_endpoint_t *bml_endpoint = sendreq->req_endpoint;

I repeated this several times; it always crashes on the same line.
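
In case it's useful, next time it crashes I can inspect the request in
that frame, along these lines (field names taken from the source line
above; I'm not sure yet what a "healthy" value should look like):

-----
(gdb) frame 0
(gdb) print sendreq
(gdb) print *sendreq
(gdb) print sendreq->req_endpoint
-----

If sendreq->req_endpoint points to freed or never-initialised memory,
that would match the "Address not mapped" failure from my previous mail.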

I have no idea what to do with this. Again, any help would be appreciated.

Thanks,
Grzegorz Maj



2010/12/6 Grzegorz Maj <ma...@wp.pl>:
> Hi,
> I'm using MKL ScaLAPACK in my project. Recently, I was trying to run
> my application on a new set of nodes. Unfortunately, when I try to
> execute more than about 20 processes, I get a segmentation fault.
>
> [compn7:03552] *** Process received signal ***
> [compn7:03552] Signal: Segmentation fault (11)
> [compn7:03552] Signal code: Address not mapped (1)
> [compn7:03552] Failing at address: 0x20b2e68
> [compn7:03552] [ 0] /lib64/libpthread.so.0(+0xf3c0) [0x7f46e0fc33c0]
> [compn7:03552] [ 1]
> /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0xd577)
> [0x7f46dd093577]
> [compn7:03552] [ 2]
> /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x5b4c)
> [0x7f46dc5edb4c]
> [compn7:03552] [ 3]
> /home/gmaj/lib/openmpi/lib/libopen-pal.so.0(+0x1dbe8) [0x7f46e0679be8]
> [compn7:03552] [ 4]
> /home/gmaj/lib/openmpi/lib/libopen-pal.so.0(opal_progress+0xa1)
> [0x7f46e066dbf1]
> [compn7:03552] [ 5]
> /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x5945)
> [0x7f46dd08b945]
> [compn7:03552] [ 6]
> /home/gmaj/lib/openmpi/lib/libmpi.so.0(MPI_Send+0x6a) [0x7f46e0b4f10a]
> [compn7:03552] [ 7] /home/gmaj/matrix/matrix(BI_Ssend+0x21) [0x49cc11]
> [compn7:03552] [ 8] /home/gmaj/matrix/matrix(BI_IdringBR+0x79) [0x49c579]
> [compn7:03552] [ 9] /home/gmaj/matrix/matrix(ilp64_Cdgebr2d+0x221) [0x495bb1]
> [compn7:03552] [10] /home/gmaj/matrix/matrix(Cdgebr2d+0xd0) [0x47ffb0]
> [compn7:03552] [11]
> /home/gmaj/lib/intel_mkl/current/lib/em64t/libmkl_scalapack_ilp64.so(PB_CInV2+0x1304)
> [0x7f46e27f5124]
> [compn7:03552] *** End of error message ***
>
> The error appears during a ScaLAPACK computation; my processes do
> some MPI communication before it occurs.
>
> I found out that by modifying the btl_tcp_eager_limit and
> btl_tcp_max_send_size parameters I can run more processes - the
> smaller those values are, the more processes I can run. Unfortunately,
> with this method I've only managed to run up to 30 processes, which is
> still far too small.
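> For reference, I'm lowering those limits on the mpirun command line,
> roughly like this (the values here are just an example):
>
>   mpirun --mca btl_tcp_eager_limit 32768 \
>          --mca btl_tcp_max_send_size 32768 \
>          -np 30 /home/gmaj/matrix/matrix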
>
> Another clue may be what valgrind reports:
>
> ==3894== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==3894==    at 0x82D009B: writev (in /lib64/libc-2.12.90.so)
> ==3894==    by 0xBA2136D: mca_btl_tcp_frag_send (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
> ==3894==    by 0xBA203D0: mca_btl_tcp_endpoint_send (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
> ==3894==    by 0xB003583: mca_pml_ob1_send_request_start_rdma (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
> ==3894==    by 0xAFFA7C9: mca_pml_ob1_send (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
> ==3894==    by 0x6D4B109: PMPI_Send (in 
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0x49CC10: BI_Ssend (in /home/gmaj/matrix/matrix)
> ==3894==    by 0x49C578: BI_IdringBR (in /home/gmaj/matrix/matrix)
> ==3894==    by 0x495BB0: ilp64_Cdgebr2d (in /home/gmaj/matrix/matrix)
> ==3894==    by 0x47FFAF: Cdgebr2d (in /home/gmaj/matrix/matrix)
> ==3894==    by 0x51B38E0: PB_CInV2 (in
> /home/gmaj/lib/intel_mkl/10.2.6/lib/em64t/libmkl_scalapack_ilp64.so)
> ==3894==    by 0x51DB89B: PB_CpgemmAB (in
> /home/gmaj/lib/intel_mkl/10.2.6/lib/em64t/libmkl_scalapack_ilp64.so)
> ==3894==  Address 0xadecdce is 461,886 bytes inside a block of size
> 527,544 alloc'd
> ==3894==    at 0x4C2615D: malloc (vg_replace_malloc.c:195)
> ==3894==    by 0x6D0BBA3: ompi_free_list_grow (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0xBA1E1A4: mca_btl_tcp_component_init (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_btl_tcp.so)
> ==3894==    by 0x6D5C909: mca_btl_base_select (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0xB40E950: mca_bml_r2_component_init (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_bml_r2.so)
> ==3894==    by 0x6D5C07E: mca_bml_base_init (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0xAFF8A0E: mca_pml_ob1_component_init (in
> /home/gmaj/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
> ==3894==    by 0x6D663B2: mca_pml_base_select (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0x6D25D20: ompi_mpi_init (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0x6D45987: PMPI_Init_thread (in
> /home/gmaj/lib/openmpi/lib/libmpi.so.0)
> ==3894==    by 0x42490A: MPI::Init_thread(int&, char**&, int)
> (functions_inln.h:150)
> ==3894==    by 0x41F483: main (matrix.cpp:83)
>
> I've tried configuring Open MPI with the --without-memory-manager
> option, but it didn't help.
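> For completeness, the configure line I used was roughly (prefix
> matching the paths in the traces above):
>
>   ./configure --prefix=/home/gmaj/lib/openmpi --without-memory-manager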
>
> I can successfully run exactly the same application on other
> machines, even with more than 800 nodes.
>
> Does anyone have any idea how to further debug this issue? Any help
> would be appreciated.
>
> Thanks,
> Grzegorz Maj
>
