Dear Rolf,

Thank you for looking into this.
Here is the complete backtrace for a run using 2 GPUs on the same node:

(cuda-gdb) bt
#0  0x00007ffff711d885 in raise () from /lib64/libc.so.6
#1  0x00007ffff711f065 in abort () from /lib64/libc.so.6
#2  0x00007ffff0387b8d in psmi_errhandler_psm (ep=<value optimized out>,
    err=PSM_INTERNAL_ERR, error_string=<value optimized out>,
    token=<value optimized out>) at psm_error.c:76
#3  0x00007ffff0387df1 in psmi_handle_error (ep=0xfffffffffffffffe,
    error=PSM_INTERNAL_ERR, buf=<value optimized out>) at psm_error.c:154
#4  0x00007ffff0382f6a in psmi_am_mq_handler_rtsmatch (toki=0x7fffffffc6a0,
    args=0x7fffed0461d0, narg=<value optimized out>,
    buf=<value optimized out>, len=<value optimized out>) at ptl.c:200
#5  0x00007ffff037a832 in process_packet (ptl=0x737818, pkt=0x7fffed0461c0,
    isreq=<value optimized out>) at am_reqrep_shmem.c:2164
#6  0x00007ffff037d90f in amsh_poll_internal_inner (ptl=0x737818, replyonly=0)
    at am_reqrep_shmem.c:1756
#7  amsh_poll (ptl=0x737818, replyonly=0) at am_reqrep_shmem.c:1810
#8  0x00007ffff03a0329 in __psmi_poll_internal (ep=0x737538,
    poll_amsh=<value optimized out>) at psm.c:465
#9  0x00007ffff039f0af in psmi_mq_wait_inner (ireq=0x7fffffffc848)
    at psm_mq.c:299
#10 psmi_mq_wait_internal (ireq=0x7fffffffc848) at psm_mq.c:334
#11 0x00007ffff037db21 in amsh_mq_send_inner (ptl=0x737818,
    mq=<value optimized out>, epaddr=0x6eb418, flags=<value optimized out>,
    tag=844424930131968, ubuf=0x1308350000, len=32768)
    at am_reqrep_shmem.c:2339
#12 amsh_mq_send (ptl=0x737818, mq=<value optimized out>, epaddr=0x6eb418,
    flags=<value optimized out>, tag=844424930131968, ubuf=0x1308350000,
    len=32768) at am_reqrep_shmem.c:2387
#13 0x00007ffff039ed71 in __psm_mq_send (mq=<value optimized out>,
    dest=<value optimized out>, flags=<value optimized out>,
    stag=<value optimized out>, buf=<value optimized out>,
    len=<value optimized out>) at psm_mq.c:413
#14 0x00007ffff05c4ea8 in ompi_mtl_psm_send ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_mtl_psm.so
#15 0x00007ffff1eeddea in mca_pml_cm_send ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_pml_cm.so
#16 0x00007ffff79253da in PMPI_Sendrecv ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/libmpi.so.1
#17 0x00000000004045ef in ExchangeHalos (cartComm=0x715460,
    devSend=0x1308350000, hostSend=0x7b8710, hostRecv=0x7c0720,
    devRecv=0x1308358000, neighbor=1, elemCount=4096) at CUDA_Aware_MPI.c:70
#18 0x00000000004033d8 in TransferAllHalos (cartComm=0x715460,
    domSize=0x7fffffffcd80, topIndex=0x7fffffffcd60, neighbors=0x7fffffffcd90,
    copyStream=0xaa4450, devBlocks=0x7fffffffcd30,
    devSideEdges=0x7fffffffcd20, devHaloLines=0x7fffffffcd10,
    hostSendLines=0x7fffffffcd00, hostRecvLines=0x7fffffffccf0) at Host.c:400
#19 0x000000000040363c in RunJacobi (cartComm=0x715460, rank=0, size=2,
    domSize=0x7fffffffcd80, topIndex=0x7fffffffcd60, neighbors=0x7fffffffcd90,
    useFastSwap=0, devBlocks=0x7fffffffcd30, devSideEdges=0x7fffffffcd20,
    devHaloLines=0x7fffffffcd10, hostSendLines=0x7fffffffcd00,
    hostRecvLines=0x7fffffffccf0, devResidue=0x1310480000,
    copyStream=0xaa4450, iterations=0x7fffffffcd44,
    avgTransferTime=0x7fffffffcd48) at Host.c:466
#20 0x0000000000401ccb in main (argc=4, argv=0x7fffffffcea8) at Jacobi.c:60
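For reference, frame #17 is the halo exchange in the sample; as the arguments show, the device buffers are handed directly to MPI_Sendrecv. Reconstructed from the backtrace arguments only (not the sample's exact source), the call is roughly:

    /* Sketch of the call in frame #17 (CUDA_Aware_MPI.c:70), reconstructed
     * from the backtrace arguments; not the sample's exact source. The
     * device pointers devSend/devRecv go straight to MPI, which is what
     * exercises the CUDA-aware path. */
    #include <mpi.h>

    static void ExchangeHalosSketch(MPI_Comm cartComm, double *devSend,
                                    double *devRecv, int neighbor,
                                    int elemCount)
    {
        /* MPI_DOUBLE assumed: len 32768 / elemCount 4096 = 8 bytes/element
         * in frames #11-#13 */
        MPI_Sendrecv(devSend, elemCount, MPI_DOUBLE, neighbor, 0,
                     devRecv, elemCount, MPI_DOUBLE, neighbor, 0,
                     cartComm, MPI_STATUS_IGNORE);
    }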

Pierre.


________________________________
From: KESTENER Pierre
Sent: Wednesday, 30 October 2013 16:34
To: us...@open-mpi.org
Cc: KESTENER Pierre
Subject: OpenMPI-1.7.3 - cuda support

Hello,

I'm having problems running a simple CUDA-aware MPI application, the one found at
https://github.com/parallel-forall/code-samples/tree/master/posts/cuda-aware-mpi-example

I have changed the symbol ENV_LOCAL_RANK to OMPI_COMM_WORLD_LOCAL_RANK.
My cluster has 2 K20m GPUs per node, with a QLogic IB stack.
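That variable is used to bind each rank to a GPU before MPI is initialized; with the modified name, the binding looks roughly like this (a sketch only, with my own function and variable names, not necessarily the sample's exact code):

    /* Sketch: bind each MPI rank to one of the node's two K20m GPUs using
     * the OpenMPI local-rank environment variable. Names are illustrative. */
    #include <stdlib.h>
    #include <cuda_runtime.h>

    static void SelectGpuByLocalRank(void)
    {
        const char *localRankStr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
        int localRank = (localRankStr != NULL) ? atoi(localRankStr) : 0;

        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        /* 2 GPUs per node: local rank 0 -> GPU 0, local rank 1 -> GPU 1 */
        if (deviceCount > 0)
            cudaSetDevice(localRank % deviceCount);
    }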

The normal CUDA/MPI application works fine, but the CUDA-aware MPI app crashes
when using 2 MPI processes on the 2 GPUs of the same node.
The error message is:
    Assertion failure at ptl.c:200: nbytes == msglen
I can send the complete backtrace from cuda-gdb if needed.

The same app, when running on 2 GPUs on 2 different nodes, gives another error:
    jacobi_cuda_aware_mpi:28280 terminated with signal 11 at PC=2aae9d7c9f78 SP=7fffc06c21f8. Backtrace:
    /gpfslocal/pub/local/lib64/libinfinipath.so.4(+0x8f78)[0x2aae9d7c9f78]


Can someone give me hints on where to look to track down this problem?
Thank you.

Pierre Kestener.

