We have recently been having problems running trunk with the openib component enabled on one of our clusters. The problem occurs right in the initialization phase; here is the stack trace right before the segfault:

---snip---
(gdb) where
#0  mca_btl_openib_tune_endpoint (openib_btl=0x762a40, endpoint=0x7d9660) at btl_openib.c:470
#1  0x00007f1062f105c4 in mca_btl_openib_add_procs (btl=0x762a40, nprocs=2, procs=0x759be0, peers=0x762440, reachable=0x7fff22dd16f0) at btl_openib.c:1093
#2  0x00007f106316102c in mca_bml_r2_add_procs (nprocs=2, procs=0x759be0, reachable=0x7fff22dd16f0) at bml_r2.c:201
#3  0x00007f10615c0dd5 in mca_pml_ob1_add_procs (procs=0x70dc00, nprocs=2) at pml_ob1.c:334
#4  0x00007f106823ed84 in ompi_mpi_init (argc=1, argv=0x7fff22dd1da8, requested=0, provided=0x7fff22dd184c) at runtime/ompi_mpi_init.c:790
#5  0x00007f1068273a2c in MPI_Init (argc=0x7fff22dd188c, argv=0x7fff22dd1880) at init.c:84
#6  0x00000000004008e7 in main (argc=1, argv=0x7fff22dd1da8) at hello_world.c:13
---snip---


In line 538 of the file containing the mca_btl_openib_tune_endpoint routine, the strcmp call fails because recv_qps is a NULL pointer:


---snip---

if(0 != strcmp(mca_btl_openib_component.receive_queues, recv_qps)) {

---snip---
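As a stopgap, a guard along the following lines would at least avoid passing NULL to strcmp. This is only a minimal sketch, not a fix for the underlying problem: it assumes that a NULL recv_qps simply means "no device-specific receive_queues value to compare against", which I have not verified, and the helper name receive_queues_differ is hypothetical (it is not part of the Open MPI code base).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Open MPI: returns 1 if the device-specific
 * receive queue specification differs from the configured one, 0 otherwise.
 * Assumption: a NULL recv_qps means "no device-specific value", so it is
 * treated as "no difference" instead of being handed to strcmp(). */
static int receive_queues_differ(const char *configured, const char *recv_qps)
{
    if (NULL == recv_qps) {
        return 0;   /* nothing to compare against; keep the configured value */
    }
    if (NULL == configured) {
        return 1;   /* a device-specific value exists but none is configured */
    }
    return 0 != strcmp(configured, recv_qps);
}
```

The original condition would then read `if (receive_queues_differ(mca_btl_openib_component.receive_queues, recv_qps)) {`. This only hides the symptom; the real question is still why recv_qps ends up NULL on trunk.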

Does anybody have an idea what might be going wrong and how to resolve it? Just to confirm: everything works perfectly with the 1.8 series on that very same cluster.

Thanks
Edgar
