Steve,

We have not seen this hang before, and I am not sure what is happening at this point. I will look through the code for this behavior.
BTW, mvapich2-0.9.8-p2 and the ofa mvapich2 code are identical at this point.

--Sundeep.

On Fri, 1 Jun 2007, Steve Wise wrote:

> Sundeep/Sean,
>
> I'm helping a customer who is trying to run mvapich2 over chelsio's
> rnic. They're running a simple program that does an mpi init, 1000
> barriers, then a finalize. They're using ofed-1.2-rc3, mpiexec-0.82,
> and mvapich2-0.9.8-p2 (not the mvapich2 from the ofed kit). Also they
> aren't using mpd to start up stuff. They're using pmi I guess (I'm not
> sure what pmi is, but the mpiexec has -comm=pmi). BTW: I can run the
> same program fine on my 8 node cluster using mpd and the ofa mvapich2 code.
>
> On their cluster a 4 node/4 process job hangs in finalize almost always.
> When it hangs, one process is always stuck in rdma_destroy_id().
>
> Here's the stack:
>
> (gdb) bt
> #0  0x0000003c7cf0ae2b in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0
> #1  0x000000000068db20 in ?? ()
> #2  0x0000000060040a0a in ?? ()
> #3  0x0000003c7cf08800 in pthread_cond_destroy@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> #4  0x0000002a9579a09c in ucma_destroy_kern_id (fd=0, handle=6871424) at src/cma.c:403
> #5  0x0000002a9579a163 in rdma_destroy_id (id=0x68d980) at src/cma.c:425
> #6  0x0000000000423ef9 in ib_finalize_rdma_cm ()
> #7  0x00000000004183f6 in MPIDI_CH3I_CM_Finalize ()
> #8  0x000000000044b03b in MPIDI_CH3_Finalize ()
> #9  0x000000000043169e in MPID_Finalize ()
> #10 0x000000000040c3ef in PMPI_Finalize ()
> #11 0x0000000000403af4 in main ()
> (gdb)
>
> I'm not sure I believe this stack trace fully, because
> ucma_destroy_kern_id() doesn't call pthread_cond_destroy(). However
> rdma_destroy_id() does. So I'm thinking that ucma_destroy_id() has
> already been executed and rdma_destroy_id() is freeing the cm_id and we
> get stuck in pthread_cond_destroy() destroying the pthread condition object.
>
> I'm wondering if y'all have ever seen this kind of hang? I can kill the
> process and it exits, so I don't think we're stuck down in the
> kernel IWCM or anything.
>
> Any thoughts?
>
> Thanks,
>
> Steve.
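
For reference, a minimal sketch of the teardown ordering that keeps pthread_cond_destroy() from being called while another thread may still be waiting on the condition variable. POSIX leaves destroying a condition variable with blocked waiters undefined, and some implementations block in that case. This is not the librdmacm code and does not claim to be the cause of the hang above; struct demo_id, waiter() and demo_teardown() are hypothetical names made up for illustration.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for an object that owns a condition variable. */
struct demo_id {
    pthread_mutex_t mut;
    pthread_cond_t  cond;
    int             done;   /* predicate, guarded by mut */
};

/* Waits until the owner marks the object as done. */
static void *waiter(void *arg)
{
    struct demo_id *id = arg;

    pthread_mutex_lock(&id->mut);
    while (!id->done)
        pthread_cond_wait(&id->cond, &id->mut);
    pthread_mutex_unlock(&id->mut);
    return NULL;
}

/* Wake the waiter, join it, and only then destroy the condition
 * variable and mutex. Returns 0 if every step succeeded. */
int demo_teardown(void)
{
    struct demo_id id;
    pthread_t thr;
    int rc;

    id.done = 0;
    if (pthread_mutex_init(&id.mut, NULL))
        return -1;
    if (pthread_cond_init(&id.cond, NULL)) {
        pthread_mutex_destroy(&id.mut);
        return -1;
    }
    if (pthread_create(&thr, NULL, waiter, &id)) {
        pthread_cond_destroy(&id.cond);
        pthread_mutex_destroy(&id.mut);
        return -1;
    }

    /* Set the predicate and wake any waiter while holding the mutex. */
    pthread_mutex_lock(&id.mut);
    id.done = 1;
    pthread_cond_broadcast(&id.cond);
    pthread_mutex_unlock(&id.mut);

    /* After the join, no thread can still be inside pthread_cond_wait(). */
    if (pthread_join(thr, NULL))
        return -1;

    rc = pthread_cond_destroy(&id.cond);
    rc |= pthread_mutex_destroy(&id.mut);
    return rc;
}

#ifdef DEMO_MAIN
int main(void)
{
    int rc = demo_teardown();

    printf("teardown %s\n", rc == 0 ? "ok" : "failed");
    return rc != 0;
}
#endif
```

Built with something like `cc -pthread -DDEMO_MAIN demo.c`, the program should exit promptly. The point of the ordering is that the join guarantees the waiter has left pthread_cond_wait() before the destroy call runs.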
