Hi!

I'm using OpenMPI 1.3 on 30 nodes connected by Gigabit Ethernet, running
Red Hat Linux x86_64.

Our MPI job sometimes hangs and shows the following error log:



 [btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)



I ran a test like this: I wrote a hello world program that sends "helloworld"
from rank 0 to rank 1. I then modified mca_btl_tcp_frag_recv() in
btl_tcp_frag.c to force the readv() return value cnt to -1, rebuilt OpenMPI,
and replaced the dynamic libraries. When I ran the hello world program again,
the MPI job hung in MPI_Recv().

I have the following questions:

Does OpenMPI support detecting network errors in the TCP BTL, such as a failed
readv or writev? I found that mca_btl_tcp_endpoint_recv_handler() in the BTL
layer cannot return the error status to the PML. How could I make it do so?
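
To make the question concrete, here is a minimal sketch in plain POSIX C of the kind of error reporting I mean. It is not Open MPI code: the names frag_status, frag_recv_sketch and frag_recv_demo are made up for illustration, and a Unix socketpair stands in for the TCP connection. The idea is only that the receive path classifies the readv() result and hands a hard failure back to its caller instead of dropping it.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical status codes for this sketch; not Open MPI constants. */
enum frag_status {
    FRAG_OK = 0,     /* some bytes were read */
    FRAG_RETRY = 1,  /* nothing available yet on a non-blocking socket */
    FRAG_CLOSED = 2, /* readv returned 0: end of stream */
    FRAG_ERROR = 3   /* hard failure; *err_out holds the errno value */
};

/*
 * Hypothetical helper: call readv() and classify the outcome so that
 * the caller can react to a hard failure (for example ETIMEDOUT)
 * instead of waiting forever.
 */
enum frag_status frag_recv_sketch(int fd, const struct iovec *iov,
                                  int iovcnt, ssize_t *nread, int *err_out)
{
    ssize_t cnt;

    *nread = 0;
    *err_out = 0;

    /* Restart the call if a signal interrupted it. */
    do {
        cnt = readv(fd, iov, iovcnt);
    } while (cnt < 0 && errno == EINTR);

    if (cnt > 0) {
        *nread = cnt;
        return FRAG_OK;
    }
    if (cnt == 0) {
        return FRAG_CLOSED;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        return FRAG_RETRY;
    }

    /* Anything else is reported upward rather than swallowed. */
    *err_out = errno;
    return FRAG_ERROR;
}

/* Small self-check: returns 0 if each outcome is classified as expected. */
int frag_recv_demo(void)
{
    int sv[2];
    char buf[16] = {0};
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
    ssize_t n;
    int err;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

    /* Normal case: the 10 bytes of "helloworld" arrive. */
    if (write(sv[0], "helloworld", 10) != 10) return -1;
    if (frag_recv_sketch(sv[1], &iov, 1, &n, &err) != FRAG_OK || n != 10) return 1;
    if (memcmp(buf, "helloworld", 10) != 0) return 1;

    /* Peer closes its end: readv returns 0. */
    close(sv[0]);
    if (frag_recv_sketch(sv[1], &iov, 1, &n, &err) != FRAG_CLOSED) return 2;
    close(sv[1]);

    /* Hard failure: an invalid descriptor makes readv fail with EBADF. */
    if (frag_recv_sketch(-1, &iov, 1, &n, &err) != FRAG_ERROR || err != EBADF) return 3;

    return 0;
}
```

In this sketch the caller always gets either data, "retry", "closed", or an errno value. What I cannot find is the equivalent path from the TCP BTL up to the PML.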



How can MPI_Send, MPI_Isend, MPI_Recv, and MPI_Irecv detect such errors and
avoid hanging?


Thanks a lot!


