>
>    Changqing> Is there a commonly recommended value for this timeout?
>    Changqing> I use 18, which represents about 1 second.
>
>18 should be OK I guess, unless you have congestion in your 
>fabric, in which case you have other problems anyway.
>
>    Changqing> It is very hard to reproduce this error with standalone
>    Changqing> code. I use HP-MPI and need 8 ranks, at least 4 nodes
>    Changqing> with 2 cards on each node, and just one of our hundred
>    Changqing> test codes catches this error, and it is in an
>    Changqing> MPI_Scatterv operation.
>
>Unless you can narrow down a way to reproduce this, I don't 
>think it's going to be possible for anyone to help debug it.

OK, I forgot to mention: if I use RDMA on both channels, the hang is hard
to reproduce. If I create an SRQ on one of the channels, the other channel
hangs even on the first RDMA operation. I will write standalone code for
you driver guys to debug.
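
On the timeout value quoted above: assuming the usual InfiniBand encoding of
the QP local ACK timeout field (4.096 usec * 2^timeout, with 0 meaning wait
forever), a value of 18 works out to roughly 1.07 seconds. A minimal sketch of
that arithmetic follows; the helper name ack_timeout_usec is made up for
illustration and is not part of any verbs API.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not a libibverbs call): convert the 5-bit QP
 * timeout field into microseconds, assuming 4.096 usec * 2^timeout.
 * Returns -1.0 for 0 (treated as "infinite") and for values above 31.
 */
static double ack_timeout_usec(unsigned int timeout)
{
    if (timeout == 0 || timeout > 31)
        return -1.0;
    return 4.096 * (double)(1ULL << timeout);
}

/* Print the timeout in seconds for a given field value. */
static void print_ack_timeout(unsigned int timeout)
{
    double usec = ack_timeout_usec(timeout);

    if (usec < 0.0)
        printf("timeout=%u: infinite or out of range\n", timeout);
    else
        printf("timeout=%u: %.3f s\n", timeout, usec / 1e6);
}
```

Calling print_ack_timeout(18) shows the value in seconds; each increment of the
field doubles the wait, so 19 would be a bit over 2 seconds.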


--CQ


>
> - R.
>

_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
