Sean said,

> Because applications may handle BUSY replies differently, we shouldn't simply
> start hiding them from the user.

Sean - remember that this patch will still return a BUSY status to the caller: if retries are exhausted and the last return code was BUSY, then that's what the caller will get. Thus, code which sets retries to zero will not be affected by this patch at all.

Hal said,

> All I was getting at here was: does retrying when busy work? If not,
> why retry at all at the MAD layer (regardless of retries requested)
> and perhaps use a longer timeout for this. If it does work, maybe the
> timeout on the subsequent retries should be extended.

Personally, I think it's been extremely helpful. We've been using busy status to tell compute nodes to slow down since our old proprietary stack, and we saw a significant improvement in overall traffic congestion when we added this patch to OFED clusters using our SM. In addition, use of the BUSY return code simplifies debugging traffic congestion problems (since it allows you to immediately differentiate between SA overload and other traffic issues), and it paves the way for more sophisticated back-off strategies in the future.

As to that, and your question: our old stack used two different timeout values specified by the client. One value was for actual timeouts and one was for busy responses. In the case of busy responses, we added a randomization factor to spread out the traffic. The issue with adapting that to the Linux-RDMA stack is that it's an API change.

What I would suggest, personally, is something like this:

1. Take either the timeout passed by the caller OR a predefined constant, whichever is larger. I would suggest setting the predefined constant to something moderate, say 2 seconds.

2. Add a randomization factor - say between -250 and +250 ms?

3. Update the packet timeout with this new value.
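
To make the three steps concrete, here is a rough userspace sketch of the calculation. It is not taken from the patch: the function name and both constants are made up for illustration, the 2 second floor and +/-250 ms window are just the values suggested above, and in-kernel code would use the kernel's own random helpers rather than rand().

```c
#include <stdlib.h>

/* Hypothetical constants, using the values suggested in the steps above. */
#define BUSY_MIN_TIMEOUT_MS 2000 /* predefined floor: 2 seconds */
#define BUSY_JITTER_MS       250 /* randomization window: +/- 250 ms */

/*
 * Hypothetical helper: compute the timeout (in ms) to use when
 * re-sending a request that was answered with BUSY.
 */
static int busy_retry_timeout_ms(int caller_timeout_ms)
{
	/* Step 1: the larger of the caller's timeout and the floor. */
	int base = caller_timeout_ms > BUSY_MIN_TIMEOUT_MS ?
		   caller_timeout_ms : BUSY_MIN_TIMEOUT_MS;

	/* Step 2: jitter in the range [-250, +250] ms. */
	int jitter = (rand() % (2 * BUSY_JITTER_MS + 1)) - BUSY_JITTER_MS;

	/* Step 3: the caller stores this as the new packet timeout. */
	return base + jitter;
}
```

With these values, a caller timeout of 500 ms gives a retry timeout somewhere between 1750 and 2250 ms, and a caller timeout of 5000 ms gives one between 4750 and 5250 ms. Since only the timeout used after a BUSY reply changes, nothing in the API needs to change.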