> This ensures that naïve IB applications cannot overwhelm the SA with
> queries, which could happen when a cluster is being rebooted, or when a
> large HPC application is started.

I don't object to the concept of treating a busy response as a timeout, but how 
does this help prevent overwhelming the SA?  Even when the SA says that it's too 
busy to respond, the query is still retried, and the timeout specified by the 
user is not adjusted.  I would think that you'd at least want to adjust the 
timeout (double it or use some random backoff).
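
To make the suggestion concrete, here is a rough sketch of doubling the timeout 
on each busy response with some random jitter mixed in.  The helper name 
busy_backoff_ms() and its parameters are made up for illustration; nothing like 
this exists in ib_mad or ib_sa today, and the caller is assumed to supply the 
random value.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of ib_mad or ib_sa.
 *
 * base_ms: timeout originally requested by the user
 * attempt: number of busy responses seen so far for this query
 * max_ms:  upper bound on the timeout
 * rnd:     random value supplied by the caller
 */
static uint32_t busy_backoff_ms(uint32_t base_ms, unsigned int attempt,
                                uint32_t max_ms, uint32_t rnd)
{
    uint64_t t = base_ms;
    unsigned int i;

    /* Double the timeout once per busy response, up to the cap. */
    for (i = 0; i < attempt && t < max_ms; i++)
        t *= 2;
    if (t > max_ms)
        t = max_ms;

    /*
     * Subtract jitter so the result lands in [t - t/2, t].  Clients that
     * were refused at the same moment then retry at different times.
     */
    return (uint32_t)(t - rnd % (t / 2 + 1));
}
```

With a 1000 ms user timeout and no jitter this gives 1000, 2000, 4000, ... ms 
until it hits max_ms.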

The general guideline that we've been using for adjusting timeouts has been to 
report the failures and let the caller make the necessary adjustments.  As 
far as I know, the only ways for user space applications to query the SA are 
through the librdmacm, which sets retries to 0, or through the libibumad 
interface directly.  I would expect any application using the latter to be 
intelligent enough to handle a busy response.

Maybe we should re-think that guideline and allow users to simply indicate that 
the MAD layer should use reasonable defaults.  This would enable the ib_mad 
module to adjust the timeout values for all consumers based on actual 
destination response times.  It could also back off retrying multiple requests 
that were initiated around the same time, instead only retrying the first 
request, while simply increasing the timeout values for the others.  This is 
more complex, but we should be able to start with something fairly simple.
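
As a starting point for "something fairly simple", the timeout could be derived 
from a smoothed estimate of observed response times per destination.  The sketch 
below borrows the smoothed-mean-plus-deviation estimator used for TCP 
retransmission timers; that choice, the struct rtt_est type, and both function 
names are assumptions for illustration only, not existing ib_mad code.

```c
#include <stdint.h>

/* Hypothetical per-destination state, not an existing ib_mad structure. */
struct rtt_est {
    uint32_t srtt_ms;    /* smoothed response time */
    uint32_t rttvar_ms;  /* smoothed deviation of the response time */
    int seeded;          /* nonzero once a sample has been recorded */
};

/* Fold one measured response time into the estimate. */
static void rtt_sample(struct rtt_est *e, uint32_t sample_ms)
{
    uint32_t diff;

    if (!e->seeded) {
        e->srtt_ms = sample_ms;
        e->rttvar_ms = sample_ms / 2;
        e->seeded = 1;
        return;
    }

    diff = e->srtt_ms > sample_ms ? e->srtt_ms - sample_ms
                                  : sample_ms - e->srtt_ms;

    /* rttvar = 3/4 * rttvar + 1/4 * |srtt - sample| */
    e->rttvar_ms = (uint32_t)((3 * (uint64_t)e->rttvar_ms + diff) / 4);
    /* srtt = 7/8 * srtt + 1/8 * sample */
    e->srtt_ms = (uint32_t)((7 * (uint64_t)e->srtt_ms + sample_ms) / 8);
}

/* Timeout to use for the next request; default_ms applies until seeded. */
static uint32_t rtt_timeout_ms(const struct rtt_est *e, uint32_t default_ms)
{
    if (!e->seeded)
        return default_ms;
    return e->srtt_ms + 4 * e->rttvar_ms;
}
```

A consumer that asks for "reasonable defaults" would get rtt_timeout_ms() for 
the destination rather than a fixed value.  Holding back all but the first of a 
batch of requests would be layered on top of this and is not shown.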

- Sean