> A common method for handling this sort of thing is to randomize
> the retry timeout. It would be a good idea to randomize all timeouts,
> but the BUSY replies should probably randomize over a longer time
> period.
>
> Randomization prevents nodes in the cluster from self-synchronizing
> and making the load on the SA worse.
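To make the quoted suggestion concrete, here is a minimal user-space
sketch in Python, not kernel code. The constant values and the function
name retry_timeout_ms are assumptions chosen for illustration; nothing
here reflects the actual MAD layer timeouts. It only shows the idea of
adding random jitter to every timeout and using a wider jitter window
after a BUSY reply.

```python
import random

# Hypothetical values, chosen only for illustration.
BASE_TIMEOUT_MS = 1000   # fixed part of every retry timeout
JITTER_MS = 500          # jitter window for ordinary timeouts
BUSY_JITTER_MS = 4000    # wider jitter window after a BUSY reply


def retry_timeout_ms(busy=False, rng=random):
    """Return a randomized retry timeout in milliseconds.

    Every timeout gets some jitter so nodes do not retry in lockstep;
    a BUSY reply spreads the retries over a longer period.
    """
    spread = BUSY_JITTER_MS if busy else JITTER_MS
    return BASE_TIMEOUT_MS + rng.uniform(0, spread)
```

Passing a seeded random.Random instance as rng makes the behaviour
reproducible for testing.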
I agree that randomization would be nice, but I think we want even
more than that. Part of the problem we've seen with the current
implementation is that when a large HPC job starts, everyone and their
dog sends the SA a query. These time out at around the same time and
get resent, and the SA ends up processing a huge number of duplicates.
The MAD layer could be a lot more intelligent and avoid sending more
than a handful (1?) of retries (or even initial requests) at a time
until some complete.

- Sean
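A rough sketch of the throttling idea, as a user-space Python model
rather than kernel code. The class ThrottledSender, its method names,
and the max_in_flight parameter are all hypothetical; this is not how
the MAD layer is structured today. It only illustrates holding back
initial requests and retries until earlier ones complete.

```python
from collections import deque


class ThrottledSender:
    """Hypothetical model: allow at most max_in_flight requests at once."""

    def __init__(self, send, max_in_flight=1):
        self._send = send            # callable that actually transmits
        self._max = max_in_flight
        self._in_flight = set()      # ids sent and not yet completed
        self._pending = deque()      # ids waiting for a free slot

    def submit(self, req_id):
        """Queue a new request; it is sent only when a slot is free."""
        self._pending.append(req_id)
        self._pump()

    def complete(self, req_id):
        """A response arrived: free the slot and send the next request."""
        self._in_flight.discard(req_id)
        self._pump()

    def timed_out(self, req_id):
        """A request timed out: queue the retry behind pending work."""
        if req_id in self._in_flight:
            self._in_flight.discard(req_id)
            self._pending.append(req_id)
        self._pump()

    def _pump(self):
        # Send queued requests while slots remain.
        while self._pending and len(self._in_flight) < self._max:
            req_id = self._pending.popleft()
            self._in_flight.add(req_id)
            self._send(req_id)
```

With max_in_flight=1, a burst of submissions results in a single
request on the wire; the rest, including any retries, go out one at a
time as earlier ones complete or time out.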