On Tue, Oct 20, 2009 at 12:01 AM, Ian Clarke <ian.clarke at gmail.com> wrote:
> It would be interesting to try to build up a picture of what exactly
> is happening while we wait for a data request to come back with a
> response. We could presumably do this by collecting statistics on one
> or more nodes, tracking stuff like:
>
>  - Time from a request entering to exiting
>  - Time for a request exiting to generate a response relative to its
>    TTL, and anything else that might indicate where it is in its
>    request chain
>
> From this we could narrow down where delays are coming from: is it
> throughput, or perhaps just long request chains?
>
> Has anything like this been done yet?
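The bookkeeping Ian proposes could be sketched roughly as below. This is purely illustrative (Freenet itself is Java, and all names here are hypothetical): record entry and exit times per request, then aggregate latency by HTL to see where time is going.

```python
import time
from collections import defaultdict

class RequestStats:
    """Per-node latency bookkeeping, bucketed by HTL.

    Hypothetical sketch: record_entry/record_exit would be called from a
    node's request-handling path; structure is illustrative only.
    """

    def __init__(self):
        self.in_flight = {}                      # request id -> (entry time, htl)
        self.latency_by_htl = defaultdict(list)  # htl -> list of latencies (s)

    def record_entry(self, req_id, htl, now=None):
        """Note the time a request entered this node, tagged with its HTL."""
        self.in_flight[req_id] = (now if now is not None else time.monotonic(), htl)

    def record_exit(self, req_id, now=None):
        """Note the time the response left; return elapsed seconds."""
        entry, htl = self.in_flight.pop(req_id)
        elapsed = (now if now is not None else time.monotonic()) - entry
        self.latency_by_htl[htl].append(elapsed)
        return elapsed

    def mean_latency(self, htl):
        """Average observed latency for one HTL bucket, or None if no data."""
        samples = self.latency_by_htl[htl]
        return sum(samples) / len(samples) if samples else None
```

Passing `now` explicitly keeps the sketch testable; a real node would just use the monotonic clock.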
Not in great detail. You can get some information about request chain lengths by looking at the hourly statistics on request counts and success rates by HTL. Roughly: more requests succeed quickly than slowly, and comparable numbers succeed and fail. (Don't forget to correct for observational biases when looking at the data; see my previous emails / flog posts. Note also that some of the numbers on your stats page lie -- see e.g. bug 3526.)

I could analyze this, I suppose, but I find the question of "why don't requests succeed?" more important. IMHO it doesn't make much sense to spend a lot of effort on latency until success rates are higher, especially since fixing success rates probably improves latency too. Serious investigation of success rates is currently blocked on tracer requests (see bugs 3550 and 3568); serious latency investigation probably blocks on the same bugs.

Also, the answer depends on context: are you talking about CHK or SSK requests? Do you mean fetch time, or messaging app ping latency? For messaging apps using exclusively SSKs, we already know the answer. In most clients, ping times are dominated by things the client is responsible for, like polling intervals -- clients should leave requests running and get the ULPR benefits, rather than slowly rotating through boards the way e.g. Frost does. In messaging apps that don't make that mistake, latency is consistently measured in the 10-30s range (average; numbers to either side of this are not uncommon). That number is almost certainly dominated by the policy discussed in bug 3338.

Evan Daniel