On Tue, Oct 20, 2009 at 12:01 AM, Ian Clarke <ian.clarke at gmail.com> wrote:
> It would be interesting to try to build up a picture of what exactly
> is happening while we wait for a data request to come back with a
> response.  We could presumably do this by collecting statistics on one
> or more nodes, tracking stuff like:
>
>  - Time from a request entering to exiting
>  - Time for a request exiting to generate a response relative to its
> TTL and anything else that might indicate where it is in its request
> chain
>
> From this we could narrow down where delays are coming from: is it
> throughput, or perhaps just long request chains?
>
> Has anything like this been done yet?
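The kind of per-node instrumentation described above could be sketched roughly as follows. All names here are hypothetical; a real version would hook into the node's actual request handling, and this just illustrates the bookkeeping (entry/exit timestamps, bucketed by HTL):

```python
import time
from collections import defaultdict

class RequestStats:
    """Track time from a request entering a node to exiting it, bucketed by HTL."""

    def __init__(self):
        self.entry_times = {}                       # request id -> (entry time, htl)
        self.durations_by_htl = defaultdict(list)   # htl -> list of durations (seconds)

    def on_enter(self, request_id, htl):
        # Record when the request arrived and at what HTL.
        self.entry_times[request_id] = (time.monotonic(), htl)

    def on_exit(self, request_id):
        # Compute how long the request spent on this node.
        entered, htl = self.entry_times.pop(request_id)
        self.durations_by_htl[htl].append(time.monotonic() - entered)

    def mean_duration(self, htl):
        samples = self.durations_by_htl[htl]
        return sum(samples) / len(samples) if samples else None
```

Bucketing by HTL is what would let you separate "requests are slow per hop" from "request chains are long".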

Not in great detail.

You can get some info about request chain lengths by looking at the
hourly statistics on request counts and success rates by HTL.
Roughly: more requests succeed quickly than slowly, and comparable
numbers succeed and fail.  (Don't forget to correct for observational biases
when looking at data.  See my previous emails / flog posts.  Note also
that some of the numbers on your stats page lie, see eg bug 3526.)
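The per-HTL tabulation I mean is just this sort of thing (the counts below are invented for illustration; the real hourly numbers come from the stats pages, and remember the bias caveats above):

```python
# Hypothetical hourly counters: HTL -> (successes, failures).
# These numbers are made up; they only show the shape of the analysis.
hourly_counts = {
    18: (120, 130),
    10: (95, 100),
    2:  (30, 80),
}

for htl, (ok, fail) in sorted(hourly_counts.items(), reverse=True):
    total = ok + fail
    rate = 100.0 * ok / total
    print(f"HTL {htl:2d}: {ok}/{total} succeeded ({rate:.1f}%)")
```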

I could analyze this, I suppose, but I find the question of "why don't
requests succeed" more important.  IMHO it doesn't make much sense to
spend a lot of effort on latency until success rates are higher,
especially since fixing success rates probably improves latency.
Serious investigation of success rates is currently blocking on tracer
requests (see bugs 3550, 3568); serious latency investigation probably
blocks on the same bugs.

Also, the answer depends on context: are you talking about CHK or SSK
requests?  Do you mean fetch time, or messaging app ping latency?  For
messaging apps using exclusively SSKs, we know what the answer is
already.  Ping times are dominated by things the client is responsible
for, like polling intervals, in most clients -- they should leave
requests running and get the ULPR benefits, rather than what eg Frost
does of slowly rotating through boards.  In messaging apps that don't
make that mistake, latency is consistently measured in the 10-30s
range (average; numbers to either side of this are not uncommon).
That number is almost certainly dominated by the policy discussed in
bug 3338.
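The claim that polling dominates ping time is just arithmetic. As a back-of-the-envelope sketch (the fetch time here is illustrative, not a measurement):

```python
# A message that appears at a uniformly random time waits, on average,
# half the polling interval before the poll even starts, on top of the
# fetch time itself.
def expected_latency(poll_interval_s, fetch_time_s):
    return poll_interval_s / 2.0 + fetch_time_s

# Frost-style slow rotation through boards (say, a 10-minute cycle)
# vs. leaving the request running so ULPRs deliver data on insert:
print(expected_latency(600, 20))   # polling: 320 s on average
print(expected_latency(0, 20))     # persistent request: 20 s
```

That is why clients that leave requests running see latency dominated by the node, not by their own polling schedule.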

Evan Daniel
