Argh. It looks like a lot of the variability (and the optimistic figures) below was caused by my looking at all requests rather than only local requests. Local requests are obviously vastly fewer in number... but they give full-request times.
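For reference, a minimal sketch of pulling the two kinds of timings apart; the log line formats are inferred from the grep patterns used elsewhere in this mail, and the sample log excerpt is made up:

```shell
# Made-up log excerpt; real logs live under fast/logs-dark/.
# "fetch took" lines are local (full-request) times; "request took" lines
# are requests routed through this node.
cat > /tmp/sample.log <<'EOF'
00:01:02 RequestSender: Successful CHK fetch took 8123
00:02:03 RequestSender: Successful CHK request took 15234 average Median 11200
00:03:04 RequestSender: Successful CHK fetch took 42000
EOF

# Local timings, in ms, one per line:
sed -n 's/^.*Successful CHK fetch took //p' /tmp/sample.log > /tmp/local.list

# Remote timings, stripping the trailing running stats:
sed -n 's/^.*Successful CHK request took //p' /tmp/sample.log \
  | sed 's/ average Median.*$//' > /tmp/remote.list
```

The sort / percentile tricks quoted further down then apply to either list.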
These are my current times from the stats page:

Successful   40.870s
Unsuccessful 11.796s
Average      12.106s

However, these seem to be even more variable than the remote averages, probably because different content fetched by different nodes is harder to find and takes more hops... E.g. nextgens' figures are 26/10/25, and TheSeeker's were 30/11/14. So it looks like we'll have to use both...

Recomputing the turtling table below based on my local requests over the same period (midnight to 11:00):

$ zgrep "Successful CHK fetch took" fast/logs-dark/freenet-1197-2009-01-14-0* \
    fast/logs-dark/freenet-1197-2009-01-14-10-* \
  | sed "s/^.*Successful CHK fetch took //" > local-only-timings.list

Cutoff   % reduction in mean request time   % of requests turtled
15s      84%                                59%
30s      73%                                37%
45s      65%                                22%
60s      59%                                16%
90s      51%                                7.6%
120s     46%                                4.4%

On the other hand, on average a request going through a given node must be half way through its request... so we could reasonably just double the timings, if my node is a representative sample of the requests on the overall network: on the theory that the likelihood of a node being really slow doubles with twice the number of hops, and that the search time is proportional to the number of hops too.

Cutoff   % reduction in mean request time   % of requests turtled
30s      74-76%                             41-42%
60s      56-59%                             22-23%
90s      42-48%                             12-14%
120s     31-39%                             7-9%

These are actually reasonably similar, especially in the 30 and 60 second slots... Both show about a 50% gain if we turtle anything above 90 seconds...

Obviously if we implement turtling we should show the proportion of local (and remote) requests that get a transfer failure, and if possible the proportion that are offered the key afterwards...

So the proposed course of action:
- Turtle anything over 90 seconds.
- Show more stats on the web interface, disableable with a config option.
- Show the probability of a transfer failure, and the proportion of transfer failures that result in the key being offered and fetched, or found some other way, within say 20 minutes.

One problem with the above tables is that they cut out the samples over the threshold rather than replacing them with the threshold... Correcting the local timings table for this:

Cutoff   % reduction in mean request time   % of requests turtled
30s      45%                                37%
60s      25%                                16%
90s      15%                                7.6%
120s     10%                                4.4%

The doubled remote timings table:

Cutoff   % reduction in mean request time   % of requests turtled
30s      59%                                41-42%
60s      40%                                22-23%
90s      28%                                12-14%
120s     9%                                 7-9%

Hence if we use 30 seconds we should get a 45-59% gain, but at the cost of turtling around 40% of requests; at 60 seconds, the gain may not be large enough to be easily detectable.

BUT we don't have to abandon a transfer after a set number of seconds: we can, for example, switch it to turtle mode after a single block takes more than N seconds to transfer. If we set N to, say, 3, we should see no non-turtle transfers over 90 seconds (and most much less than that), a reduction in mean time on the order of 50%, and hopefully not too many requests turtled. Probably worth trying... it may make sense to do some load management changes first...

On Wednesday 14 January 2009 13:57, Matthew Toseland wrote:
> At the moment, having taken out the recent latency optimisation changes (that
> resulted in a massive cut in bandwidth usage), latency is way up:
> - Median CHK request time 11.2 seconds.
> - Mean 22-23 seconds.
> - 41-42% of requests take more than 15 seconds to complete.
> - 22-23% of requests take more than 30 seconds to complete.
> - 7-9% of requests take more than 60 seconds to complete.
>
> These figures are based on a sample of approx 11 hours overnight, after it
> became mandatory (may include some UOM), and a sample of half an hour around
> 12ish. The two agree very closely. TheSeeker's node shows a 13 second median
> and a 27 second mean.
> You can get similar results by setting the log level details to
> freenet.node.RequestSender:MINOR, then:
>
> Just follow the internally updated median/mean:
> $ tail --follow=name --retry fast/logs-dark/freenet-latest.log \
>   | grep "Successful CHK request took"
>
> Grep for individual timings:
> $ zgrep "Successful CHK request took" fast/logs-dark/freenet-1197-2009-01-14-0* \
>     fast/logs-dark/freenet-1197-2009-01-14-10-* \
>   | sed "s/^.*Successful CHK request took //" \
>   | sed "s/ average Median.*$//" > timings2.list
>
> Sort them and view them in less to get percentiles etc:
> $ cat timings2.list | sort -n | less
> (Use the -N option to show line numbers)
>
> Get the mean excluding outliers over some value:
> $ cat timings2.list | (total=0; count=0; \
>     while read x; do \
>       if test $x -gt 30000; then echo Over 30 seconds: $x; \
>       else count=$((count+1)); total=$((total+x)); fi; \
>     done; \
>     echo Total is $total count is $count average is $((total / count)))
>
> Yesterday (1196, transfer backoff and Saturday's throttling), these stats were
> a 4 second median and 8 second mean. The 90th percentile was 15-17 seconds
> yesterday and is 50-57 seconds today.
>
> However on Tuesday (1195, Saturday's throttling but not transfer backoff), it
> was more like a 3 second median and a mean fluctuating a lot due to some high
> values every now and then, around 13 seconds later on when there was more
> data. Of course there are time of day effects. :|
>
> The main result of yesterday's testing (transfer backoff on transfers taking
> more than 15 seconds) was that there was a vast amount of backoff, and even
> lower bandwidth usage than Tuesday, presumably because lots of nodes are
> affected by a single slow transfer. Users reported less than half their
> backoff was due to transfer backoff; otoh it was over half for me for a
> while, though it reduced as a proportion over the day.
> We could cut the average CHK request time significantly at the cost of a
> somewhat smaller proportion of requests failing at a given threshold and
> having to continue on the last hop only as a turtle-request; when the
> transfer completes, we would offer it to the nodes that have asked for it in
> the past.
>
> Cutoff   % reduction in mean request time   % of requests turtled
> 15s      74-76%                             41-42%
> 30s      56-59%                             22-23%
> 45s      42-48%                             12-14%
> 60s      31-39%                             7-9%
>
> Obviously whatever proportion of requests are turtled, the fproxy psuccess is
> likely to be reduced by that much. :| OTOH it shouldn't affect queued
> requests much.
>
> IMHO the system is over-optimised for throughput at the moment. The fact that
> the mean didn't decrease on Tuesday (although some users are seeing much
> higher figures than the above quotes, probably transient though) is probably
> due to outliers, perhaps related to the significant backoff resulting from the
> over-aggressive solutions I have tried so far. With Saturday's limiting
> turned off, the main limiter on the number of requests a node accepts is
> output bandwidth liability limiting, which works on the principle of assuming
> that every request in flight will succeed, and working out how many can be
> accepted if they must all complete in 90 seconds. We could probably reduce
> this to 60 without a significant adverse effect on bandwidth usage.
> Saturday's limiting works similarly but uses the average bytes used for a
> request, i.e. it takes the short-term psuccess into account. It has a much
> shorter threshold (5 seconds), and doesn't try to compensate for overheads.
> It might be interesting to reinstate this with a much higher threshold (20
> seconds??). Hopefully the combination would make the above table more
> attractive: if the last column's values could be halved, for example, without
> severely impacting on bandwidth usage, the combination would be very
> attractive.
> IMHO turtling support (or at least much stricter transfer
> timeouts) is necessary for reasons of attack resistance; and the current
> proposal (in a previous mail) incorporates the best part of Ian's transfer
> backoff without flooding the network with backoff.
>
> A last resort would be a bulk vs realtime flag on requests. Bulk requests
> could be handled separately from real-time requests. Real-time requests would
> have a higher transfer priority, but would be limited to some proportion of
> the overall bandwidth usage, would only tolerate fast transfers, and in
> future might be routed more quickly / queued for less time (and therefore to
> a less ideal target). Bulk requests would be optimised for psuccess primarily
> and then for throughput, tolerating reasonably long transfer times (but not
> the 48 minutes theoretically possible now!). This has been suggested in the
> past; obviously it costs us some request indistinguishability, but maybe the
> time for it is soon. Anyway, a proper proposal would need to be fleshed out.
> Arguably ULPRs obsolete bulk requests.
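PS: the corrected ("clamp, don't drop") computation behind the tables above can be sketched like this. Samples over the cutoff are replaced by the cutoff (the request still costs that long before turtling takes over) instead of being discarded. The five sample timings (in ms) are made up; the real input would be local-only-timings.list:

```shell
# Made-up timings in ms; substitute local-only-timings.list in practice.
printf '%s\n' 4000 9000 31000 65000 120000 > /tmp/timings.list

for cutoff in 30000 60000 90000 120000; do
  awk -v c="$cutoff" '
    { n++
      total += $1                      # uncorrected sum
      clamped += ($1 > c) ? c : $1    # clamp at the cutoff, do not drop
      if ($1 > c) turtled++ }         # this request would be turtled
    END {
      printf "%ss: mean reduction %.0f%%, turtled %.1f%%\n",
             c / 1000, 100 * (1 - clamped / total), 100 * turtled / n
    }' /tmp/timings.list
done | tee /tmp/turtle-table.txt
```

One line per cutoff comes out, in the same shape as the tables above.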